\input texinfo
@setfilename mbapi.info
@settitle Multibyte API
@setchapternewpage off

@c Open issues:
@c What's the best way to report errors? Should functions return a
@c magic value, according to C tradition, or should they signal a
@c Guile exception?
@c

@node Working With Multibyte Strings in C
@chapter Working With Multibyte Strings in C

Guile allows strings to contain characters drawn from a wide variety of
languages, including many Asian, Eastern European, and Middle Eastern
languages, in a uniform and unrestricted way. The string representation
normally used in C code --- an array of @sc{ASCII} characters --- is not
sufficient for Guile strings, since they may contain characters not
present in @sc{ASCII}.

Instead, Guile uses a very large character set, and encodes each
character as a sequence of one or more bytes. We call this
variable-width encoding a @dfn{multibyte} encoding. Guile uses this
single encoding internally for all strings, symbol names, error
messages, etc., and performs appropriate conversions upon input and
output.

The use of this variable-width encoding is almost invisible to Scheme
code. Strings are still indexed by character number, not by byte
offset; @code{string-length} still returns the length of a string in
characters, not in bytes. @code{string-ref} and @code{string-set!} are
no longer guaranteed to be constant-time operations, but Guile uses
various strategies to reduce the impact of this change.

However, the encoding is visible via Guile's C interface, which gives
the user direct access to a string's bytes. This chapter explains how
to work with Guile multibyte text in C code. Since variable-width
encodings are clumsier to work with than simple fixed-width encodings,
Guile provides a set of standard macros and functions for manipulating
multibyte text to make the job easier. Furthermore, Guile makes some
promises about the encoding which you can use in writing your own text
processing code.

While we discuss guaranteed properties of Guile's encoding, and provide
functions to operate on its character set, we do not actually specify
either the character set or encoding here. This is because we expect
both of them to change in the future: currently, Guile uses the same
encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
as well) to use Unicode and UTF-8, with some extensions. This will make
it more comfortable to use Guile with other systems which use UTF-8,
like the GTK user interface toolkit.

@menu
* Multibyte String Terminology::
* Promised Properties of the Guile Multibyte Encoding::
* Functions for Operating on Multibyte Text::
* Multibyte Text Processing Errors::
* Why Guile Does Not Use a Fixed-Width Encoding::
@end menu

@node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
@section Multibyte String Terminology

In the descriptions which follow, we make the following definitions:

@table @dfn
@item byte
A @dfn{byte} is a number between 0 and 255. It has no inherent textual
interpretation. So 65 is a byte, not a character.

@item character
A @dfn{character} is a unit of text. It has no inherent numeric value.
@samp{A} and @samp{.} are characters, not bytes. (This is different
from the C language's definition of @dfn{character}; in this chapter, we
will always use a phrase like ``the C language's @code{char} type'' when
that's what we mean.)

@item character set
A @dfn{character set} is an invertible mapping between numbers and a
given set of characters. @sc{ASCII} is a character set assigning
characters to the numbers 0 through 127. It maps @samp{A} onto the
number 65, and @samp{.} onto 46.

Note that a character set maps characters onto numbers, @emph{not
necessarily} onto bytes. For example, the Unicode character set maps
the Greek lower-case @samp{alpha} character onto the number 945, which
is not a byte.

(This is what Internet standards would call a ``coded character set''.)

@item encoding
An @dfn{encoding} maps numbers onto sequences of bytes. For example, the
UTF-8 encoding, defined in the Unicode Standard, would map the number
945 onto the sequence of bytes @samp{206 177}. When using the
@sc{ASCII} character set, every number assigned also happens to be a
byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.

(This is what Internet standards would call a ``character encoding
scheme''.)
@end table

Thus, to turn a character into a sequence of bytes, you need a character
set to assign a number to that character, and then an encoding to turn
that number into a sequence of bytes.

Likewise, to interpret a sequence of bytes as a sequence of characters,
you use an encoding to extract a sequence of numbers from the bytes, and
then a character set to turn the numbers into characters.

Errors can occur while carrying out either of these processes. For
example, under a particular encoding, a given string of bytes might not
correspond to any number: the byte sequence @samp{128 128}
is not a valid encoding of any number under UTF-8.
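
As an illustration of the two steps, here is a minimal sketch in C that
takes the number a character set assigns (945 for Greek lower-case
alpha) and applies the UTF-8 encoding to it. The name
@code{utf8_encode} is hypothetical and is not part of Guile; the sketch
uses UTF-8 only because the examples above do, and says nothing about
Guile's own internal encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Encoding step: write the UTF-8 form of NUMBER into OUT, which must
   have room for 4 bytes.  Return the length in bytes, or 0 if NUMBER
   is not encodable in UTF-8 (a surrogate, or above 0x10FFFF).
   This is an illustrative helper, not a Guile function.  */
static size_t
utf8_encode (unsigned long number, unsigned char *out)
{
  assert (out != NULL);

  if (number < 0x80)
    {
      /* ASCII: the number is its own one-byte encoding.  */
      out[0] = (unsigned char) number;
      return 1;
    }
  if (number < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (number >> 6));
      out[1] = (unsigned char) (0x80 | (number & 0x3F));
      return 2;
    }
  if (number < 0x10000)
    {
      if (number >= 0xD800 && number <= 0xDFFF)
        return 0;               /* surrogates are not characters */
      out[0] = (unsigned char) (0xE0 | (number >> 12));
      out[1] = (unsigned char) (0x80 | ((number >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (number & 0x3F));
      return 3;
    }
  if (number <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (number >> 18));
      out[1] = (unsigned char) (0x80 | ((number >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((number >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (number & 0x3F));
      return 4;
    }
  return 0;
}
```

Calling @code{utf8_encode (945, buf)} stores the bytes @samp{206 177}
in @code{buf} and returns 2, matching the example in the table above.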

Having carefully defined our terminology, we will now abuse it.

We will sometimes use the word @dfn{character} to refer to the number
assigned to a character by a character set, in contexts where it's
obvious we mean a number.

Sometimes there is a close association between a particular encoding and
a particular character set. Thus, we may sometimes refer to the
character set and encoding together as an @dfn{encoding}.

@node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
@section Promised Properties of the Guile Multibyte Encoding

Internally, Guile uses a single encoding for all text --- symbols,
strings, error messages, etc. Here we list a number of helpful
properties of Guile's encoding. It is correct to write code which
assumes these properties; code which uses these assumptions will be
portable to all future versions of Guile, as far as we know.

@b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
the obvious way.} This means that a standard C string containing only
@sc{ASCII} characters is a valid Guile string (except for the terminator;
Guile strings store the length explicitly, so they can contain null
characters).

@b{The encodings of non-@sc{ASCII} characters use only bytes between 128
and 255.} That is, when we turn a non-@sc{ASCII} character into a
series of bytes, none of those bytes can ever be mistaken for the
encoding of an @sc{ASCII} character. This means that you can search a
Guile string for an @sc{ASCII} character using the standard
@code{memchr} library function. By extension, you can search for an
@sc{ASCII} substring in a Guile string using a traditional substring
search algorithm --- you needn't add special checks to verify encoding
boundaries, etc.
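
The following sketch shows what this promise buys you: a plain
@code{memchr} call finds an @sc{ASCII} character even when the buffer
also holds multibyte characters. The helper @code{find_ascii} and the
sample bytes are hypothetical; the two bytes @samp{206 177} merely
stand in for some non-@sc{ASCII} character whose encoding uses only
bytes between 128 and 255.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the byte offset of the first occurrence of the ASCII
   character C in the LEN bytes at TEXT, or -1 if it does not occur.
   No boundary checks are needed: a byte below 128 can only be a
   complete ASCII character.  (Illustrative helper, not part of
   Guile.)  */
static long
find_ascii (const unsigned char *text, size_t len, char c)
{
  const unsigned char *hit;

  assert (text != NULL);
  assert ((unsigned char) c < 128);

  hit = memchr (text, c, len);
  return hit ? (long) (hit - text) : -1;
}

/* One two-byte character followed by the ASCII text " = 1".  */
static const unsigned char ascii_search_sample[] =
  { 206, 177, ' ', '=', ' ', '1' };
```

Here @code{find_ascii (ascii_search_sample, sizeof ascii_search_sample,
'=')} returns the byte offset 3. Note that this is a byte offset, not a
character index.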

@b{No character encoding is a subsequence of any other character
encoding.} (This is just a stronger version of the previous promise.)
This means that you can search for occurrences of one Guile string
within another Guile string just as if they were raw byte strings. You
can use the stock @code{memmem} function (provided on GNU systems, at
least) for such searches. If you don't need the ability to represent
null characters in your text, you can still use null-termination for
strings, and use the traditional string-handling functions like
@code{strlen}, @code{strstr}, and @code{strcat}.

@b{You can always determine the full length of a character's encoding
from its first byte.} Guile provides the macro @code{scm_mb_len} which
computes the encoding's length from its first byte. Given the first
rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
@var{b} <= 127}, returns 1.

@b{Given an arbitrary byte position in a Guile string, you can always
find the beginning and end of the character containing that byte without
scanning too far in either direction.} This means that, if you are sure
a byte sequence is a valid encoding of a character sequence, you can
find character boundaries without keeping track of the beginning and
ending of the overall string. This promise relies on the fact that, in
addition to storing the string's length explicitly, Guile always either
terminates the string's storage with a zero byte, or shares it with
another string which is terminated this way.
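
To make the last two promises concrete, here is a sketch of how they
play out for UTF-8, which has the same properties. This is an
assumption made for illustration only: Guile's actual encoding is
deliberately unspecified, and real code should use @code{scm_mb_len}
and the boundary functions described later. The names
@code{utf8_len_from_lead}, @code{utf8_boundary_p}, and
@code{utf8_floor} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* UTF-8 analogue of scm_mb_len: the length of a character's encoding,
   computed from its first byte alone.  Returns 0 for a byte that
   cannot start a character.  */
static int
utf8_len_from_lead (unsigned char b)
{
  if (b < 0x80)
    return 1;                   /* ASCII: always one byte */
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return 0;
}

/* In UTF-8, bytes 0x80..0xBF only ever appear after a lead byte, so
   any other byte marks the start of a character.  */
static int
utf8_boundary_p (const unsigned char *p)
{
  assert (p != NULL);
  return (*p & 0xC0) != 0x80;
}

/* Find the start of the character containing the byte at P.  P must
   point into valid UTF-8 text; the loop then steps back at most three
   bytes, without knowing where the string begins.  */
static const unsigned char *
utf8_floor (const unsigned char *p)
{
  while (!utf8_boundary_p (p))
    p--;
  return p;
}

/* 'a', a two-byte character, 'b', and a terminating zero byte.  */
static const unsigned char boundary_sample[] = { 'a', 206, 177, 'b', 0 };
```

With this sample, @code{utf8_floor (boundary_sample + 2)} returns
@code{boundary_sample + 1}, the lead byte of the two-byte character,
and @code{utf8_len_from_lead} applied to that byte returns 2.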

@node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
@section Functions for Operating on Multibyte Text

Guile provides a variety of functions, variables, and types for working
with multibyte text.

@menu
* Basic Multibyte Character Processing::
* Finding Character Encoding Boundaries::
* Multibyte String Functions::
* Exchanging Guile Text With the Outside World in C::
* Implementing Your Own Text Conversions::
@end menu

@node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
@subsection Basic Multibyte Character Processing

Here are the essential types and functions for working with Guile text.
Guile uses the C type @code{unsigned char *} to refer to text encoded
with Guile's encoding.

Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times.

@deftp {Libguile Type} scm_char_t
This is a signed integral type large enough to hold any character in
Guile's character set. All character numbers are non-negative.
@end deftp

@deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
Return the character whose encoding starts at @var{p}. If @var{p} does
not point at a valid character encoding, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
Place the encoded form of the Guile character @var{c} at @var{p}, and
return its length in bytes. If @var{c} is not a Guile character, the
behavior is undefined.
@end deftypefn

@deftypevr {Libguile Constant} int scm_mb_max_len
The maximum length of any character's encoding, in bytes. You may
assume this is relatively small --- less than a dozen or so.
@end deftypevr

@deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
If @var{b} is the first byte of a character's encoding, return the full
length of the character's encoding, in bytes. If @var{b} is not a valid
leading byte, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
Return the length of the encoding of the character @var{c}, in bytes.
If @var{c} is not a valid Guile character, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
@deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
@deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
@deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
These are functions identical to the corresponding macros. You can use
them in situations where the overhead of a function call is acceptable,
and the cleaner semantics of function application are desirable.
@end deftypefn

@node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
@subsection Finding Character Encoding Boundaries

These are functions for finding the boundaries between characters in
multibyte text.

Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times, unless the definition promises
otherwise.

@deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
Return non-zero iff @var{p} points to the start of a character in
multibyte text.

This macro will evaluate its argument only once.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
``Round'' @var{p} to the previous character boundary. That is, if
@var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding. If @var{p} points
to the start of the encoding of a Guile character, return @var{p}
unchanged.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
``Round'' @var{p} to the next character boundary. That is, if @var{p}
points to the middle of the encoding of a Guile character, return a
pointer to the first byte of the encoding of the next character. If
@var{p} points to the start of the encoding of a Guile character, return
@var{p} unchanged.
@end deftypefn

Note that it is usually not friendly for functions to silently correct
byte offsets that point into the middle of a character's encoding. Such
offsets almost always indicate a programming error, and they should be
reported as early as possible. So, when you write code which operates
on multibyte text, you should not use functions like these to ``clean
up'' byte offsets which the originator believes to be correct; instead,
your code should signal a @code{text:not-char-boundary} error as soon as
it detects an invalid offset. @xref{Multibyte Text Processing Errors}.

@node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
@subsection Multibyte String Functions

These functions allow you to operate on multibyte strings: sequences of
character encodings.

@deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
Return the number of Guile characters encoded by the @var{len} bytes at
@var{p}.

If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn

@deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
Return the character whose encoding starts at @code{*@var{pp}}, and
advance @code{*@var{pp}} to the start of the next character. Return -1
if @code{*@var{pp}} does not point to a valid character encoding.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
If @var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding. If @var{p} points
to the start of the encoding of a Guile character, return the start of
the previous character's encoding.

This is like @code{scm_mb_floor}, but the returned pointer will always
be before @var{p}. If you use this function to drive an iteration, it
guarantees backward progress.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
If @var{p} points to the encoding of a Guile character, return a pointer
to the first byte of the encoding of the next character.

This is like @code{scm_mb_ceiling}, but the returned pointer will always
be after @var{p}. If you use this function to drive an iteration, it
guarantees forward progress.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
Assuming that the @var{len} bytes starting at @var{p} are a
concatenation of valid character encodings, return a pointer to the
start of the @var{i}'th character encoding in the sequence.

This function scans the sequence from the beginning to find the
@var{i}'th character, and will generally require time proportional to
the distance from @var{p} to the returned address.

If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn

It is common to process the characters in a string from left to right.
However, if you fetch each character using @code{scm_mb_index}, each
call will scan the text from the beginning, so your loop will require
time proportional to at least the square of the length of the text. To
avoid this poor performance, you can use an @code{scm_mb_cache}
structure and the @code{scm_mb_index_cached} macro.

@deftp {Libguile Type} {struct scm_mb_cache}
This structure holds information that allows a string scanning operation
to use the results from a previous scan of the string. It has the
following members:

@table @code
@item character
An index, in characters, into the string.

@item byte
The index, in bytes, of the start of that character.
@end table

In other words, @code{byte} is the byte offset of the
@code{character}'th character of the string. Note that if @code{byte}
and @code{character} are equal, then all characters before that point
must have encodings exactly one byte long, and the string can be indexed
normally.

All elements of a @code{struct scm_mb_cache} structure should be
initialized to zero before its first use, and whenever the string's text
changes.
@end deftp

@deftypefn {Libguile Macro} {const unsigned char *} scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
@deftypefnx {Libguile Function} {const unsigned char *} scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
This macro and this function are identical to @code{scm_mb_index},
except that they may consult and update @code{*@var{cache}} in order to
avoid scanning the string from the beginning. @code{scm_mb_index_cached}
is a macro, so it may have less overhead than
@code{scm_mb_index_cached_func}, but it may evaluate its arguments more
than once.

Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
can scan a string from left to right, or from right to left, in time
proportional to the length of the string. As long as each character
fetched is less than some constant distance before or after the previous
character fetched with @var{cache}, each access will require constant
time.
@end deftypefn
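
The idea behind the cache can be sketched as follows. This is not
Guile's implementation: it assumes UTF-8 as a stand-in for Guile's
encoding, assumes zero-based character indices, assumes the text is
valid, and returns @code{NULL} for an out-of-range index instead of
signalling an error. The names @code{struct mb_cache},
@code{lead_len}, and @code{index_cached} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Same two members as described for struct scm_mb_cache.  */
struct mb_cache
{
  int character;                /* an index, in characters */
  int byte;                     /* byte offset of that character */
};

/* Length of a UTF-8 encoding, from its lead byte (valid text only).  */
static int
lead_len (unsigned char b)
{
  if (b < 0x80)
    return 1;
  if (b < 0xE0)
    return 2;
  if (b < 0xF0)
    return 3;
  return 4;
}

/* Return a pointer to character I (zero-based) in the LEN bytes at P,
   or NULL if I is out of range.  The scan resumes from the position
   remembered in CACHE, so nearby accesses cost a constant number of
   steps instead of a scan from the start of the text.  */
static const unsigned char *
index_cached (const unsigned char *p, int len, int i,
              struct mb_cache *cache)
{
  int ch, by;

  assert (p != NULL && cache != NULL && i >= 0);
  ch = cache->character;
  by = cache->byte;

  /* Walk forward, one whole character at a time.  */
  while (ch < i && by < len)
    {
      by += lead_len (p[by]);
      ch++;
    }

  /* Walk backward, skipping continuation bytes (0x80..0xBF).  */
  while (ch > i)
    {
      do
        by--;
      while ((p[by] & 0xC0) == 0x80);
      ch--;
    }

  if (ch != i || by >= len)
    return NULL;                /* ran off the end of the text */

  cache->character = ch;
  cache->byte = by;
  return p + by;
}

/* Characters at byte offsets 0, 1, 3, and 4.  */
static const unsigned char cache_sample[] = { 'a', 206, 177, 'b', 206, 179 };
```

Starting from a zeroed cache, fetching character 3 of
@code{cache_sample} walks forward to byte offset 4 and records that
position; a following request for character 2 then steps back a single
character rather than rescanning from offset 0.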

Guile also provides functions to convert between an encoded sequence of
characters, and an array of @code{scm_char_t} objects.

@deftypefn {Libguile Function} {scm_char_t *} scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
Convert the variable-width text in the @var{len} bytes at @var{p}
to an array of @code{scm_char_t} values. Return a pointer to the array,
and set @code{*@var{result_len}} to the number of elements it contains.
The returned array is allocated with @code{malloc}, and it is the
caller's responsibility to free it.

If the text is not a sequence of valid character encodings, this
function will signal a @code{text:bad-encoding} error.
@end deftypefn

@deftypefn {Libguile Function} {unsigned char *} scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
Convert the array of @var{len} @code{scm_char_t} values at @var{fixed}
to a sequence of variable-width character encodings. Return a pointer to
the array of bytes, and set @code{*@var{result_len}} to its length, in
bytes.

The returned byte sequence is terminated with a zero byte, which is not
counted in the length returned in @code{*@var{result_len}}.

The returned byte sequence is allocated with @code{malloc}; it is the
caller's responsibility to free it.

If @var{fixed} contains a value that is not a valid Guile character,
this function will signal a @code{text:bad-encoding} error.
@end deftypefn

@node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
@subsection Exchanging Guile Text With the Outside World in C

[[This is kind of a heavy-weight model, given that one end of the
conversion is always going to be the Guile encoding. Any way to shorten
things a bit?]]

Guile provides functions for converting between Guile's internal text
representation and encodings popular in the outside world. These
functions are closely modeled after the @code{iconv} functions available
on some systems.

To convert text between two encodings, you should first call
@code{scm_mb_iconv_open} to indicate the source and destination
encodings; this function returns a context object which records the
conversion to perform.

Then, you should call @code{scm_mb_iconv} to actually convert the text.
This function expects input and output buffers, and a pointer to the
context you got from @code{scm_mb_iconv_open}. You don't need to pass
all your input to @code{scm_mb_iconv} at once; you can invoke it on
successive blocks of input (as you read it from a file, say), and it
will convert as much as it can each time, indicating when you should
grow your output buffer.
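
Since these functions are modeled on @code{iconv}, the calling pattern
can be previewed with the POSIX @code{iconv} interface itself. The
sketch below uses only the real POSIX calls @code{iconv_open},
@code{iconv}, and @code{iconv_close}; it does not use the Guile
functions, and the wrapper name @code{latin1_to_utf8} is hypothetical.
It assumes a system whose @code{iconv} knows the names
@samp{ISO-8859-1} and @samp{UTF-8} (as the GNU C library does).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert the LEN bytes of ISO-8859-1 text at IN to UTF-8, storing the
   result in the OUT_SIZE bytes at OUT.  Return the number of bytes
   written, or (size_t) -1 if the conversion could not be opened or did
   not complete (for instance, because OUT is too small).  */
static size_t
latin1_to_utf8 (const char *in, size_t len, char *out, size_t out_size)
{
  iconv_t cd;
  char *inbuf = (char *) in;    /* POSIX iconv takes a plain char ** */
  char *outbuf = out;
  size_t inleft = len;
  size_t outleft = out_size;
  size_t rc;

  assert (in != NULL && out != NULL);

  /* Step 1: open a context.  Note the argument order: the destination
     encoding comes first, then the source encoding.  */
  cd = iconv_open ("UTF-8", "ISO-8859-1");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* Step 2: convert.  On return, inbuf and outbuf have advanced past
     the consumed input and the produced output, and inleft and outleft
     hold what remains of each buffer.  */
  rc = iconv (cd, &inbuf, &inleft, &outbuf, &outleft);

  /* Step 3: release the context.  */
  iconv_close (cd);

  if (rc == (size_t) -1)
    return (size_t) -1;
  return out_size - outleft;
}
```

A caller processing a file would keep the context open and call the
conversion function once per block, rather than opening and closing a
context for each call as this one-shot wrapper does.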

An encoding may be @dfn{stateless}, or @dfn{stateful}. In most
encodings, a contiguous group of bytes from the sequence completely
specifies a particular character; these are stateless encodings.
However, some encodings require you to look back an unbounded number of
bytes in the stream to assign a meaning to a particular byte sequence;
such encodings are stateful.

For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
byte sequence @samp{27 36 66} indicates that subsequent bytes should be
taken in pairs and interpreted as characters from the JIS X 0208
character set. An arbitrary number of byte pairs may follow this
sequence. The byte sequence @samp{27 40 66} indicates that subsequent
bytes should be interpreted as @sc{ASCII}. In this encoding, you cannot
tell whether a given byte is an @sc{ASCII} character without looking
back an arbitrary distance for the most recent escape sequence, so it is
a stateful encoding.
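
The need for state can be seen in a small scanner. The sketch below is
a deliberately simplified reading of @samp{ISO-2022-JP}: it recognizes
only the two escape sequences mentioned above, assumes an escape
sequence is never split across calls, and merely counts characters
instead of decoding them. All names in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum iso2022jp_state { STATE_ASCII, STATE_JIS0208 };

struct iso2022jp_counts
{
  size_t ascii;                 /* one-byte ASCII characters seen */
  size_t jis;                   /* two-byte JIS X 0208 characters seen */
};

/* Scan the LEN bytes at P, starting in *STATE, and count the
   characters found.  *STATE is updated, so a later call on the next
   block of the same stream interprets its bytes correctly.  */
static struct iso2022jp_counts
iso2022jp_count (const unsigned char *p, size_t len,
                 enum iso2022jp_state *state)
{
  struct iso2022jp_counts counts = { 0, 0 };
  size_t i = 0;

  assert (p != NULL && state != NULL);

  while (i < len)
    {
      if (i + 2 < len && p[i] == 27 && p[i + 1] == 36 && p[i + 2] == 66)
        {
          *state = STATE_JIS0208;       /* 27 36 66: switch to pairs */
          i += 3;
        }
      else if (i + 2 < len && p[i] == 27 && p[i + 1] == 40 && p[i + 2] == 66)
        {
          *state = STATE_ASCII;         /* 27 40 66: back to ASCII */
          i += 3;
        }
      else if (*state == STATE_JIS0208 && i + 1 < len)
        {
          counts.jis++;                 /* one character, two bytes */
          i += 2;
        }
      else if (*state == STATE_ASCII)
        {
          counts.ascii++;
          i++;
        }
      else
        break;                          /* incomplete pair at the end */
    }
  return counts;
}

/* 'A', switch to JIS X 0208, two byte pairs, switch to ASCII, 'B'.  */
static const unsigned char iso2022jp_sample[] =
  { 'A', 27, 36, 66, 36, 34, 36, 36, 27, 40, 66, 'B' };

/* The same two bytes as the first pair above, with no escape before.  */
static const unsigned char iso2022jp_plain[] = { 36, 34 };
```

The bytes @samp{36 34} count as one JIS X 0208 character inside
@code{iso2022jp_sample}, but as two @sc{ASCII} characters in
@code{iso2022jp_plain}. Nothing in the bytes themselves distinguishes
the two cases; only the state carried by the scanner does.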

In Guile, if a conversion involves a stateful encoding, the context
object carries any necessary state. Thus, you can have many independent
conversions to or from stateful encodings taking place simultaneously,
as long as each data stream uses its own context object for the
conversion.

@deftp {Libguile Type} {struct scm_mb_iconv}
This is the type for context objects, which represent the encodings and
current state of an ongoing text conversion. A @code{struct
scm_mb_iconv} records the source and destination encodings, and keeps
track of any information needed to handle stateful encodings.
@end deftp

@deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
Return a pointer to a new @code{struct scm_mb_iconv} context object,
ready to convert from the encoding named @var{fromcode} to the encoding
named @var{tocode}. For stateful encodings, the context object is in
some appropriate initial state, ready for use with the
@code{scm_mb_iconv} function.

When you are done using a context object, you may call
@code{scm_mb_iconv_close} to free it.

If either @var{tocode} or @var{fromcode} is not the name of a known
encoding, this function will signal the @code{text:unknown-conversion}
error, described below.

@c Try to use names here from the IANA list:
@c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Guile supports at least these encodings:

@table @samp
@item US-ASCII
@sc{US-ASCII}, in the standard one-character-per-byte encoding.

@item ISO-8859-1
The usual character set for Western European languages, in its usual
one-character-per-byte encoding.

@item Guile-MB
Guile's current internal multibyte encoding. The actual encoding this
name refers to will change from one version of Guile to the next. You
should use this when converting data between external sources and the
encoding used by Guile objects.

You should @emph{not} use this as the encoding for data presented to the
outside world, for two reasons. 1) Its meaning will change over time,
so data written using the @samp{Guile-MB} encoding with one version of
Guile might not be readable with the @samp{Guile-MB} encoding in another
version of Guile. 2) It currently corresponds to @samp{Emacs-Mule},
which was invented for Emacs's internal use, and was never intended to
serve as an exchange medium.

@item Guile-Wide
Guile's character set, as an array of @code{scm_char_t} values.

Note that this encoding is even less suitable for public use than
@samp{Guile-MB}, since the exact sequence of bytes depends heavily on
the size and endianness the host system uses for @code{scm_char_t}.
Using this encoding is very much like calling the
@code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
functions, except that @code{scm_mb_iconv} gives you more control over
buffer allocation and management.

@item Emacs-Mule
This is the variable-length encoding that GNU Emacs uses for
multilingual text, at least through version 20.4. You probably should
not use this encoding, as it is designed only for Emacs's internal use.
However, we provide it here because it's trivial to support, and some
people probably do have @samp{Emacs-Mule}-format files lying around.
@end table

(At the moment, this list doesn't include any character sets suitable for
external use that can actually handle multilingual data; this is
unfortunate, as it encourages users to write data in Emacs-Mule format,
which nobody but Emacs and Guile understands. We hope to add support
for Unicode in UTF-8 soon, which should solve this problem.)

Case is not significant in encoding names.

You can define your own conversions; see @ref{Implementing Your Own Text
Conversions}.
@end deftypefn

@deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
Return a non-zero value if Guile supports the encoding named @var{encoding}[[]]
@end deftypefn
  443. @deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
  444. Convert a sequence of characters from one encoding to another. The
  445. argument @var{context} specifies the encodings to use for the input and
  446. output, and carries state for stateful encodings; use
  447. @code{scm_mb_iconv_open} to create a @var{context} object for a
  448. particular conversion.
  449. Upon entry to the function, @code{*@var{inbuf}} should point to the
  450. input buffer, and @code{*@var{inbytesleft}} should hold the number of
  451. input bytes present in the buffer; @code{*@var{outbuf}} should point to
  452. the output buffer, and @code{*@var{outbytesleft}} should hold the number
  453. of bytes available to hold the conversion results in that buffer.
  454. Upon exit from the function, @code{*@var{inbuf}} points to the first
  455. unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
  456. of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
the last output byte, and @code{*@var{outbytesleft}} holds the number of
  458. bytes left unused in the output buffer.
  459. For stateful encodings, @var{context} carries encoding state from one
  460. call to @code{scm_mb_iconv} to the next. Thus, successive calls to
@code{scm_mb_iconv} which use the same context object can convert a
  462. stream of data one chunk at a time.
  463. If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
  464. taken as a request to reset the states of the input and the output
  465. encodings. If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
  466. non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
  467. buffer to put the output encoding in its initial state. If the output
  468. buffer is not large enough to hold this byte sequence,
  469. @code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves
  470. the shift states of @var{context}'s input and output encodings
  471. unchanged.
  472. The @code{scm_mb_iconv} function always consumes only complete
  473. characters or shift sequences from the input buffer, and the output
  474. buffer always contains a sequence of complete characters or escape
  475. sequences.
  476. If the input sequence contains characters which are not expressible in
the output encoding, @code{scm_mb_iconv} converts them in an
  478. implementation-defined way. It may simply delete the character.
  479. Some encodings use byte sequences which do not correspond to any textual
  480. character. For example, the escape sequence of a stateful encoding has
  481. no textual meaning. When converting from such an encoding, a call to
  482. @code{scm_mb_iconv} might consume input but produce no output, since the
  483. input sequence might contain only escape sequences.
  484. Normally, @code{scm_mb_iconv} returns the number of input characters it
could not convert perfectly to the output encoding. However, it may
  486. return one of the @code{scm_mb_iconv_} codes described below, to
  487. indicate an error. All of these codes are negative values.
  488. If the input sequence contains an invalid character encoding, conversion
  489. stops before the invalid input character, and @code{scm_mb_iconv}
  490. returns the constant value @code{scm_mb_iconv_bad_encoding}.
  491. If the input sequence ends with an incomplete character encoding,
  492. @code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
  493. return the constant value @code{scm_mb_iconv_incomplete_encoding}. This
  494. is not necessarily an error, if you expect to call @code{scm_mb_iconv}
  495. again with more data which might contain the rest of the encoding
  496. fragment.
  497. If the output buffer does not contain enough room to hold the converted
  498. form of the complete input text, @code{scm_mb_iconv} converts as much as
  499. it can, changes the input and output pointers to reflect the amount of
  500. text successfully converted, and then returns
  501. @code{scm_mb_iconv_too_big}.
  502. @end deftypefn
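The pointer-updating contract and the @code{scm_mb_iconv_too_big} status
described above suggest a simple draining loop. The following is a
minimal, self-contained sketch, not Guile code: @code{toy_iconv} is a
hypothetical stand-in that converts Latin-1 to UTF-8 under the same
calling convention, and @code{TOY_TOO_BIG} is an assumed value standing
in for @code{scm_mb_iconv_too_big} (the text only promises that the real
codes are negative).

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed value; the manual only says the status codes are negative.  */
enum { TOY_TOO_BIG = -1 };

/* Hypothetical stand-in for scm_mb_iconv: Latin-1 in, UTF-8 out.
   Updates all four arguments the way the manual describes and never
   writes a partial character.  Returns 0 or TOY_TOO_BIG.  */
long
toy_iconv (const char **inbuf, size_t *inbytesleft,
           char **outbuf, size_t *outbytesleft)
{
  while (*inbytesleft > 0)
    {
      unsigned char c = (unsigned char) **inbuf;
      size_t need = c < 0x80 ? 1 : 2;

      if (*outbytesleft < need)
        return TOY_TOO_BIG;     /* stop before the character */

      if (need == 1)
        *(*outbuf)++ = (char) c;
      else
        {
          *(*outbuf)++ = (char) (0xC0 | (c >> 6));
          *(*outbuf)++ = (char) (0x80 | (c & 0x3F));
        }
      *outbytesleft -= need;
      (*inbuf)++;
      (*inbytesleft)--;
    }
  return 0;
}

/* Convert IN_LEN bytes at IN into RESULT (capacity RESULT_CAP),
   passing the data through a deliberately tiny scratch buffer so the
   too-big path is exercised.  Returns the number of bytes stored, or
   (size_t) -1 if RESULT is too small.  */
size_t
convert_all (const char *in, size_t in_len, char *result, size_t result_cap)
{
  char scratch[4];
  size_t total = 0;

  assert (in != NULL && result != NULL);
  for (;;)
    {
      char *out = scratch;
      size_t out_left = sizeof scratch;
      long status = toy_iconv (&in, &in_len, &out, &out_left);
      size_t produced = sizeof scratch - out_left;

      if (produced > result_cap - total)
        return (size_t) -1;
      memcpy (result + total, scratch, produced);
      total += produced;

      if (status != TOY_TOO_BIG)
        return total;           /* all input consumed */
    }
}
```
@end verbatim

Because the converter stops before any character that does not fit, each
pass through the loop hands over only whole characters, and the caller
never has to repair a split sequence.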
  503. Here are the status codes that might be returned by @code{scm_mb_iconv}.
  504. They are all negative integers.
  505. @table @code
  506. @item scm_mb_iconv_too_big
  507. The conversion needs more room in the output buffer. Some characters
  508. may have been consumed from the input buffer, and some characters may
  509. have been placed in the available space in the output buffer.
  510. @item scm_mb_iconv_bad_encoding
  511. @code{scm_mb_iconv} encountered an invalid character encoding in the
  512. input buffer. Conversion stopped before the invalid character, so there
  513. may be some characters consumed from the input buffer, and some
  514. converted text in the output buffer.
  515. @item scm_mb_iconv_incomplete_encoding
  516. The input buffer ends with an incomplete character encoding. The
  517. incomplete encoding is left in the input buffer, unconsumed. This is
  518. not necessarily an error, if you expect to call @code{scm_mb_iconv}
  519. again with more data which might contain the rest of the incomplete
  520. encoding.
  521. @end table
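The @code{scm_mb_iconv_incomplete_encoding} status is what makes chunked
input workable: the caller keeps the unconsumed tail and presents it
again in front of the next chunk. A minimal sketch of that bookkeeping
follows. It is not Guile code: @code{toy_decode} is a hypothetical
stand-in that decodes two-byte big-endian units, and
@code{TOY_INCOMPLETE} is an assumed negative value standing in for the
real status code.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed value; the manual only says the status codes are negative.  */
enum { TOY_INCOMPLETE = -3 };

/* Hypothetical stand-in for the input side of scm_mb_iconv: decode
   two-byte big-endian units into OUT, appending at index *NOUT.  A
   dangling final byte is left unconsumed.  */
long
toy_decode (const char **inbuf, size_t *inbytesleft,
            unsigned *out, size_t *nout)
{
  while (*inbytesleft >= 2)
    {
      const unsigned char *p = (const unsigned char *) *inbuf;

      out[(*nout)++] = ((unsigned) p[0] << 8) | p[1];
      *inbuf += 2;
      *inbytesleft -= 2;
    }
  return *inbytesleft > 0 ? TOY_INCOMPLETE : 0;
}

/* Bytes left over from the previous chunk.  */
struct toy_stream
{
  char pending[2];
  size_t npending;
};

/* Feed one chunk (at most 32 bytes in this sketch).  Decoded values
   go to OUT, which must have room for (LEN + 1) / 2 entries; the
   return value is how many were produced.  */
size_t
toy_feed (struct toy_stream *st, const char *chunk, size_t len,
          unsigned *out)
{
  char buf[34];
  const char *in = buf;
  size_t in_left = st->npending + len;
  size_t nout = 0;

  assert (len <= 32);
  memcpy (buf, st->pending, st->npending);
  memcpy (buf + st->npending, chunk, len);

  if (toy_decode (&in, &in_left, out, &nout) == TOY_INCOMPLETE)
    memcpy (st->pending, in, in_left);  /* keep the fragment */
  st->npending = in_left;
  return nout;
}
```
@end verbatim

If @code{npending} is still non-zero once the input is exhausted, the
stream really did end in the middle of a character, and the caller can
report that as an error.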
  522. Finally, Guile provides a function for destroying conversion contexts.
  523. @deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
  524. Deallocate the conversion context object @var{context}, and all other
  525. resources allocated by the call to @code{scm_mb_iconv_open} which
  526. returned @var{context}.
  527. @end deftypefn
  528. @node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
  529. @subsection Implementing Your Own Text Conversions
  530. [[note that conversions to and from Guile must produce streams
  531. containing only valid character encodings, or else Guile will crash]]
  532. This section describes the interface for adding your own encoding
  533. conversions for use with @code{scm_mb_iconv}. The interface here is
  534. borrowed from the GNOME Project's @file{libunicode} library.
  535. Guile's @code{scm_mb_iconv} function works by converting the input text
  536. to a stream of @code{scm_char_t} characters, and then converting
  537. those characters to the desired output encoding. This makes it easy
  538. for Guile to choose the appropriate conversion back ends for an
  539. arbitrary pair of input and output encodings, but it also means that the
  540. accuracy and quality of the conversions depends on the fidelity of
  541. Guile's internal character set to the source and destination encodings.
  542. Since @code{scm_mb_iconv} will be used almost exclusively for converting
  543. to and from Guile's internal character set, this shouldn't be a problem.
  544. To add support for a particular encoding to Guile, you must provide one
  545. function (called the @dfn{read} function) which converts from your
  546. encoding to an array of @code{scm_char_t}'s, and another function
  547. (called the @dfn{write} function) to convert from an array of
  548. @code{scm_char_t}'s back into your encoding. To convert from some
  549. encoding @var{a} to some other encoding @var{b}, Guile pairs up
  550. @var{a}'s read function with @var{b}'s write function. Each call to
  551. @code{scm_mb_iconv} passes text in encoding @var{a} through the read
  552. function, to produce an array of @code{scm_char_t}'s, and then passes
  553. that array to the write function, to produce text in encoding @var{b}.
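The pairing of one encoding's read function with another's write
function can be sketched in a few lines. This is only an illustration
under simplifying assumptions: @code{toy_char_t} stands in for
@code{scm_char_t}, the function signatures are whole-buffer
simplifications rather than the real ones documented below, and
@code{toy_codec}, @code{toy_convert} and the two sample encodings are
hypothetical names.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_char_t;    /* stand-in for scm_char_t */

#define TOY_PIVOT_MAX 64

/* Simplified whole-buffer signatures, unlike the real read and write
   functions, which update buffer pointers incrementally.  */
struct toy_codec
{
  /* Decode LEN bytes at IN; return the number of characters stored.  */
  size_t (*read) (const unsigned char *in, size_t len, toy_char_t *out);
  /* Encode N characters at IN; return the number of bytes stored.  */
  size_t (*write) (const toy_char_t *in, size_t n, unsigned char *out);
};

static size_t
latin1_read (const unsigned char *in, size_t len, toy_char_t *out)
{
  size_t i;
  for (i = 0; i < len; i++)
    out[i] = in[i];
  return len;
}

static size_t
latin1_write (const toy_char_t *in, size_t n, unsigned char *out)
{
  size_t i;
  for (i = 0; i < n; i++)       /* '?' replaces inexpressible characters */
    out[i] = in[i] <= 0xFF ? (unsigned char) in[i] : '?';
  return n;
}

static size_t
ucs2be_read (const unsigned char *in, size_t len, toy_char_t *out)
{
  size_t i;
  for (i = 0; i + 1 < len; i += 2)
    out[i / 2] = ((toy_char_t) in[i] << 8) | in[i + 1];
  return len / 2;
}

static size_t
ucs2be_write (const toy_char_t *in, size_t n, unsigned char *out)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      toy_char_t c = in[i] <= 0xFFFF ? in[i] : 0xFFFD;
      out[2 * i] = (unsigned char) (c >> 8);
      out[2 * i + 1] = (unsigned char) (c & 0xFF);
    }
  return 2 * n;
}

const struct toy_codec toy_latin1 = { latin1_read, latin1_write };
const struct toy_codec toy_ucs2be = { ucs2be_read, ucs2be_write };

/* Pair FROM's read function with TO's write function, passing the
   text through an intermediate array of characters.  */
size_t
toy_convert (const struct toy_codec *from, const struct toy_codec *to,
             const unsigned char *in, size_t len, unsigned char *out)
{
  toy_char_t pivot[TOY_PIVOT_MAX];
  size_t nchars;

  assert (len <= TOY_PIVOT_MAX);
  nchars = from->read (in, len, pivot);
  return to->write (pivot, nchars, out);
}
```
@end verbatim

With @var{n} encodings this design needs @var{n} read functions and
@var{n} write functions rather than a converter for every ordered pair.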
  554. For stateful encodings, a read or write function can hang its own data
  555. structures off the conversion object, and provide its own functions to
  556. allocate and destroy them; this allows read and write functions to
  557. maintain whatever state they like.
  558. The Guile conversion back end represents each available encoding with a
  559. @code{struct scm_mb_encoding} object.
  560. @deftp {Libguile Type} {struct scm_mb_encoding}
  561. This data structure describes an encoding. It has the following
  562. members:
  563. @table @code
  564. @item char **names
  565. An array of strings, giving the various names for this encoding. The
  566. array should be terminated by a zero pointer. Case is not significant
  567. in encoding names.
  568. The @code{scm_mb_iconv_open} function searches the list of registered
  569. encodings for an encoding whose @code{names} array matches its
  570. @var{tocode} or @var{fromcode} argument.
  571. @item int (*init) (void **@var{cookie})
  572. An initialization function for the encoding's private data.
  573. @code{scm_mb_iconv_open} will call this function, passing it the address
  574. of the cookie for this encoding in this context. (We explain cookies
  575. below.) There is no way for the @code{init} function to tell whether
  576. the encoding will be used for reading or writing.
  577. Note that @code{init} receives a @emph{pointer} to the cookie, not the
  578. cookie itself. Because the type of @var{cookie} is @code{void **}, the
  579. C compiler will not check it as carefully as it would other types.
  580. The @code{init} member may be zero, indicating that no initialization is
  581. necessary for this encoding.
  582. @item int (*destroy) (void **@var{cookie})
  583. A deallocation function for the encoding's private data.
  584. @code{scm_mb_iconv_close} calls this function, passing it the address of
  585. the cookie for this encoding in this context. The @code{destroy}
  586. function should free any data the @code{init} function allocated.
  587. Note that @code{destroy} receives a @emph{pointer} to the cookie, not the
  588. cookie itself. Because the type of @var{cookie} is @code{void **}, the
  589. C compiler will not check it as carefully as it would other types.
  590. The @code{destroy} member may be zero, indicating that this encoding
  591. doesn't need to perform any special action to destroy its local data.
  592. @item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
  593. Put the encoding into its initial shift state. Guile calls this
  594. function whether the encoding is being used for input or output, so this
  595. should take appropriate steps for both directions. If @var{outbuf} and
  596. @var{outbytesleft} are valid, the reset function should emit an escape
  597. sequence to reset the output stream to its initial state; @var{outbuf}
  598. and @var{outbytesleft} should be handled just as for
  599. @code{scm_mb_iconv}.
  600. This function can return an @code{scm_mb_iconv_} error code
  601. (@pxref{Exchanging Guile Text With the Outside World in C}). If it
  602. returns @code{scm_mb_iconv_too_big}, then the output buffer's shift
  603. state must be left unchanged.
  604. Note that @code{reset} receives the cookie's value itself, not a pointer
  605. to the cookie, as the @code{init} and @code{destroy} functions do.
  606. The @code{reset} member may be zero, indicating that this encoding
  607. doesn't use a shift state.
  608. @item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
  609. Read some bytes and convert into an array of Guile characters. This is
  610. the encoding's read function.
  611. On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to
  612. be converted, and *@var{outcharsleft} characters available at
  613. *@var{outbuf} to hold the results.
  614. On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes
  615. still not consumed. *@var{outcharsleft} and *@var{outbuf} indicate the
  616. output buffer space still not filled. (By exclusion, these indicate
  617. which input bytes were consumed, and which output characters were
  618. produced.)
  619. Return one of the @code{enum scm_mb_read_result} values, described below.
  620. Note that @code{read} receives the cookie's value itself, not a pointer
  621. to the cookie, as the @code{init} and @code{destroy} functions do.
@item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
  623. Convert an array of Guile characters to output bytes. This is
  624. the encoding's write function.
  625. On entry, there are *@var{incharsleft} Guile characters available at
  626. *@var{inbuf}, and *@var{outbytesleft} bytes available to store output at
  627. *@var{outbuf}.
  628. On exit, *@var{incharsleft} and *@var{inbuf} indicate the number of
  629. Guile characters left unconverted (because there was insufficient room
  630. in the output buffer to hold their converted forms), and
  631. *@var{outbytesleft} and *@var{outbuf} indicate the unused portion of the
  632. output buffer.
Return one of the @code{enum scm_mb_write_result} values, described below.
  634. Note that @code{write} receives the cookie's value itself, not a pointer
  635. to the cookie, as the @code{init} and @code{destroy} functions do.
  636. @item struct scm_mb_encoding *next
  637. This is used by Guile to maintain a linked list of encodings. It is
  638. filled in when you call @code{scm_mb_register_encoding} to add your
  639. encoding to the list.
  640. @end table
  641. @end deftp
  642. Here is the enumerated type for the values an encoding's read function
  643. can return:
  644. @deftp {Libguile Type} {enum scm_mb_read_result}
  645. This type represents the result of a call to an encoding's read
  646. function. It has the following values:
  647. @table @code
  648. @item scm_mb_read_ok
  649. The read function consumed at least one byte of input.
  650. @item scm_mb_read_incomplete
  651. The data present in the input buffer does not contain a complete
  652. character encoding. No input was consumed, and no characters were
  653. produced as output. This is not necessarily an error status, if there
  654. is more data to pass through.
  655. @item scm_mb_read_error
  656. The input contains an invalid character encoding.
  657. @end table
  658. @end deftp
  659. Here is the enumerated type for the values an encoding's write function
  660. can return:
  661. @deftp {Libguile Type} {enum scm_mb_write_result}
  662. This type represents the result of a call to an encoding's write
  663. function. It has the following values:
  664. @table @code
  665. @item scm_mb_write_ok
  666. The write function was able to convert all the characters in @var{inbuf}
  667. successfully.
  668. @item scm_mb_write_too_big
  669. The write function filled the output buffer, but there are still
  670. characters in @var{inbuf} left unconsumed; @var{inbuf} and
  671. @var{incharsleft} indicate the unconsumed portion of the input buffer.
  672. @end table
  673. @end deftp
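To make the read and write contracts concrete, here is a sketch of both
functions for a simple two-byte big-endian encoding. It is an
illustration only. The @code{toy_} types are local stand-ins declared
in the example itself, the numeric values of the enumerators are
arbitrary, and the sketch assumes that a character is a Unicode code
point, which this manual does not promise for @code{scm_char_t}.
Returning the error status after some input has already been consumed is
also an assumption, modelled on the behaviour described for
@code{scm_mb_iconv}.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_char_t;    /* stand-in for scm_char_t */

enum toy_read_result { toy_read_ok, toy_read_incomplete, toy_read_error };
enum toy_write_result { toy_write_ok, toy_write_too_big };

/* Read function: two big-endian bytes per character.  Surrogate code
   units are treated as invalid.  Assumes the caller supplies room for
   at least one output character.  */
enum toy_read_result
ucs2be_read (void *cookie, const char **inbuf, size_t *inbytesleft,
             toy_char_t **outbuf, size_t *outcharsleft)
{
  size_t consumed = 0;

  (void) cookie;                /* stateless: the cookie is unused */
  assert (*outcharsleft > 0);
  while (*inbytesleft >= 2 && *outcharsleft > 0)
    {
      const unsigned char *p = (const unsigned char *) *inbuf;
      toy_char_t c = ((toy_char_t) p[0] << 8) | p[1];

      if (c >= 0xD800 && c <= 0xDFFF)
        return toy_read_error;  /* pointers stay before the bad unit */

      *(*outbuf)++ = c;
      (*outcharsleft)--;
      *inbuf += 2;
      *inbytesleft -= 2;
      consumed += 2;
    }
  /* Nothing consumed means only a fragment (or nothing) was present.  */
  return consumed > 0 ? toy_read_ok : toy_read_incomplete;
}

/* Write function: characters above 0xFFFF become U+FFFD, one possible
   implementation-defined treatment of inexpressible characters.  */
enum toy_write_result
ucs2be_write (void *cookie, toy_char_t **inbuf, size_t *incharsleft,
              char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  while (*incharsleft > 0)
    {
      toy_char_t c = **inbuf;

      if (*outbytesleft < 2)
        return toy_write_too_big;   /* never emit half a character */
      if (c > 0xFFFF)
        c = 0xFFFD;
      *(*outbuf)++ = (char) (unsigned char) (c >> 8);
      *(*outbuf)++ = (char) (unsigned char) (c & 0xFF);
      *outbytesleft -= 2;
      (*inbuf)++;
      (*incharsleft)--;
    }
  return toy_write_ok;
}
```
@end verbatim

Both functions advance the buffer pointers and decrement the counts in
step, so the caller can tell from the arguments alone how much input was
used and how much output was produced.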
  674. Conversions to or from stateful encodings need to keep track of each
  675. encoding's current state. Each conversion context contains two
  676. @code{void *} variables called @dfn{cookies}, one for the input
  677. encoding, and one for the output encoding. These cookies are passed to
  678. the encodings' functions, for them to use however they please. A
  679. stateful encoding can use its cookie to hold a pointer to some object
  680. which maintains the context's current shift state. Stateless encodings
  681. will probably not use their cookies.
  682. The cookies' lifetime is the same as that of the context object. When
  683. the user calls @code{scm_mb_iconv_close} to destroy a context object,
  684. @code{scm_mb_iconv_close} calls the input and output encodings'
  685. @code{destroy} functions, passing them their respective cookies, so each
  686. encoding can free any data it allocated for that context.
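The cookie life cycle can be illustrated with a toy stateful encoding
that tracks a single shift flag. This is a sketch under stated
assumptions, not Guile code: the manual does not say what @code{init}
and @code{destroy} return, so zero for success is assumed here;
@code{TOY_TOO_BIG} is an assumed stand-in for
@code{scm_mb_iconv_too_big}; and the shift-in byte @code{0x0F} is just
an example escape sequence.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed value; the manual only says the status codes are negative.  */
enum { TOY_TOO_BIG = -1 };

/* Private per-context state, reached through the cookie.  */
struct shift_state
{
  int shifted;                  /* non-zero while in the alternate set */
};

/* init receives the ADDRESS of the cookie and fills it in.  */
int
toy_init (void **cookie)
{
  struct shift_state *s = calloc (1, sizeof *s);

  if (s == NULL)
    return -1;                  /* assumed failure convention */
  *cookie = s;
  return 0;
}

/* destroy also receives the address of the cookie.  */
int
toy_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* reset receives the cookie's VALUE.  If an output buffer is supplied
   and the stream is shifted, emit the byte that returns it to the
   initial state; on too-big, leave the shift state untouched.  */
int
toy_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  struct shift_state *s = cookie;

  assert (s != NULL);
  if (s->shifted
      && outbuf != NULL && *outbuf != NULL && outbytesleft != NULL)
    {
      if (*outbytesleft < 1)
        return TOY_TOO_BIG;
      *(*outbuf)++ = 0x0F;      /* example shift-in byte */
      (*outbytesleft)--;
    }
  s->shifted = 0;
  return 0;
}
```
@end verbatim

Note the asymmetry the text describes: @code{toy_init} and
@code{toy_destroy} take a @code{void **} so they can store or clear the
pointer, while @code{toy_reset} takes the @code{void *} itself.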
  687. Note that, if a read or write function returns a successful result code
  688. like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining
input and the output must together represent the complete
  690. input text; the encoding may not store any text temporarily in its
  691. cookie. This is because, if @code{scm_mb_iconv} returns a successful
  692. result to the user, it is correct for the user to assume that all the
  693. consumed input has been converted and placed in the output buffer.
  694. There is no ``flush'' operation to push any final results out of the
  695. encodings' buffers.
  696. Here is the function you call to register a new encoding with the
  697. conversion system:
  698. @deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
  699. Add the encoding described by @code{*@var{encoding}} to the set
  700. understood by @code{scm_mb_iconv_open}. Once you have registered your
  701. encoding, you can use it by calling @code{scm_mb_iconv_open} with one of
  702. the names in @code{@var{encoding}->names}.
  703. @end deftypefn
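The registration and lookup behaviour described in this section (a
linked list threaded through @code{next}, a zero-terminated
@code{names} array, and case-insensitive matching) can be sketched as
follows. The @code{toy_} names are hypothetical, the structure carries
only the two members needed for the illustration, and nothing here is
claimed to match how Guile actually stores its encodings.

@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Only the members needed for lookup; the documented structure has
   more.  */
struct toy_encoding
{
  char **names;                 /* zero-terminated array of names */
  struct toy_encoding *next;    /* filled in by registration */
};

static struct toy_encoding *toy_encodings = NULL;

/* Push ENCODING onto the front of the list of known encodings.  */
void
toy_register_encoding (struct toy_encoding *encoding)
{
  assert (encoding != NULL && encoding->names != NULL);
  encoding->next = toy_encodings;
  toy_encodings = encoding;
}

/* Compare two names, ignoring case.  */
static int
names_equal (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  return *a == *b;              /* both must end together */
}

/* Return the registered encoding known by NAME, or NULL.  */
struct toy_encoding *
toy_find_encoding (const char *name)
{
  struct toy_encoding *e;

  for (e = toy_encodings; e != NULL; e = e->next)
    {
      char **n;

      for (n = e->names; *n != NULL; n++)
        if (names_equal (*n, name))
          return e;
    }
  return NULL;
}
```
@end verbatim

Because registration writes to the @code{next} member, the structure
passed in must stay allocated for as long as the list is in use; a
statically allocated object is the simplest way to guarantee that.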
  704. @node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
  705. @section Multibyte Text Processing Errors
  706. This section describes error conditions which code can signal to
  707. indicate problems encountered while processing multibyte text. In each
  708. case, the arguments @var{message} and @var{args} are an error format
  709. string and arguments to be substituted into the string, as accepted by
  710. the @code{display-error} function.
  711. @deffn Condition text:not-char-boundary func message args object offset
  712. By calling @var{func}, the program attempted to access a character at
  713. byte offset @var{offset} in the Guile object @var{object}, but
  714. @var{offset} is not the start of a character's encoding in @var{object}.
  715. Typically, @var{object} is a string or symbol. If the function signalling
  716. the error cannot find the Guile object that contains the text it is
  717. inspecting, it should use @code{#f} for @var{object}.
  718. @end deffn
  719. @deffn Condition text:bad-encoding func message args object
  720. By calling @var{func}, the program attempted to interpret the text in
  721. @var{object}, but @var{object} contains a byte sequence which is not a
  722. valid encoding for any character.
  723. @end deffn
  724. @deffn Condition text:not-guile-char func message args number
  725. By calling @var{func}, the program attempted to treat @var{number} as the
  726. number of a character in the Guile character set, but @var{number} does
  727. not correspond to any character in the Guile character set.
  728. @end deffn
  729. @deffn Condition text:unknown-conversion func message args from to
  730. By calling @var{func}, the program attempted to convert from an encoding
  731. named @var{from} to an encoding named @var{to}, but Guile does not
  732. support such a conversion.
  733. @end deffn
  734. @deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
  735. @deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
  736. @deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
These variables hold the Scheme symbol objects whose names are the
  738. condition symbols above. You can use these when signalling these
  739. errors, instead of looking them up yourself.
  740. @end deftypevr
  741. @node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C
  742. @section Why Guile Does Not Use a Fixed-Width Encoding
  743. Multibyte encodings are clumsier to work with than encodings which use a
  744. fixed number of bytes for every character. For example, using a
  745. fixed-width encoding, we can extract the @var{i}th character of a string
  746. in constant time, and we can always substitute the @var{i}th character
  747. of a string with any other character without reallocating or copying the
  748. string.
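The difference in access cost can be seen in a small sketch. It uses
UTF-8 purely as a familiar example of a variable-width encoding; it is
not Guile's internal encoding, and the function names are hypothetical.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed width: the I'th character is a single array access.  */
uint32_t
fixed_ref (const uint32_t *s, size_t i)
{
  return s[i];
}

/* Variable width: walk over I characters from the start.  S must be
   valid, NUL-terminated UTF-8.  Returns a pointer to the first byte
   of the I'th character, or to the terminator if the string is
   shorter than that.  */
const unsigned char *
utf8_ref (const unsigned char *s, size_t i)
{
  assert (s != NULL);
  while (i > 0 && *s != '\0')
    {
      s++;                              /* skip the lead byte */
      while ((*s & 0xC0) == 0x80)       /* skip continuation bytes */
        s++;
      i--;
    }
  return s;
}
```
@end verbatim

The first function takes the same time for any @var{i}; the second
takes time proportional to @var{i}.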
  749. However, there are no fixed-width encodings which include the characters
  750. we wish to include, and also fit in a reasonable amount of space.
  751. Despite the Unicode standard's claims to the contrary, Unicode is not
  752. really a fixed-width encoding. Unicode uses surrogate pairs to
  753. represent characters outside the 16-bit range; a surrogate pair must be
  754. treated as a single character, but occupies two 16-bit spaces. As of
  755. this writing, there are already plans to assign characters to the
  756. surrogate character codes. Three- and four-byte encodings are
  757. too wasteful for a majority of Guile's users, who only need @sc{ASCII}
  758. and a few accented characters.
  759. Another alternative would be to have several different fixed-width
  760. string representations, each with a different element size. For each
  761. string, Guile would use the smallest element size capable of
accommodating the string's text. This would allow users of English and
  763. the Western European languages to use the traditional memory-efficient
  764. encodings. However, if Guile has @var{n} string representations, then
  765. users must write @var{n} versions of any code which manipulates text
  766. directly --- one for each element size. And if a user wants to operate
  767. on two strings simultaneously, and wants to avoid testing the string
  768. sizes within the loop, she must make @var{n}*@var{n} copies of the loop.
  769. Most users will simply not bother. Instead, they will write code which
  770. supports only one string size, leaving us back where we started. By
  771. using a single internal representation, Guile makes it easier for users
  772. to write multilingual code.
  773. [[What about tagging each string with its encoding?
  774. "Every extension must be written to deal with every encoding"]]
  775. [[You don't really want to index strings anyway.]]
  776. Finally, Guile's multibyte encoding is not so bad. Unlike a two- or
  777. four-byte encoding, it is efficient in space for American and European
  778. users. Furthermore, the properties described above mean that many
  779. functions can be coded just as they would for a single-byte encoding;
  780. see @ref{Promised Properties of the Guile Multibyte Encoding}.
  781. @bye