- \input texinfo
- @setfilename mbapi.info
- @settitle Multibyte API
- @setchapternewpage off
- @c Open issues:
- @c What's the best way to report errors? Should functions return a
- @c magic value, according to C tradition, or should they signal a
- @c Guile exception?
- @c
- @node Working With Multibyte Strings in C
- @chapter Working With Multibyte Strings in C
- Guile allows strings to contain characters drawn from a wide variety of
- languages, including many Asian, Eastern European, and Middle Eastern
- languages, in a uniform and unrestricted way. The string representation
- normally used in C code --- an array of @sc{ASCII} characters --- is not
- sufficient for Guile strings, since they may contain characters not
- present in @sc{ASCII}.
- Instead, Guile uses a very large character set, and encodes each
- character as a sequence of one or more bytes. We call this
- variable-width encoding a @dfn{multibyte} encoding. Guile uses this
- single encoding internally for all strings, symbol names, error
- messages, etc., and performs appropriate conversions upon input and
- output.
- The use of this variable-width encoding is almost invisible to Scheme
- code. Strings are still indexed by character number, not by byte
- offset; @code{string-length} still returns the length of a string in
- characters, not in bytes. @code{string-ref} and @code{string-set!} are
- no longer guaranteed to be constant-time operations, but Guile uses
- various strategies to reduce the impact of this change.
- However, the encoding is visible via Guile's C interface, which gives
- the user direct access to a string's bytes. This chapter explains how
- to work with Guile multibyte text in C code. Since variable-width
- encodings are clumsier to work with than simple fixed-width encodings,
- Guile provides a set of standard macros and functions for manipulating
- multibyte text to make the job easier. Furthermore, Guile makes some
- promises about the encoding which you can use in writing your own text
- processing code.
- While we discuss guaranteed properties of Guile's encoding, and provide
- functions to operate on its character set, we do not actually specify
- either the character set or encoding here. This is because we expect
- both of them to change in the future: currently, Guile uses the same
- encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
- as well) to use Unicode and UTF-8, with some extensions. This will make
- it more comfortable to use Guile with other systems which use UTF-8,
- like the GTK user interface toolkit.
- @menu
- * Multibyte String Terminology::
- * Promised Properties of the Guile Multibyte Encoding::
- * Functions for Operating on Multibyte Text::
- * Multibyte Text Processing Errors::
- * Why Guile Does Not Use a Fixed-Width Encoding::
- @end menu
- @node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
- @section Multibyte String Terminology
- In the descriptions which follow, we make the following definitions:
- @table @dfn
- @item byte
- A @dfn{byte} is a number between 0 and 255. It has no inherent textual
- interpretation. So 65 is a byte, not a character.
- @item character
- A @dfn{character} is a unit of text. It has no inherent numeric value.
- @samp{A} and @samp{.} are characters, not bytes. (This is different
- from the C language's definition of @dfn{character}; in this chapter, we
- will always use a phrase like ``the C language's @code{char} type'' when
- that's what we mean.)
- @item character set
- A @dfn{character set} is an invertible mapping between numbers and a
- given set of characters. @sc{ASCII} is a character set assigning
- characters to the numbers 0 through 127. It maps @samp{A} onto the
- number 65, and @samp{.} onto 46.
- Note that a character set maps characters onto numbers, @emph{not
- necessarily} onto bytes. For example, the Unicode character set maps
- the Greek lower-case @samp{alpha} character onto the number 945, which
- is not a byte.
- (This is what Internet standards would call a ``coded character set''.)
- @item encoding
- An encoding maps numbers onto sequences of bytes. For example, the
- UTF-8 encoding, defined in the Unicode Standard, would map the number
- 945 onto the sequence of bytes @samp{206 177}. When using the
- @sc{ASCII} character set, every number assigned also happens to be a
- byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.
- (This is what Internet standards would call a ``character encoding
- scheme''.)
- @end table
- Thus, to turn a character into a sequence of bytes, you need a character
- set to assign a number to that character, and then an encoding to turn
- that number into a sequence of bytes.
- Likewise, to interpret a sequence of bytes as a sequence of characters,
- you use an encoding to extract a sequence of numbers from the bytes, and
- then a character set to turn the numbers into characters.
- Errors can occur while carrying out either of these processes. Under a
- particular encoding, a given string of bytes might not correspond to any
- number; for instance, the byte sequence @samp{128 128} is not a valid
- encoding of any number under UTF-8.
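- As an illustration of the two-step process, here is a minimal sketch in
- C that uses UTF-8 as the encoding. The names @code{utf8_encode} and
- @code{utf8_decode} are hypothetical helpers invented for this example;
- they are not part of Guile, and UTF-8 is not necessarily Guile's
- internal encoding. The sketch maps the number 945 onto the bytes
- @samp{206 177}, and rejects the byte sequence @samp{128 128}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode the number c as UTF-8 into out, which
   must have room for 4 bytes.  Return the number of bytes written, or
   0 if c is not encodable (out of range, or a surrogate).  */
size_t utf8_encode(unsigned long c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xD800 && c <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: decode one number from the len bytes at p.
   On success, store the number of bytes consumed in *used and return
   the number.  Return -1 if the bytes are not a valid encoding.  */
long utf8_decode(const unsigned char *p, size_t len, size_t *used)
{
    static const unsigned long min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c;
    size_t n, i;

    assert(p != NULL && used != NULL);
    if (len == 0)
        return -1;
    if (p[0] < 0x80)                { n = 1; c = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { n = 2; c = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; c = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; c = p[0] & 0x07; }
    else
        return -1;              /* 128..191 and 248..255 cannot lead */
    if (n > len)
        return -1;              /* incomplete encoding */
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return -1;          /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and out-of-range values.  */
    if (c < min_value[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return -1;
    *used = n;
    return (long)c;
}
```

- The character set step (deciding that lower-case alpha is number 945)
- happens outside these functions; they implement only the encoding.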
- Having carefully defined our terminology, we will now abuse it.
- We will sometimes use the word @dfn{character} to refer to the number
- assigned to a character by a character set, in contexts where it's
- obvious we mean a number.
- Sometimes there is a close association between a particular encoding and
- a particular character set. Thus, we may sometimes refer to the
- character set and encoding together as an @dfn{encoding}.
- @node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
- @section Promised Properties of the Guile Multibyte Encoding
- Internally, Guile uses a single encoding for all text --- symbols,
- strings, error messages, etc. Here we list a number of helpful
- properties of Guile's encoding. It is correct to write code which
- assumes these properties; code which uses these assumptions will be
- portable to all future versions of Guile, as far as we know.
- @b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
- the obvious way.} This means that a standard C string containing only
- @sc{ASCII} characters is a valid Guile string (except for the terminator;
- Guile strings store the length explicitly, so they can contain null
- characters).
- @b{The encodings of non-@sc{ASCII} characters use only bytes between 128
- and 255.} That is, when we turn a non-@sc{ASCII} character into a
- series of bytes, none of those bytes can ever be mistaken for the
- encoding of an @sc{ASCII} character. This means that you can search a
- Guile string for an @sc{ASCII} character using the standard
- @code{memchr} library function. By extension, you can search for an
- @sc{ASCII} substring in a Guile string using a traditional substring
- search algorithm --- you needn't add special checks to verify encoding
- boundaries, etc.
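- A minimal sketch of such a search follows. The function name
- @code{find_ascii} is an assumption made for this example, not a Guile
- function. Because the length is passed explicitly, the search also
- works on text containing null characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the byte offset of the first occurrence
   of the ASCII character ascii in the len bytes at text, or -1 if it
   does not occur.  Relies on the promise that bytes 0..127 never
   appear inside the encoding of a non-ASCII character.  */
long find_ascii(const unsigned char *text, size_t len, char ascii)
{
    const unsigned char *hit;

    assert(text != NULL);
    hit = memchr(text, ascii, len);
    return hit ? (long)(hit - text) : -1;
}
```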
- @b{No character encoding is a subsequence of any other character
- encoding.} (This is just a stronger version of the previous promise.)
- This means that you can search for occurrences of one Guile string
- within another Guile string just as if they were raw byte strings. You
- can use the stock @code{memmem} function (provided on GNU systems, at
- least) for such searches. If you don't need the ability to represent
- null characters in your text, you can still use null-termination for
- strings, and use the traditional string-handling functions like
- @code{strlen}, @code{strstr}, and @code{strcat}.
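- For null-terminated text, the search reduces to a plain call to
- @code{strstr}, as in the sketch below. The name @code{find_text} is
- hypothetical, and the bytes used to exercise it would be UTF-8, which
- is assumed here as a stand-in encoding with the same
- no-subsequence property.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the byte offset of the first occurrence
   of needle in haystack, or -1 if there is none.  Both arguments are
   null-terminated multibyte strings in the same encoding.  No
   boundary checks are needed, because one character's encoding never
   occurs inside (or straddling) other characters' encodings.  */
long find_text(const char *haystack, const char *needle)
{
    const char *hit;

    assert(haystack != NULL && needle != NULL);
    hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

- Note that the result is a byte offset, not a character index.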
- @b{You can always determine the full length of a character's encoding
- from its first byte.} Guile provides the macro @code{scm_mb_len} which
- computes the encoding's length from its first byte. Given the first
- rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
- @var{b} <= 127}, returns 1.
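- To show what such a computation can look like, here is a sketch for
- UTF-8, used only as a stand-in since this chapter does not specify
- Guile's encoding. The names @code{lead_byte_len} and @code{next_char}
- are hypothetical; unlike @code{scm_mb_len}, the sketch returns 0 for a
- byte that cannot start a character instead of leaving the behavior
- undefined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a first-byte length lookup, for UTF-8.
   Return the length in bytes of the encoding that starts with b, or 0
   if b cannot be the first byte of an encoding.  */
int lead_byte_len(unsigned char b)
{
    if (b < 0x80)
        return 1;               /* ASCII: always one byte */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;                   /* continuation or invalid byte */
}

/* Step from the start of one character to the start of the next,
   looking only at the first byte.  p must point at a valid lead byte.  */
const unsigned char *next_char(const unsigned char *p)
{
    int len;

    assert(p != NULL);
    len = lead_byte_len(*p);
    assert(len != 0);
    return p + len;
}
```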
- @b{Given an arbitrary byte position in a Guile string, you can always
- find the beginning and end of the character containing that byte without
- scanning too far in either direction.} This means that, if you are sure
- a byte sequence is a valid encoding of a character sequence, you can
- find character boundaries without keeping track of the beginning and
- ending of the overall string. This promise relies on the fact that, in
- addition to storing the string's length explicitly, Guile always either
- terminates the string's storage with a zero byte, or shares it with
- another string which is terminated this way.
- @node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
- @section Functions for Operating on Multibyte Text
- Guile provides a variety of functions, variables, and types for working
- with multibyte text.
- @menu
- * Basic Multibyte Character Processing::
- * Finding Character Encoding Boundaries::
- * Multibyte String Functions::
- * Exchanging Guile Text With the Outside World in C::
- * Implementing Your Own Text Conversions::
- @end menu
- @node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
- @subsection Basic Multibyte Character Processing
- Here are the essential types and functions for working with Guile text.
- Guile uses the C type @code{unsigned char *} to refer to text encoded
- with Guile's encoding.
- Note that any operation marked here as a ``Libguile Macro'' might
- evaluate its argument multiple times.
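- The sketch below shows why this matters. @code{MB_LEN_SKETCH} is a
- hypothetical macro written for this example (it is not Guile's
- definition of @code{scm_mb_len}); it mentions its argument twice, so
- an argument with a side effect such as @code{*p++} may be evaluated
- twice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro, NOT Guile's: the argument b appears twice, so
   it may be evaluated twice.  */
#define MB_LEN_SKETCH(b) ((b) < 0x80 ? 1 : (b) < 0xE0 ? 2 : 3)

/* Careless use: the side effect *p++ sits inside the macro argument.
   Return how many bytes p advanced.  */
size_t advance_careless(const unsigned char *text)
{
    const unsigned char *p = text;

    assert(text != NULL);
    (void) MB_LEN_SKETCH(*p++);     /* p may advance more than once */
    return (size_t)(p - text);
}

/* Careful use: perform the side effect first, then pass a plain
   variable to the macro.  Return how many bytes p advanced.  */
size_t advance_careful(const unsigned char *text)
{
    const unsigned char *p = text;
    unsigned char b;

    assert(text != NULL);
    b = *p++;                       /* exactly one increment */
    (void) MB_LEN_SKETCH(b);
    return (size_t)(p - text);
}
```

- When the first byte is 128 or greater, the careless version advances
- the pointer by two bytes instead of one. Fetching the byte into a
- variable first, or calling the corresponding @code{_func} function,
- avoids the problem.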
- @deftp {Libguile Type} scm_char_t
- This is a signed integral type large enough to hold any character in
- Guile's character set. All character numbers are positive.
- @end deftp
- @deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
- Return the character whose encoding starts at @var{p}. If @var{p} does
- not point at a valid character encoding, the behavior is undefined.
- @end deftypefn
- @deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
- Place the encoded form of the Guile character @var{c} at @var{p}, and
- return its length in bytes. If @var{c} is not a Guile character, the
- behavior is undefined.
- @end deftypefn
- @deftypevr {Libguile Constant} int scm_mb_max_len
- The maximum length of any character's encoding, in bytes. You may
- assume this is relatively small --- less than a dozen or so.
- @end deftypevr
- @deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
- If @var{b} is the first byte of a character's encoding, return the full
- length of the character's encoding, in bytes. If @var{b} is not a valid
- leading byte, the behavior is undefined.
- @end deftypefn
- @deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
- Return the length of the encoding of the character @var{c}, in bytes.
- If @var{c} is not a valid Guile character, the behavior is undefined.
- @end deftypefn
- @deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
- @deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
- @deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
- @deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
- These are functions identical to the corresponding macros. You can use
- them in situations where the overhead of a function call is acceptable,
- and the cleaner semantics of function application are desirable.
- @end deftypefn
- @node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
- @subsection Finding Character Encoding Boundaries
- These are functions for finding the boundaries between characters in
- multibyte text.
- Note that any operation marked here as a ``Libguile Macro'' might
- evaluate its argument multiple times, unless the definition promises
- otherwise.
- @deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
- Return non-zero iff @var{p} points to the start of a character in
- multibyte text.
- This macro will evaluate its argument only once.
- @end deftypefn
- @deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
- ``Round'' @var{p} to the previous character boundary. That is, if
- @var{p} points to the middle of the encoding of a Guile character,
- return a pointer to the first byte of the encoding. If @var{p} points
- to the start of the encoding of a Guile character, return @var{p}
- unchanged.
- @end deftypefn
- @deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
- ``Round'' @var{p} to the next character boundary. That is, if @var{p}
- points to the middle of the encoding of a Guile character, return a
- pointer to the first byte of the encoding of the next character. If
- @var{p} points to the start of the encoding of a Guile character, return
- @var{p} unchanged.
- @end deftypefn
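- To make the rounding behavior concrete, here is a sketch of the same
- two operations for UTF-8, assumed as a stand-in encoding. The names
- @code{mb_floor_sketch} and @code{mb_ceiling_sketch} are hypothetical
- and say nothing about how Guile implements @code{scm_mb_floor} and
- @code{scm_mb_ceiling}. The sketch assumes the text is valid and is
- followed by a zero byte, so neither loop runs off the ends.

```c
#include <assert.h>
#include <stddef.h>

/* In UTF-8, every byte of an encoding except the first has the bit
   pattern 10xxxxxx.  */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical sketch: round p down to the start of the character
   containing it.  Valid text always begins with a non-continuation
   byte, so the loop stops at or before the start of the string.  */
const unsigned char *mb_floor_sketch(const unsigned char *p)
{
    assert(p != NULL);
    while (is_continuation(*p))
        p--;
    return p;
}

/* Hypothetical sketch: round p up to the next character boundary.
   The terminating zero byte is not a continuation byte, so the loop
   stops at or before the terminator.  */
const unsigned char *mb_ceiling_sketch(const unsigned char *p)
{
    assert(p != NULL);
    while (is_continuation(*p))
        p++;
    return p;
}
```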
- Note that it is usually not friendly for functions to silently correct
- byte offsets that point into the middle of a character's encoding. Such
- offsets almost always indicate a programming error, and they should be
- reported as early as possible. So, when you write code which operates
- on multibyte text, you should not use functions like these to ``clean
- up'' byte offsets which the originator believes to be correct; instead,
- your code should signal a @code{text:not-char-boundary} error as soon as
- it detects an invalid offset. @xref{Multibyte Text Processing Errors}.
- @node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
- @subsection Multibyte String Functions
- These functions allow you to operate on multibyte strings: sequences of
- character encodings.
- @deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
- Return the number of Guile characters encoded by the @var{len} bytes at
- @var{p}.
- If the sequence contains any invalid character encodings, or ends with
- an incomplete character encoding, signal a @code{text:bad-encoding}
- error.
- @end deftypefn
- @deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
- Return the character whose encoding starts at @code{*@var{pp}}, and
- advance @code{*@var{pp}} to the start of the next character. Return -1
- if @code{*@var{pp}} does not point to a valid character encoding.
- @end deftypefn
- @deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
- If @var{p} points to the middle of the encoding of a Guile character,
- return a pointer to the first byte of the encoding. If @var{p} points
- to the start of the encoding of a Guile character, return the start of
- the previous character's encoding.
- This is like @code{scm_mb_floor}, but the returned pointer will always
- be before @var{p}. If you use this function to drive an iteration, it
- guarantees backward progress.
- @end deftypefn
- @deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
- If @var{p} points to the encoding of a Guile character, return a pointer
- to the first byte of the encoding of the next character.
- This is like @code{scm_mb_ceiling}, but the returned pointer will always
- be after @var{p}. If you use this function to drive an iteration, it
- guarantees forward progress.
- @end deftypefn
- @deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
- Assuming that the @var{len} bytes starting at @var{p} are a
- concatenation of valid character encodings, return a pointer to the
- start of the @var{i}'th character encoding in the sequence.
- This function scans the sequence from the beginning to find the
- @var{i}'th character, and will generally require time proportional to
- the distance from @var{p} to the returned address.
- If the sequence contains any invalid character encodings, or ends with
- an incomplete character encoding, signal a @code{text:bad-encoding}
- error.
- @end deftypefn
- It is common to process the characters in a string from left to right.
- However, if you fetch each character using @code{scm_mb_index}, each
- call will scan the text from the beginning, so your loop will require
- time proportional to at least the square of the length of the text. To
- avoid this poor performance, you can use an @code{scm_mb_cache}
- structure and the @code{scm_mb_index_cached} macro.
- @deftp {Libguile Type} {struct scm_mb_cache}
- This structure holds information that allows a string scanning operation
- to use the results from a previous scan of the string. It has the
- following members:
- @table @code
- @item character
- An index, in characters, into the string.
- @item byte
- The index, in bytes, of the start of that character.
- @end table
- In other words, @code{byte} is the byte offset of the
- @code{character}'th character of the string. Note that if @code{byte}
- and @code{character} are equal, then all characters before that point
- must have encodings exactly one byte long, and the string can be indexed
- normally.
- All elements of a @code{struct scm_mb_cache} structure should be
- initialized to zero before its first use, and whenever the string's text
- changes.
- @end deftp
- @deftypefn {Libguile Macro} const unsigned char *scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
- @deftypefnx {Libguile Function} const unsigned char *scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
- This macro and this function are identical to @code{scm_mb_index},
- except that they may consult and update @code{*@var{cache}} in order to avoid
- scanning the string from the beginning. @code{scm_mb_index_cached} is a
- macro, so it may have less overhead than
- @code{scm_mb_index_cached_func}, but it may evaluate its arguments more
- than once.
- Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
- can scan a string from left to right, or from right to left, in time
- proportional to the length of the string. As long as each character
- fetched is less than some constant distance before or after the previous
- character fetched with @var{cache}, each access will require constant
- time.
- @end deftypefn
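- The idea behind the cache can be sketched as follows, again assuming
- UTF-8 as a stand-in encoding. @code{struct mb_cache_sketch} and
- @code{index_cached_sketch} are hypothetical names that mirror the
- members and argument order described above; the sketch returns
- @code{NULL} for an out-of-range index, where the real function is
- documented to signal an error for bad encodings, and it assumes the
- text is valid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the cache described above: byte is the byte
   offset of the character'th character.  Zero-initialize before use.  */
struct mb_cache_sketch {
    int character;
    int byte;
};

static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;      /* UTF-8 continuation byte */
}

/* Return a pointer to the start of the i'th character in the len
   bytes at p, walking forward or backward from the cached position
   instead of rescanning from the beginning.  Return NULL, leaving the
   cache unchanged, if there is no i'th character.  */
const unsigned char *index_cached_sketch(const unsigned char *p, int len,
                                         int i, struct mb_cache_sketch *cache)
{
    int ch, by;

    assert(p != NULL && cache != NULL);
    ch = cache->character;
    by = cache->byte;
    if (i < 0)
        return NULL;
    while (ch < i) {                /* target is after the cached spot */
        if (by >= len)
            return NULL;
        by++;
        while (by < len && is_continuation(p[by]))
            by++;
        ch++;
    }
    while (ch > i) {                /* target is before the cached spot */
        by--;
        while (by > 0 && is_continuation(p[by]))
            by--;
        ch--;
    }
    if (by >= len)
        return NULL;                /* i equals the character count */
    cache->character = ch;
    cache->byte = by;
    return p + by;
}
```

- Each call costs time proportional to the distance between the
- requested character and the cached one, which is why a left-to-right
- or right-to-left scan takes linear time overall.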
- Guile also provides functions to convert between an encoded sequence of
- characters, and an array of @code{scm_char_t} objects.
- @deftypefn {Libguile Function} scm_char_t *scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
- Convert the variable-width text in the @var{len} bytes at @var{p}
- to an array of @code{scm_char_t} values. Return a pointer to the array,
- and set @code{*@var{result_len}} to the number of elements it contains.
- The returned array is allocated with @code{malloc}, and it is the
- caller's responsibility to free it.
- If the text is not a sequence of valid character encodings, this
- function will signal a @code{text:bad-encoding} error.
- @end deftypefn
- @deftypefn {Libguile Function} unsigned char *scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
- Convert the array of @code{scm_char_t} values to a sequence of
- variable-width character encodings. Return a pointer to the array of
- bytes, and set @code{*@var{result_len}} to its length, in bytes.
- The returned byte sequence is terminated with a zero byte, which is not
- counted in the length returned in @code{*@var{result_len}}.
- The returned byte sequence is allocated with @code{malloc}; it is the
- caller's responsibility to free it.
- If the text is not a sequence of valid character encodings, this
- function will signal a @code{text:bad-encoding} error.
- @end deftypefn
- @node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
- @subsection Exchanging Guile Text With the Outside World in C
- [[This is kind of a heavy-weight model, given that one end of the
- conversion is always going to be the Guile encoding. Any way to shorten
- things a bit?]]
- Guile provides functions for converting between Guile's internal text
- representation and encodings popular in the outside world. These
- functions are closely modeled after the @code{iconv} functions available
- on some systems.
- To convert text between two encodings, you should first call
- @code{scm_mb_iconv_open} to indicate the source and destination
- encodings; this function returns a context object which records the
- conversion to perform.
- Then, you should call @code{scm_mb_iconv} to actually convert the text.
- This function expects input and output buffers, and a pointer to the
- context you got from @code{scm_mb_iconv_open}. You don't need to pass
- all your input to @code{scm_mb_iconv} at once; you can invoke it on
- successive blocks of input (as you read it from a file, say), and it
- will convert as much as it can each time, indicating when you should
- grow your output buffer.
- An encoding may be @dfn{stateless}, or @dfn{stateful}. In most
- encodings, a contiguous group of bytes from the sequence completely
- specifies a particular character; these are stateless encodings.
- However, some encodings require you to look back an unbounded number of
- bytes in the stream to assign a meaning to a particular byte sequence;
- such encodings are stateful.
- For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
- byte sequence @samp{27 36 66} indicates that subsequent bytes should be
- taken in pairs and interpreted as characters from the JIS-0208 character
- set. An arbitrary number of byte pairs may follow this sequence. The
- byte sequence @samp{27 40 66} indicates that subsequent bytes should be
- interpreted as @sc{ASCII}. In this encoding, you cannot tell whether a
- given byte is an @sc{ASCII} character without looking back an arbitrary
- distance for the most recent escape sequence, so it is a stateful
- encoding.
- In Guile, if a conversion involves a stateful encoding, the context
- object carries any necessary state. Thus, you can have many independent
- conversions to or from stateful encodings taking place simultaneously,
- as long as each data stream uses its own context object for the
- conversion.
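- The following sketch illustrates why the state has to travel with the
- stream. It is a deliberately simplified model of the two escape
- sequences mentioned above, not a real @samp{ISO-2022-JP} decoder, and
- @code{struct decode_state} and @code{count_chars} are hypothetical
- names unrelated to @code{struct scm_mb_iconv}. For brevity it assumes
- that a chunk never splits an escape sequence or a byte pair.

```c
#include <assert.h>
#include <stddef.h>

enum charset_state { STATE_ASCII, STATE_JIS0208 };

/* Hypothetical context object: remembers which character set the
   most recent escape sequence selected.  */
struct decode_state {
    enum charset_state charset;
};

/* Count the characters in the len bytes at p, starting from and
   updating the state in st.  Return the count, or -1 if the chunk
   contains an unrecognized or truncated sequence.  */
long count_chars(struct decode_state *st, const unsigned char *p, size_t len)
{
    long count = 0;
    size_t i = 0;

    assert(st != NULL && p != NULL);
    while (i < len) {
        if (p[i] == 27) {                       /* escape sequence */
            if (len - i < 3 || p[i + 2] != 66)
                return -1;
            if (p[i + 1] == 36)                 /* 27 36 66 */
                st->charset = STATE_JIS0208;
            else if (p[i + 1] == 40)            /* 27 40 66 */
                st->charset = STATE_ASCII;
            else
                return -1;
            i += 3;
        } else if (st->charset == STATE_JIS0208) {
            if (len - i < 2)
                return -1;
            i += 2;                             /* one byte pair */
            count++;
        } else {
            i += 1;                             /* one ASCII byte */
            count++;
        }
    }
    return count;
}
```

- Feeding the second chunk of a stream to a fresh context gives a
- different answer than feeding it to the context that saw the first
- chunk, which is exactly the information a context object carries.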
- @deftp {Libguile Type} {struct scm_mb_iconv}
- This is the type for context objects, which represent the encodings and
- current state of an ongoing text conversion. A @code{struct
- scm_mb_iconv} records the source and destination encodings, and keeps
- track of any information needed to handle stateful encodings.
- @end deftp
- @deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
- Return a pointer to a new @code{struct scm_mb_iconv} context object,
- ready to convert from the encoding named @var{fromcode} to the encoding
- named @var{tocode}. For stateful encodings, the context object is in
- some appropriate initial state, ready for use with the
- @code{scm_mb_iconv} function.
- When you are done using a context object, you may call
- @code{scm_mb_iconv_close} to free it.
- If either @var{tocode} or @var{fromcode} is not the name of a known
- encoding, this function will signal the @code{text:unknown-conversion}
- error, described below.
- @c Try to use names here from the IANA list:
- @c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
- Guile supports at least these encodings:
- @table @samp
- @item US-ASCII
- @sc{US-ASCII}, in the standard one-character-per-byte encoding.
- @item ISO-8859-1
- The usual character set for Western European languages, in its usual
- one-character-per-byte encoding.
- @item Guile-MB
- Guile's current internal multibyte encoding. The actual encoding this
- name refers to will change from one version of Guile to the next. You
- should use this when converting data between external sources and the
- encoding used by Guile objects.
- You should @emph{not} use this as the encoding for data presented to the
- outside world, for two reasons. 1) Its meaning will change over time,
- so data written using the @samp{Guile-MB} encoding with one version of
- Guile might not be readable with the @samp{Guile-MB} encoding in another
- version of Guile. 2) It currently corresponds to @samp{Emacs-Mule},
- which was invented for Emacs's internal use, and was never intended to
- serve as an exchange medium.
- @item Guile-Wide
- Guile's character set, as an array of @code{scm_char_t} values.
- Note that this encoding is even less suitable for public use than
- @samp{Guile-MB}, since the exact sequence of bytes depends heavily on the
- size and endianness the host system uses for @code{scm_char_t}. Using
- this encoding is very much like calling the
- @code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
- functions, except that @code{scm_mb_iconv} gives you more control over
- buffer allocation and management.
- @item Emacs-Mule
- This is the variable-length encoding used for multilingual text by GNU
- Emacs, at least through version 20.4. You probably should not use this
- encoding, as it is designed only for Emacs's internal use. However, we
- provide it here because it's trivial to support, and some people
- probably do have @samp{emacs-mule}-format files lying around.
- @end table
- (At the moment, this list doesn't include any character sets suitable for
- external use that can actually handle multilingual data; this is
- unfortunate, as it encourages users to write data in Emacs-Mule format,
- which nobody but Emacs and Guile understands. We hope to add support
- for Unicode in UTF-8 soon, which should solve this problem.)
- Case is not significant in encoding names.
- You can define your own conversions; see @ref{Implementing Your Own Text
- Conversions}.
- @end deftypefn
- @deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
- Return a non-zero value if Guile supports the encoding named
- @var{encoding}, and zero otherwise.
- @end deftypefn
- @deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
- Convert a sequence of characters from one encoding to another. The
- argument @var{context} specifies the encodings to use for the input and
- output, and carries state for stateful encodings; use
- @code{scm_mb_iconv_open} to create a @var{context} object for a
- particular conversion.
- Upon entry to the function, @code{*@var{inbuf}} should point to the
- input buffer, and @code{*@var{inbytesleft}} should hold the number of
- input bytes present in the buffer; @code{*@var{outbuf}} should point to
- the output buffer, and @code{*@var{outbytesleft}} should hold the number
- of bytes available to hold the conversion results in that buffer.
- Upon exit from the function, @code{*@var{inbuf}} points to the first
- unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
- of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
- the last output byte, and @code{*@var{outbytesleft}} holds the number of
- bytes left unused in the output buffer.
- For stateful encodings, @var{context} carries encoding state from one
- call to @code{scm_mb_iconv} to the next. Thus, successive calls to
- @code{scm_mb_iconv} which use the same context object can convert a
- stream of data one chunk at a time.
- If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
- taken as a request to reset the states of the input and the output
- encodings. If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
- non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
- buffer to put the output encoding in its initial state. If the output
- buffer is not large enough to hold this byte sequence,
- @code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves
- the shift states of @var{context}'s input and output encodings
- unchanged.
- The @code{scm_mb_iconv} function always consumes only complete
- characters or shift sequences from the input buffer, and the output
- buffer always contains a sequence of complete characters or escape
- sequences.
- If the input sequence contains characters which are not expressible in
- the output encoding, @code{scm_mb_iconv} converts it in an
- implementation-defined way. It may simply delete the character.
- Some encodings use byte sequences which do not correspond to any textual
- character. For example, the escape sequence of a stateful encoding has
- no textual meaning. When converting from such an encoding, a call to
- @code{scm_mb_iconv} might consume input but produce no output, since the
- input sequence might contain only escape sequences.
- Normally, @code{scm_mb_iconv} returns the number of input characters it
- could not convert perfectly to the output encoding. However, it may
- return one of the @code{scm_mb_iconv_} codes described below, to
- indicate an error. All of these codes are negative values.
- If the input sequence contains an invalid character encoding, conversion
- stops before the invalid input character, and @code{scm_mb_iconv}
- returns the constant value @code{scm_mb_iconv_bad_encoding}.
- If the input sequence ends with an incomplete character encoding,
- @code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
- return the constant value @code{scm_mb_iconv_incomplete_encoding}. This
- is not necessarily an error, if you expect to call @code{scm_mb_iconv}
- again with more data which might contain the rest of the encoding
- fragment.
- If the output buffer does not contain enough room to hold the converted
- form of the complete input text, @code{scm_mb_iconv} converts as much as
- it can, changes the input and output pointers to reflect the amount of
- text successfully converted, and then returns
- @code{scm_mb_iconv_too_big}.
- @end deftypefn
- Here are the status codes that might be returned by @code{scm_mb_iconv}.
- They are all negative integers.
- @table @code
- @item scm_mb_iconv_too_big
- The conversion needs more room in the output buffer. Some characters
- may have been consumed from the input buffer, and some characters may
- have been placed in the available space in the output buffer.
- @item scm_mb_iconv_bad_encoding
- @code{scm_mb_iconv} encountered an invalid character encoding in the
- input buffer. Conversion stopped before the invalid character, so there
- may be some characters consumed from the input buffer, and some
- converted text in the output buffer.
- @item scm_mb_iconv_incomplete_encoding
- The input buffer ends with an incomplete character encoding. The
- incomplete encoding is left in the input buffer, unconsumed. This is
- not necessarily an error, if you expect to call @code{scm_mb_iconv}
- again with more data which might contain the rest of the incomplete
- encoding.
- @end table
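The pointer-update contract and the ``too big'' status described above can be sketched with a small stand-in converter. The code below is not Guile code: @code{sketch_latin1_to_utf8} and @code{SKETCH_TOO_BIG} are hypothetical names, and the value @code{-1} is an assumption (the text says only that the real codes are negative). It converts Latin-1 to UTF-8, advances both buffer pointers as it goes, and never emits a partial character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status code for this sketch; the manual says only that
   the real scm_mb_iconv_ codes are negative.  */
enum { SKETCH_TOO_BIG = -1 };

/* Stand-in converter (Latin-1 to UTF-8) following the same
   pointer-update contract described for scm_mb_iconv.  */
int
sketch_latin1_to_utf8 (const char **inbuf, size_t *inbytesleft,
                       char **outbuf, size_t *outbytesleft)
{
  assert (inbuf && *inbuf && outbuf && *outbuf);

  while (*inbytesleft > 0)
    {
      unsigned char c = (unsigned char) **inbuf;
      size_t need = c < 0x80 ? 1 : 2;   /* UTF-8 bytes for this character */

      /* Never emit a partial character: stop with the input byte
         still unconsumed and let the caller supply more room.  */
      if (*outbytesleft < need)
        return SKETCH_TOO_BIG;

      if (need == 1)
        *(*outbuf)++ = (char) c;
      else
        {
          *(*outbuf)++ = (char) (0xC0 | (c >> 6));
          *(*outbuf)++ = (char) (0x80 | (c & 0x3F));
        }
      *outbytesleft -= need;
      (*inbuf)++;
      (*inbytesleft)--;
    }

  return 0;   /* every character was converted exactly */
}
```

A caller that receives the ``too big'' status can drain the output buffer and call again with the same input pointers, since they already point at the first unconsumed byte.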
- Finally, Guile provides a function for destroying conversion contexts.
- @deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
- Deallocate the conversion context object @var{context}, and all other
- resources allocated by the call to @code{scm_mb_iconv_open} which
- returned @var{context}.
- @end deftypefn
- @node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
- @subsection Implementing Your Own Text Conversions
- [[note that conversions to and from Guile must produce streams
- containing only valid character encodings, or else Guile will crash]]
- This section describes the interface for adding your own encoding
- conversions for use with @code{scm_mb_iconv}. The interface here is
- borrowed from the GNOME Project's @file{libunicode} library.
- Guile's @code{scm_mb_iconv} function works by converting the input text
- to a stream of @code{scm_char_t} characters, and then converting
- those characters to the desired output encoding. This makes it easy
- for Guile to choose the appropriate conversion back ends for an
- arbitrary pair of input and output encodings, but it also means that the
- accuracy and quality of the conversions depend on the fidelity of
- Guile's internal character set to the source and destination encodings.
- Since @code{scm_mb_iconv} will be used almost exclusively for converting
- to and from Guile's internal character set, this shouldn't be a problem.
- To add support for a particular encoding to Guile, you must provide one
- function (called the @dfn{read} function) which converts from your
- encoding to an array of @code{scm_char_t}'s, and another function
- (called the @dfn{write} function) to convert from an array of
- @code{scm_char_t}'s back into your encoding. To convert from some
- encoding @var{a} to some other encoding @var{b}, Guile pairs up
- @var{a}'s read function with @var{b}'s write function. Each call to
- @code{scm_mb_iconv} passes text in encoding @var{a} through the read
- function, to produce an array of @code{scm_char_t}'s, and then passes
- that array to the write function, to produce text in encoding @var{b}.
- For stateful encodings, a read or write function can hang its own data
- structures off the conversion object, and provide its own functions to
- allocate and destroy them; this allows read and write functions to
- maintain whatever state they like.
- The Guile conversion back end represents each available encoding with a
- @code{struct scm_mb_encoding} object.
- @deftp {Libguile Type} {struct scm_mb_encoding}
- This data structure describes an encoding. It has the following
- members:
- @table @code
- @item char **names
- An array of strings, giving the various names for this encoding. The
- array should be terminated by a zero pointer. Case is not significant
- in encoding names.
- The @code{scm_mb_iconv_open} function searches the list of registered
- encodings for an encoding whose @code{names} array matches its
- @var{tocode} or @var{fromcode} argument.
- @item int (*init) (void **@var{cookie})
- An initialization function for the encoding's private data.
- @code{scm_mb_iconv_open} will call this function, passing it the address
- of the cookie for this encoding in this context. (We explain cookies
- below.) There is no way for the @code{init} function to tell whether
- the encoding will be used for reading or writing.
- Note that @code{init} receives a @emph{pointer} to the cookie, not the
- cookie itself. Because the type of @var{cookie} is @code{void **}, the
- C compiler will not check it as carefully as it would other types.
- The @code{init} member may be zero, indicating that no initialization is
- necessary for this encoding.
- @item int (*destroy) (void **@var{cookie})
- A deallocation function for the encoding's private data.
- @code{scm_mb_iconv_close} calls this function, passing it the address of
- the cookie for this encoding in this context. The @code{destroy}
- function should free any data the @code{init} function allocated.
- Note that @code{destroy} receives a @emph{pointer} to the cookie, not the
- cookie itself. Because the type of @var{cookie} is @code{void **}, the
- C compiler will not check it as carefully as it would other types.
- The @code{destroy} member may be zero, indicating that this encoding
- doesn't need to perform any special action to destroy its local data.
- @item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
- Put the encoding into its initial shift state. Guile calls this
- function whether the encoding is being used for input or output, so this
- should take appropriate steps for both directions. If @var{outbuf} and
- @var{outbytesleft} are valid, the reset function should emit an escape
- sequence to reset the output stream to its initial state; @var{outbuf}
- and @var{outbytesleft} should be handled just as for
- @code{scm_mb_iconv}.
- This function can return an @code{scm_mb_iconv_} error code
- (@pxref{Exchanging Guile Text With the Outside World in C}). If it
- returns @code{scm_mb_iconv_too_big}, then the output buffer's shift
- state must be left unchanged.
- Note that @code{reset} receives the cookie's value itself, not a pointer
- to the cookie, as the @code{init} and @code{destroy} functions do.
- The @code{reset} member may be zero, indicating that this encoding
- doesn't use a shift state.
- @item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
- Read some bytes and convert into an array of Guile characters. This is
- the encoding's read function.
- On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to
- be converted, and *@var{outcharsleft} characters available at
- *@var{outbuf} to hold the results.
- On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes
- still not consumed. *@var{outcharsleft} and *@var{outbuf} indicate the
- output buffer space still not filled. (By exclusion, these indicate
- which input bytes were consumed, and which output characters were
- produced.)
- Return one of the @code{enum scm_mb_read_result} values, described below.
- Note that @code{read} receives the cookie's value itself, not a pointer
- to the cookie, as the @code{init} and @code{destroy} functions do.
- @item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
- Convert an array of Guile characters to output bytes. This is
- the encoding's write function.
- On entry, there are *@var{incharsleft} Guile characters available at
- *@var{inbuf}, and *@var{outbytesleft} bytes available to store output at
- *@var{outbuf}.
- On exit, *@var{incharsleft} and *@var{inbuf} indicate the number of
- Guile characters left unconverted (because there was insufficient room
- in the output buffer to hold their converted forms), and
- *@var{outbytesleft} and *@var{outbuf} indicate the unused portion of the
- output buffer.
- Return one of the @code{enum scm_mb_write_result} values, described below.
- Note that @code{write} receives the cookie's value itself, not a pointer
- to the cookie, as the @code{init} and @code{destroy} functions do.
- @item struct scm_mb_encoding *next
- This is used by Guile to maintain a linked list of encodings. It is
- filled in when you call @code{scm_mb_register_encoding} to add your
- encoding to the list.
- @end table
- @end deftp
- Here is the enumerated type for the values an encoding's read function
- can return:
- @deftp {Libguile Type} {enum scm_mb_read_result}
- This type represents the result of a call to an encoding's read
- function. It has the following values:
- @table @code
- @item scm_mb_read_ok
- The read function consumed at least one byte of input.
- @item scm_mb_read_incomplete
- The data present in the input buffer does not contain a complete
- character encoding. No input was consumed, and no characters were
- produced as output. This is not necessarily an error status, if there
- is more data to pass through.
- @item scm_mb_read_error
- The input contains an invalid character encoding.
- @end table
- @end deftp
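As an illustration of the read-function contract, here is a minimal sketch of a read function for big-endian UCS-2. The names @code{sketch_char_t}, @code{enum sketch_read_result} and @code{sketch_ucs2be_read} are stand-ins defined only so the example compiles on its own; they are not Guile's definitions. The sketch assumes the caller always leaves room for at least one output character, and it reports an invalid code unit only when that unit is the first thing in the buffer. Both choices are assumptions, not requirements stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for Guile's types, defined here only for the sketch.  */
typedef unsigned long sketch_char_t;
enum sketch_read_result
{
  sketch_read_ok,
  sketch_read_incomplete,
  sketch_read_error
};

/* Read function for big-endian UCS-2: two bytes per character, with
   surrogate code units rejected as invalid.  */
enum sketch_read_result
sketch_ucs2be_read (void *cookie,
                    const char **inbuf, size_t *inbytesleft,
                    sketch_char_t **outbuf, size_t *outcharsleft)
{
  size_t consumed = 0;

  (void) cookie;                /* stateless encoding: cookie unused */
  assert (*outcharsleft > 0);   /* assumption: room for one character */

  while (*inbytesleft >= 2 && *outcharsleft > 0)
    {
      const unsigned char *p = (const unsigned char *) *inbuf;
      sketch_char_t c = ((sketch_char_t) p[0] << 8) | p[1];

      /* Stop before an invalid unit; report the error only when
         nothing valid preceded it in this call.  */
      if (c >= 0xD800 && c <= 0xDFFF)
        return consumed ? sketch_read_ok : sketch_read_error;

      *(*outbuf)++ = c;
      (*outcharsleft)--;
      *inbuf += 2;
      *inbytesleft -= 2;
      consumed += 2;
    }

  /* Nothing consumed means fewer than two bytes were available.  */
  if (consumed == 0)
    return sketch_read_incomplete;
  return sketch_read_ok;
}
```

A trailing odd byte is left in the input buffer, so the next call with more data can complete the character.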
- Here is the enumerated type for the values an encoding's write function
- can return:
- @deftp {Libguile Type} {enum scm_mb_write_result}
- This type represents the result of a call to an encoding's write
- function. It has the following values:
- @table @code
- @item scm_mb_write_ok
- The write function was able to convert all the characters in @var{inbuf}
- successfully.
- @item scm_mb_write_too_big
- The write function filled the output buffer, but there are still
- characters in @var{inbuf} left unconsumed; @var{inbuf} and
- @var{incharsleft} indicate the unconsumed portion of the input buffer.
- @end table
- @end deftp
- Conversions to or from stateful encodings need to keep track of each
- encoding's current state. Each conversion context contains two
- @code{void *} variables called @dfn{cookies}, one for the input
- encoding, and one for the output encoding. These cookies are passed to
- the encodings' functions, for them to use however they please. A
- stateful encoding can use its cookie to hold a pointer to some object
- which maintains the context's current shift state. Stateless encodings
- will probably not use their cookies.
- The cookies' lifetime is the same as that of the context object. When
- the user calls @code{scm_mb_iconv_close} to destroy a context object,
- @code{scm_mb_iconv_close} calls the input and output encodings'
- @code{destroy} functions, passing them their respective cookies, so each
- encoding can free any data it allocated for that context.
- Note that, if a read or write function returns a successful result code
- like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining
- input and the output must together represent the complete
- input text; the encoding may not store any text temporarily in its
- cookie. This is because, if @code{scm_mb_iconv} returns a successful
- result to the user, it is correct for the user to assume that all the
- consumed input has been converted and placed in the output buffer.
- There is no ``flush'' operation to push any final results out of the
- encodings' buffers.
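The cookie life cycle can be sketched as follows for a hypothetical stateful encoding with a single alternate shift state. Everything named @code{sketch_} or @code{SKETCH_} is invented for illustration. The use of the SI control byte (0x0F) as the reset sequence, the zero return value for success, and the value of @code{SKETCH_TOO_BIG} are all assumptions; the text above does not specify them.

```c
#include <assert.h>
#include <stdlib.h>

enum { SKETCH_TOO_BIG = -1 };   /* hypothetical value */

/* Hypothetical per-context state for a stateful encoding.  */
struct sketch_shift_state
{
  int shifted;   /* nonzero while in the alternate shift state */
};

/* init receives the ADDRESS of the cookie and stores new state in it.
   Returning 0 for success is an assumption of this sketch.  */
int
sketch_init (void **cookie)
{
  struct sketch_shift_state *st;

  assert (cookie != NULL);
  st = calloc (1, sizeof *st);
  if (st == NULL)
    return -1;
  *cookie = st;
  return 0;
}

/* reset receives the cookie VALUE.  When an output buffer is supplied
   and the encoding is shifted, emit the byte that returns the stream
   to its initial state.  */
int
sketch_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  struct sketch_shift_state *st = cookie;

  if (st->shifted && outbuf != NULL && *outbuf != NULL)
    {
      if (*outbytesleft < 1)
        return SKETCH_TOO_BIG;   /* shift state left unchanged */
      *(*outbuf)++ = 0x0F;       /* SI control byte */
      (*outbytesleft)--;
    }
  st->shifted = 0;
  return 0;
}

/* destroy receives the ADDRESS of the cookie and frees what init
   allocated.  */
int
sketch_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}
```

Only the shift state lives in the cookie; no converted text is held back there, in keeping with the rule that there is no flush operation.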
- Here is the function you call to register a new encoding with the
- conversion system:
- @deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
- Add the encoding described by @code{*@var{encoding}} to the set
- understood by @code{scm_mb_iconv_open}. Once you have registered your
- encoding, you can use it by calling @code{scm_mb_iconv_open} with one of
- the names in @code{@var{encoding}->names}.
- @end deftypefn
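One plausible way to implement registration and lookup is a singly linked list searched with a case-insensitive name comparison. The sketch below assumes that design. @code{struct sketch_encoding} keeps only the @code{names} and @code{next} members, and @code{sketch_register}, @code{sketch_find} and @code{sketch_names_equal} are hypothetical helpers, not Guile functions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Reduced stand-in for struct scm_mb_encoding.  */
struct sketch_encoding
{
  char **names;                   /* zero-terminated array of names */
  struct sketch_encoding *next;   /* filled in by sketch_register */
};

/* Head of the list of registered encodings.  */
struct sketch_encoding *sketch_encodings = NULL;

/* Push ENCODING onto the front of the list.  */
void
sketch_register (struct sketch_encoding *encoding)
{
  assert (encoding != NULL && encoding->names != NULL);
  encoding->next = sketch_encodings;
  sketch_encodings = encoding;
}

/* Compare two names, ignoring case.  Return nonzero if they match.  */
int
sketch_names_equal (const char *a, const char *b)
{
  while (*a && *b)
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  return *a == *b;   /* equal only if both strings ended together */
}

/* Return the registered encoding with a name matching NAME, or NULL.  */
struct sketch_encoding *
sketch_find (const char *name)
{
  struct sketch_encoding *e;
  char **n;

  for (e = sketch_encodings; e != NULL; e = e->next)
    for (n = e->names; *n != NULL; n++)
      if (sketch_names_equal (*n, name))
        return e;
  return NULL;
}
```

Because the list links through the structure itself, the caller's encoding object must stay allocated for as long as it is registered.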
- @node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
- @section Multibyte Text Processing Errors
- This section describes error conditions which code can signal to
- indicate problems encountered while processing multibyte text. In each
- case, the arguments @var{message} and @var{args} are an error format
- string and arguments to be substituted into the string, as accepted by
- the @code{display-error} function.
- @deffn Condition text:not-char-boundary func message args object offset
- By calling @var{func}, the program attempted to access a character at
- byte offset @var{offset} in the Guile object @var{object}, but
- @var{offset} is not the start of a character's encoding in @var{object}.
- Typically, @var{object} is a string or symbol. If the function signalling
- the error cannot find the Guile object that contains the text it is
- inspecting, it should use @code{#f} for @var{object}.
- @end deffn
- @deffn Condition text:bad-encoding func message args object
- By calling @var{func}, the program attempted to interpret the text in
- @var{object}, but @var{object} contains a byte sequence which is not a
- valid encoding for any character.
- @end deffn
- @deffn Condition text:not-guile-char func message args number
- By calling @var{func}, the program attempted to treat @var{number} as the
- number of a character in the Guile character set, but @var{number} does
- not correspond to any character in the Guile character set.
- @end deffn
- @deffn Condition text:unknown-conversion func message args from to
- By calling @var{func}, the program attempted to convert from an encoding
- named @var{from} to an encoding named @var{to}, but Guile does not
- support such a conversion.
- @end deffn
- @deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
- @deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
- @deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
- These variables hold the Scheme symbol objects whose names are the
- condition symbols above. You can use these when signalling these
- errors, instead of looking them up yourself.
- @end deftypevr
- @node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C
- @section Why Guile Does Not Use a Fixed-Width Encoding
- Multibyte encodings are clumsier to work with than encodings which use a
- fixed number of bytes for every character. For example, using a
- fixed-width encoding, we can extract the @var{i}th character of a string
- in constant time, and we can always substitute the @var{i}th character
- of a string with any other character without reallocating or copying the
- string.
- However, there are no fixed-width encodings which include the characters
- we wish to include, and also fit in a reasonable amount of space.
- Despite the Unicode standard's claims to the contrary, Unicode is not
- really a fixed-width encoding. Unicode uses surrogate pairs to
- represent characters outside the 16-bit range; a surrogate pair must be
- treated as a single character, but occupies two 16-bit spaces. As of
- this writing, there are already plans to assign characters to the
- surrogate character codes. Three- and four-byte encodings are
- too wasteful for a majority of Guile's users, who only need @sc{ascii}
- and a few accented characters.
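To make the surrogate-pair point concrete, the following sketch splits a code point above the 16-bit range into its two 16-bit units, using the standard UTF-16 arithmetic. The function name @code{sketch_to_surrogates} is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Split a code point in the range 0x10000..0x10FFFF into a UTF-16
   surrogate pair: one character, two 16-bit units.  */
void
sketch_to_surrogates (uint32_t cp, uint16_t *high, uint16_t *low)
{
  assert (cp >= 0x10000 && cp <= 0x10FFFF);
  cp -= 0x10000;                                /* 20 bits remain */
  *high = (uint16_t) (0xD800 + (cp >> 10));     /* top 10 bits */
  *low  = (uint16_t) (0xDC00 + (cp & 0x3FF));   /* bottom 10 bits */
}
```

Code that indexes such a string by 16-bit unit can therefore land in the middle of a character, which is the same hazard a multibyte encoding has.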
- Another alternative would be to have several different fixed-width
- string representations, each with a different element size. For each
- string, Guile would use the smallest element size capable of
- accommodating the string's text. This would allow users of English and
- the Western European languages to use the traditional memory-efficient
- encodings. However, if Guile has @var{n} string representations, then
- users must write @var{n} versions of any code which manipulates text
- directly --- one for each element size. And if a user wants to operate
- on two strings simultaneously, and wants to avoid testing the string
- sizes within the loop, she must make @var{n}*@var{n} copies of the loop.
- Most users will simply not bother. Instead, they will write code which
- supports only one string size, leaving us back where we started. By
- using a single internal representation, Guile makes it easier for users
- to write multilingual code.
- [[What about tagging each string with its encoding?
- "Every extension must be written to deal with every encoding"]]
- [[You don't really want to index strings anyway.]]
- Finally, Guile's multibyte encoding is not so bad. Unlike a two- or
- four-byte encoding, it is efficient in space for American and European
- users. Furthermore, the properties described above mean that many
- functions can be coded just as they would for a single-byte encoding;
- see @ref{Promised Properties of the Guile Multibyte Encoding}.
- @bye