123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244 |
- #
- # $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
- #
- CHARACTER DATA
- ==============
- This package generates some data files that contain character properties useful
- for text processing.
- CHARACTER PROPERTIES
- ====================
- The first data file is called "ctype.dat" and contains a compressed form of
- the character properties found in the Unicode Character Database (UCDB).
- Additional properties can be specified in limited UCDB format in another file
- to avoid modifying the original UCDB.
- The following is a property name and code table to be used with the character
- data:
- NAME CODE DESCRIPTION
- ---------------------
- Mn 0 Mark, Non-Spacing
- Mc 1 Mark, Spacing Combining
- Me 2 Mark, Enclosing
- Nd 3 Number, Decimal Digit
- Nl 4 Number, Letter
- No 5 Number, Other
- Zs 6 Separator, Space
- Zl 7 Separator, Line
- Zp 8 Separator, Paragraph
- Cc 9 Other, Control
- Cf 10 Other, Format
- Cs 11 Other, Surrogate
- Co 12 Other, Private Use
- Cn 13 Other, Not Assigned
- Lu 14 Letter, Uppercase
- Ll 15 Letter, Lowercase
- Lt 16 Letter, Titlecase
- Lm 17 Letter, Modifier
- Lo 18 Letter, Other
- Pc 19 Punctuation, Connector
- Pd 20 Punctuation, Dash
- Ps 21 Punctuation, Open
- Pe 22 Punctuation, Close
- Po 23 Punctuation, Other
- Sm 24 Symbol, Math
- Sc 25 Symbol, Currency
- Sk 26 Symbol, Modifier
- So 27 Symbol, Other
- L 28 Left-To-Right
- R 29 Right-To-Left
- EN 30 European Number
- ES 31 European Number Separator
- ET 32 European Number Terminator
- AN 33 Arabic Number
- CS 34 Common Number Separator
- B 35 Block Separator
- S 36 Segment Separator
- WS 37 Whitespace
- ON 38 Other Neutrals
- Pi 47 Punctuation, Initial
- Pf 48 Punctuation, Final
- #
- # Implementation specific properties.
- #
- Cm 39 Composite
- Nb 40 Non-Breaking
- Sy 41 Symmetric (characters which are part of open/close pairs)
- Hd 42 Hex Digit
- Qm 43 Quote Mark
- Mr 44 Mirroring
- Ss 45 Space, Other (controls viewed as spaces in ctype isspace())
- Cp 46 Defined character
- The actual binary data is formatted as follows:
- Assumptions: unsigned short is at least 16-bits in size and unsigned long
- is at least 32-bits in size.
- unsigned short ByteOrderMark
- unsigned short OffsetArraySize
- unsigned long Bytes
- unsigned short Offsets[OffsetArraySize + 1]
- unsigned long Ranges[N], N = value of Offsets[OffsetArraySize]
- The Bytes field provides the total byte count used for the Offsets[] and
- Ranges[] arrays. The Offsets[] array is aligned on a 4-byte boundary and
- there is always one extra node on the end to hold the final index of the
- Ranges[] array. The Ranges[] array contains pairs of 4-byte values
- representing a range of Unicode characters. The pairs are arranged in
- increasing order by the first character code in the range.
- Determining if a particular character is in the property list requires a
- simple binary search to determine if a character is in any of the ranges
- for the property.
- If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
- machine with a different endian order and the values must be byte-swapped.
- To swap a 16-bit value:
- c = (c >> 8) | ((c & 0xff) << 8)
- To swap a 32-bit value:
- c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
- (((c >> 16) & 0xff) << 8) | (c >> 24)
- CASE MAPPINGS
- =============
- The next data file is called "case.dat" and contains three case mapping tables
- in the following order: upper, lower, and title case. Each table is in
- increasing order by character code and each mapping contains 3 unsigned longs
- which represent the possible mappings.
- The format for the binary form of these tables is:
- unsigned short ByteOrderMark
- unsigned short NumMappingNodes, count of all mapping nodes
- unsigned short CaseTableSizes[2], upper and lower mapping node counts
- unsigned long CaseTables[NumMappingNodes]
- The starting indexes of the case tables are calculated as following:
- UpperIndex = 0;
- LowerIndex = CaseTableSizes[0] * 3;
- TitleIndex = LowerIndex + CaseTableSizes[1] * 3;
- The order of the fields for the three tables are:
- Upper case
- ----------
- unsigned long upper;
- unsigned long lower;
- unsigned long title;
- Lower case
- ----------
- unsigned long lower;
- unsigned long upper;
- unsigned long title;
- Title case
- ----------
- unsigned long title;
- unsigned long upper;
- unsigned long lower;
- If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
- same way as described in the CHARACTER PROPERTIES section.
- Because the tables are in increasing order by character code, locating a
- mapping requires a simple binary search on one of the 3 codes that make up
- each node.
- It is important to note that there can only be 65536 mapping nodes which
- divided into 3 portions allows 21845 nodes for each case mapping table. The
- distribution of mappings may be more or less than 21845 per table, but only
- 65536 are allowed.
- DECOMPOSITIONS
- ==============
- The next data file is called "decomp.dat" and contains the decomposition data
- for all characters with decompositions containing more than one character and
- are *not* compatibility decompositions. Compatibility decompositions are
- signaled in the UCDB format by the use of the <compat> tag in the
- decomposition field. Each list of character codes represents a full
- decomposition of a composite character. The nodes are arranged in increasing
- order by character code.
- The format for the binary form of this table is:
- unsigned short ByteOrderMark
- unsigned short NumDecompNodes, count of all decomposition nodes
- unsigned long Bytes
- unsigned long DecompNodes[(NumDecompNodes * 2) + 1]
- unsigned long Decomp[N], N = sum of all counts in DecompNodes[]
- If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
- same way as described in the CHARACTER PROPERTIES section.
- The DecompNodes[] array consists of pairs of unsigned longs, the first of
- which is the character code and the second is the initial index of the list
- of character codes representing the decomposition.
- Locating the decomposition of a composite character requires a binary search
- for a character code in the DecompNodes[] array and using its index to
- locate the start of the decomposition. The length of the decomposition list
- is the index in the following element in DecompNode[] minus the current
- index.
- COMBINING CLASSES
- =================
- The fourth data file is called "cmbcl.dat" and contains the characters with
- non-zero combining classes.
- The format for the binary form of this table is:
- unsigned short ByteOrderMark
- unsigned short NumCCLNodes
- unsigned long Bytes
- unsigned long CCLNodes[NumCCLNodes * 3]
- If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
- same way as described in the CHARACTER PROPERTIES section.
- The CCLNodes[] array consists of groups of three unsigned longs. The first
- and second are the beginning and ending of a range and the third is the
- combining class of that range.
- If a character is not found in this table, then the combining class is
- assumed to be 0.
- It is important to note that only 65536 distinct ranges plus combining class
- can be specified because the NumCCLNodes is usually a 16-bit number.
- NUMBER TABLE
- ============
- The final data file is called "num.dat" and contains the characters that have
- a numeric value associated with them.
- The format for the binary form of the table is:
- unsigned short ByteOrderMark
- unsigned short NumNumberNodes
- unsigned long Bytes
- unsigned long NumberNodes[NumNumberNodes]
- unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
- / sizeof(short)]
- If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
- same way as described in the CHARACTER PROPERTIES section.
- The NumberNodes array contains pairs of values, the first of which is the
- character code and the second an index into the ValueNodes array. The
- ValueNodes array contains pairs of integers which represent the numerator
- and denominator of the numeric value of the character. If the character
- happens to map to an integer, both the values in ValueNodes will be the
- same.
|