#
# $Id: format.txt,v 1.1 1999/01/08 00:19:20 ftang%netscape.com Exp $
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

  Assumptions: unsigned short is at least 16 bits in size and unsigned long
               is at least 32 bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

  The Bytes field provides the total byte count used for the Offsets[] and
  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
  there is always one extra node on the end to hold the final index of the
  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values, each
  pair representing a range of Unicode characters.  The pairs are arranged
  in increasing order by the first character code in the range.

  Determining whether a character has a particular property requires a
  simple binary search over the ranges listed for that property.

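  The range search described above might be sketched in C as follows.  This
  is only an illustration: the function name prop_lookup is hypothetical,
  and the sketch assumes that Offsets[p] and Offsets[p + 1] give the first
  and one-past-last indexes (counted in unsigned longs) of the pairs for
  property code p, and that any byte swapping has already been done.

```c
#include <stddef.h>

/*
 * Sketch only.  Returns 1 if `code` lies inside any range listed for
 * property `prop`, otherwise 0.
 * Assumption: offsets[prop] .. offsets[prop + 1] delimit that property's
 * (start, end) pairs in ranges[], counted in unsigned long elements.
 */
int prop_lookup(unsigned long code, unsigned short prop,
                const unsigned short *offsets,
                const unsigned long *ranges)
{
    size_t lo = offsets[prop] / 2;      /* first pair of the property */
    size_t hi = offsets[prop + 1] / 2;  /* one past the last pair */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (code < ranges[mid * 2])
            hi = mid;                   /* below this range */
        else if (code > ranges[mid * 2 + 1])
            lo = mid + 1;               /* above this range */
        else
            return 1;                   /* start <= code <= end */
    }
    return 0;
}
```

  A property with no ranges has equal neighbouring offsets under this
  assumption, so the loop body never runs and the function returns 0.
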
  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
  machine with a different endian order and the values must be byte-swapped.

  To swap a 16-bit value:

    c = ((c >> 8) & 0xff) | ((c & 0xff) << 8)

  To swap a 32-bit value:

    c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
        (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff)

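  As a small illustration, the two swaps can be wrapped in helper functions.
  The names swap16, swap32 and needs_swap are hypothetical and not part of
  the package; the extra masks only matter when unsigned short or unsigned
  long is wider than 16 or 32 bits.

```c
/* Sketch only: swap the two bytes of a 16-bit value. */
unsigned short swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/* Sketch only: reverse the four bytes of a 32-bit value. */
unsigned long swap32(unsigned long c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}

/* Returns 1 when the ByteOrderMark read from a file signals foreign order. */
int needs_swap(unsigned short bom)
{
    return bom == 0xFFFE;
}
```

  A reader would call needs_swap() on the first field and, if it returns 1,
  apply swap16() or swap32() to every later field according to its width.
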
CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case.  Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

  unsigned short ByteOrderMark
  unsigned short NumMappingNodes, count of all mapping nodes
  unsigned short CaseTableSizes[2], upper and lower mapping node counts
  unsigned long  CaseTables[NumMappingNodes]

The starting indexes of the case tables are calculated as follows:

  UpperIndex = 0;
  LowerIndex = CaseTableSizes[0] * 3;
  TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

The order of the fields for the three tables is:

  Upper case
  ----------
  unsigned long upper;
  unsigned long lower;
  unsigned long title;

  Lower case
  ----------
  unsigned long lower;
  unsigned long upper;
  unsigned long title;

  Title case
  ----------
  unsigned long title;
  unsigned long upper;
  unsigned long lower;

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

Because the tables are in increasing order by character code, locating a
mapping requires a simple binary search on one of the 3 codes that make up
each node.

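A lookup along these lines might be sketched in C as follows.  The names
case_find and to_upper are hypothetical, and the sketch assumes that the
search key is the first field of each node (for example, the lower case
code in the lower case table) and that the data is already in native byte
order.

```c
#include <stddef.h>

/*
 * Sketch only.  Binary search one case table for `code`.  `table` points
 * at the first node of the table (e.g. CaseTables + LowerIndex) and
 * `nodes` is the number of 3-element nodes in it.
 * Assumption: the first field of each node is the search key.
 */
const unsigned long *case_find(const unsigned long *table, size_t nodes,
                               unsigned long code)
{
    size_t lo = 0, hi = nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = table + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}

/*
 * Upper-case `code` through the lower case table, whose fields are
 * (lower, upper, title).  Characters without a mapping are returned as-is.
 */
unsigned long to_upper(const unsigned long *case_tables,
                       const unsigned short sizes[2], unsigned long code)
{
    size_t lower_index = (size_t)sizes[0] * 3;   /* LowerIndex */
    const unsigned long *node =
        case_find(case_tables + lower_index, sizes[1], code);

    return node ? node[1] : code;
}
```

The same case_find() call works for the upper and title case tables; only
the starting index, the node count, and the field picked from the result
change.
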
It is important to note that a 16-bit NumMappingNodes field limits the total
to 65535 mapping nodes, which divided into 3 portions allows 21845 nodes for
each case mapping table.  An individual table may hold more or fewer than
21845 nodes, but the total cannot exceed 65535.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters whose decompositions contain more than one character and
are *not* compatibility decompositions.  Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field.  Each list of character codes represents a full
decomposition of a composite character.  The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumDecompNodes, count of all decomposition nodes
  unsigned long  Bytes
  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
  unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The DecompNodes[] array consists of pairs of unsigned longs, the first of
which is the character code and the second is the initial index of the list
of character codes representing the decomposition.

Locating the decomposition of a composite character requires a binary search
for a character code in the DecompNodes[] array and using its index to
locate the start of the decomposition.  The length of the decomposition list
is the index stored in the following element of DecompNodes[] minus the
current index.

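The lookup might be sketched in C as follows.  The name decomp_find is
hypothetical.  The sketch assumes that the one extra element at the end of
DecompNodes[] holds the end index of the last decomposition list, so that
the length of the final entry can be computed the same way as the others;
that reading of the "+ 1" in the array size is an assumption, not something
this document states.

```c
#include <stddef.h>

/*
 * Sketch only.  `nodes` is DecompNodes[]: num_nodes (code, index) pairs
 * followed by one trailing element.  `decomp` is Decomp[].
 * Assumption: the trailing element is the end index of the last list.
 * On success stores the list length in *len and returns its first element;
 * otherwise stores 0 and returns NULL.
 */
const unsigned long *decomp_find(const unsigned long *nodes, size_t num_nodes,
                                 const unsigned long *decomp,
                                 unsigned long code, size_t *len)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long key = nodes[mid * 2];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            unsigned long start = nodes[mid * 2 + 1];
            /* Next node's index, or the trailing element for the last node. */
            unsigned long end = (mid + 1 < num_nodes)
                                    ? nodes[mid * 2 + 3]
                                    : nodes[num_nodes * 2];
            *len = (size_t)(end - start);
            return decomp + start;
        }
    }
    *len = 0;
    return NULL;
}
```
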
COMBINING CLASSES
=================

The fourth data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCCLNodes
  unsigned long  Bytes
  unsigned long  CCLNodes[NumCCLNodes * 3]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The CCLNodes[] array consists of groups of three unsigned longs.  The first
and second are the beginning and ending of a range and the third is the
combining class of that range.

If a character is not found in this table, then the combining class is
assumed to be 0.

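A lookup over these triples might be sketched in C as follows.  The name
combining_class is hypothetical, and the sketch assumes the ranges are
sorted in increasing order and do not overlap, which this section does not
state explicitly.

```c
#include <stddef.h>

/*
 * Sketch only.  `ccl` is CCLNodes[]: num_nodes groups of
 * (range start, range end, combining class).
 * Assumption: the groups are sorted by range start and do not overlap.
 * Returns the combining class of `code`, or 0 when no range contains it.
 */
unsigned long combining_class(const unsigned long *ccl, size_t num_nodes,
                              unsigned long code)
{
    size_t lo = 0, hi = num_nodes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned long *node = ccl + mid * 3;

        if (code < node[0])
            hi = mid;
        else if (code > node[1])
            lo = mid + 1;
        else
            return node[2];
    }
    return 0;   /* not listed: combining class 0 */
}
```
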
It is important to note that at most 65535 distinct range and combining class
entries can be specified because NumCCLNodes is usually a 16-bit number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

  unsigned short ByteOrderMark
  unsigned short NumNumberNodes
  unsigned long  Bytes
  unsigned long  NumberNodes[NumNumberNodes]
  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                            / sizeof(short)]

If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
same way as described in the CHARACTER PROPERTIES section.

The NumberNodes[] array contains pairs of values, the first of which is the
character code and the second an index into the ValueNodes[] array.  The
ValueNodes[] array contains pairs of integers which represent the numerator
and denominator of the numeric value of the character.  If the character
happens to map to an integer, both values in its ValueNodes[] pair will be
the same.
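
A lookup in this table might be sketched in C as follows.  The name
number_lookup is hypothetical.  The sketch assumes that NumNumberNodes
counts the unsigned longs in NumberNodes[] (two per character), that the
pairs are sorted by character code, and that the stored index counts
elements of ValueNodes[] rather than bytes.

```c
#include <stddef.h>

/*
 * Sketch only.  `nodes` is NumberNodes[] holding num_longs elements, i.e.
 * num_longs / 2 (character code, value index) pairs.
 * Assumptions: pairs are sorted by code; the index counts ValueNodes[]
 * elements.  On success stores the pair found in ValueNodes[] and returns
 * 1; returns 0 when `code` has no numeric value.
 */
int number_lookup(const unsigned long *nodes, size_t num_longs,
                  const unsigned short *values, unsigned long code,
                  unsigned short *numerator, unsigned short *denominator)
{
    size_t lo = 0, hi = num_longs / 2;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        unsigned long key = nodes[mid * 2];

        if (code < key) {
            hi = mid;
        } else if (code > key) {
            lo = mid + 1;
        } else {
            const unsigned short *v = values + nodes[mid * 2 + 1];
            *numerator = v[0];
            *denominator = v[1];
            return 1;
        }
    }
    return 0;
}
```

A caller can detect the integer case described above by checking whether
the two returned values are equal.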