README.nonstandard 4.1 KB

Non-standard hyphenation
------------------------

Some languages use non-standard hyphenation, that is, `discretionary'
character changes at hyphenation points. For example,

Catalan: paral·lel -> paral-lel,
Dutch: omaatje -> oma-tje,
German (before the new orthography): Schiffahrt -> Schiff-fahrt,
Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrences!),
Swedish: tillata -> till-lata.

Using this extended library, you can define
non-standard hyphenation patterns. For example:

l·1l/l=l
a1atje./a=t,1,3
.schif1fahrt/ff=f,5,2
.as3szon/sz=sz,2,3
n1nyal./ny=ny,1,3
.til1lata./ll=l,3,2

or with narrow boundaries:

l·1l/l=,1,2
a1atje./a=,1,1
.schif1fahrt/ff=,5,1
.as3szon/sz=,2,1
n1nyal./ny=,1,1
.til1lata./ll=,3,1
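
The wide and the narrow form of a pattern are meant to produce the same
hyphenated word. The following Python sketch illustrates this for some of
the examples above. It is not part of the library: apply_change() and the
match offsets are assumptions made for illustration; only the change,
start and cut values are taken from the pattern examples.

```python
def apply_change(word, offset, change, start, cut):
    """Replace `cut` characters of `word` with `change`.

    offset: 0-based index in `word` where the first pattern letter matched
            (hypothetical input; the real library finds this itself)
    start:  1-based position inside the pattern, as in the pattern file
    cut:    number of characters removed from the original word
    """
    begin = offset + start - 1
    return word[:begin] + change + word[begin + cut:]


# (word, match offset, wide form, narrow form); a form is (change, start, cut)
EXAMPLES = [
    ("paral\u00b7lel", 4, ("l=l", 1, 3), ("l=", 1, 2)),      # l·1l
    ("schiffahrt", 0, ("ff=f", 5, 2), ("ff=", 5, 1)),        # .schif1fahrt
    ("asszonnyal", 0, ("sz=sz", 2, 3), ("sz=", 2, 1)),       # .as3szon
    ("asszonnyal", 5, ("ny=ny", 1, 3), ("ny=", 1, 1)),       # n1nyal.
    ("tillata", 0, ("ll=l", 3, 2), ("ll=", 3, 1)),           # .til1lata.
]

for word, offset, wide, narrow in EXAMPLES:
    print(word, "->", apply_change(word, offset, *wide),
          "/", apply_change(word, offset, *narrow))
```

In each case both forms insert the same text at the hyphenation point; the
narrow form only touches the characters before the `=' sign.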

Note: Libhnj uses patterns preprocessed by substrings.pl.
Unfortunately, this conversion step (non-standard -> standard pattern
conversion) can currently generate bad non-standard patterns, so using
narrow boundaries may be better for recent Libhnj. For example,
substrings.pl generates a few bad patterns from the Hungarian hyphenation
patterns, resulting in bad non-standard hyphenation in a few cases. Using
narrow boundaries solves this problem. The Java HyFo module can check for
this problem.

Syntax of the non-standard hyphenation patterns
-----------------------------------------------

pat1tern/change[,start,cut]

If this pattern matches the word, and this pattern wins (see README.hyphen)
in the change region of the pattern, then the pattern[start, start + cut - 1]
substring will be replaced with the "change".

For example, a German ff -> ff-f hyphenation:

f1f/ff=f

or with expansion

f1f/ff=f,1,2

will replace every "ff" with "ff=f" at hyphenation.

A more realistic example:

% simple ff -> f-f hyphenation
f1f
% Schiffahrt -> Schiff-fahrt hyphenation
%
schif3fahrt/ff=f,5,2
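
The syntax above can be read with a few lines of code. The following
Python sketch is not the parser used by the library: the names
NonStandardPattern and parse_pattern() are hypothetical, and it assumes
that an omitted start,cut pair means "the whole pattern" (start 1, cut
equal to the number of pattern letters), as the f1f/ff=f example suggests.

```python
from dataclasses import dataclass


@dataclass
class NonStandardPattern:
    letters: str  # pattern without digits and boundary dots
    change: str   # replacement text, `=' marks the hyphenation point
    start: int    # 1-based start of the change region
    cut: int      # number of characters removed


def parse_pattern(line):
    """Return a NonStandardPattern, or None for comments and standard patterns."""
    line = line.strip()
    if not line or line.startswith("%"):
        return None
    pattern, slash, rest = line.partition("/")
    if not slash:
        return None  # standard pattern, e.g. "f1f"
    letters = "".join(ch for ch in pattern if not ch.isdigit()).strip(".")
    # Split ",start,cut" from the right, so the change itself may contain commas.
    fields = rest.rsplit(",", 2)
    if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
        return NonStandardPattern(letters, fields[0], int(fields[1]), int(fields[2]))
    # Assumption: without start,cut the change covers the whole pattern.
    return NonStandardPattern(letters, rest, 1, len(letters))


for line in ["% simple ff -> f-f hyphenation", "f1f", "f1f/ff=f",
             "schif3fahrt/ff=f,5,2"]:
    print(repr(line), "->", parse_pattern(line))
```

Under this assumption, f1f/ff=f and f1f/ff=f,1,2 parse to the same value.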

Specification

- Pattern: matching patterns of the original Liang's algorithm
  - patterns must contain only one hyphenation point in the change region,
    marked with a one-digit odd number (1, 3, 5, 7 or 9).
    This point may be at the subregion boundaries: schif3fahrt/ff=,5,1
  - only the greater value guarantees the win (don't mix standard and
    non-standard patterns with the same value; for example,
    instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2)

- Change: new characters.
  Arbitrary character sequence. The equals sign (=) marks hyphenation points
  for OpenOffice.org (as in the example). (In a possible German LaTeX
  preprocessor, ff could be replaced with "ff, and in a Hungarian one, ssz
  with `ssz, according to the German and Hungarian Babel settings.)

- Start: starting position of the change region.
  - begins with 1 (not 0): schif3fahrt/ff=f,5,2
  - start dot doesn't matter: .schif3fahrt/ff=f,5,2
  - numbers don't matter: .s2c2h2i2f3f2ahrt/ff=f,5,2
  - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3
    ("össze" looks like "Ã¶ssze" in an ISO 8859-1 8-bit editor).

- Cut: length of the removed character sequence in the original word.
  - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3
    ("paral·lel" looks like "paralÂ·lel" in an ISO 8859-1 8-bit editor).
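
The rule about the single odd hyphenation point can be checked
mechanically. The following Python sketch is only one reading of the
specification above, not a check performed by the library: the function
names are hypothetical, and it assumes that "at the subregion boundaries"
covers the gap just before the first and just after the last character of
the change region.

```python
def digit_gaps(pattern):
    """Map gap index -> digit. Gap g lies after the g-th letter (0 = before the first)."""
    gaps, letters = {}, 0
    for ch in pattern.strip("."):   # boundary dots do not count as positions
        if ch.isdigit():
            gaps[letters] = int(ch)
        else:
            letters += 1
    return gaps


def has_single_odd_point(pattern, start, cut):
    """True if exactly one odd digit falls in or on the edge of the change region."""
    region = range(start - 1, start + cut)  # gaps start-1 .. start+cut-1
    odd = [g for g, d in digit_gaps(pattern).items()
           if g in region and d % 2 == 1]
    return len(odd) == 1


for pattern, start, cut in [("schif3fahrt", 5, 2), ("schif3fahrt", 5, 1),
                            (".s2c2h2i2f3f2ahrt", 5, 2), ("schif3f1ahrt", 5, 2)]:
    print(pattern, start, cut, has_single_odd_point(pattern, start, cut))
```

The last pattern in the loop is a made-up counterexample with two odd
digits in the change region.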

Dictionary development
----------------------

There is no extended PatGen pattern generator for non-standard
hyphenation patterns yet.

Fortunately, non-standard hyphenation points are forbidden in the PatGen
generated hyphenation patterns, so with a little patch, non-standard
hyphenation patterns can also be developed in this case.

Warning: If you use UTF-8 Unicode encoding in your patterns, call
substrings.pl with the UTF-8 parameter to calculate the right
character positions for non-standard hyphenation:

./substrings.pl input output UTF-8
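
Why the encoding matters: in UTF-8, non-ASCII letters occupy more than one
byte, so character positions and byte positions drift apart. The following
Python sketch only illustrates the difference; it does not reproduce what
substrings.pl does internally, and the helper names are hypothetical.

```python
def char_to_byte_span(word, start, cut):
    """Convert a 1-based (start, cut) in characters to a 1-based span in UTF-8 bytes."""
    before = word[:start - 1].encode("utf-8")
    region = word[start - 1:start - 1 + cut].encode("utf-8")
    return len(before) + 1, len(region)


def as_latin1_editor(word):
    """Show how UTF-8 text looks when its bytes are read as ISO 8859-1."""
    return word.encode("utf-8").decode("latin-1")


OSSZE = "\u00f6ssze"         # "össze"
PARALLEL = "paral\u00b7lel"  # "paral·lel"

print(as_latin1_editor(OSSZE), char_to_byte_span(OSSZE, 2, 3))
print(as_latin1_editor(PARALLEL), char_to_byte_span(PARALLEL, 5, 3))
```

For össze/sz=sz,2,3 the character span (2, 3) corresponds to the byte span
(3, 3); for the Catalan example the character span (5, 3) corresponds to
the byte span (5, 4).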

Programming
-----------

Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
See hyphen.h for the documentation of the hyphenate*() functions.
See example.c for processing the output of the hyphenate*() functions.

Warning: change characters are lowercase in the source, so you may need
case conversion of the change characters based on input word case detection.
For example, see the OpenOffice.org source
(lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).
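
One possible form of such a case conversion is sketched below in Python.
This is an assumption for illustration, not the logic of hyphenimp.cxx:
the function adapt_change_case() and its simple rules (all-uppercase word,
or capitalized word with the change at the very beginning) are
hypothetical.

```python
def adapt_change_case(word, change, begin=0):
    """Adjust the lowercase `change` to the case of the input `word`.

    begin: 0-based index in `word` where the change region starts
    """
    if len(word) > 1 and word.isupper():
        # All-uppercase input: uppercase the whole replacement.
        return change.upper()
    if begin == 0 and word[:1].isupper():
        # Capitalized input with the change at the first letter.
        return change[:1].upper() + change[1:]
    return change


print(adapt_change_case("SCHIFFAHRT", "ff=f", begin=4))
print(adapt_change_case("Schiffahrt", "ff=f", begin=4))
```

Note that simple upper() conversion is not enough for every language
(some letters change length when uppercased), so real code may need
locale-aware handling.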

László Németh
<nemeth (at) openoffice.org>