- Non-standard hyphenation
- ------------------------
- Some languages use non-standard hyphenation; `discretionary'
- character changes at hyphenation points. For example,
- Catalan: paral·lel -> paral-lel,
- Dutch: omaatje -> oma-tje,
- German (before the new orthography): Schiffahrt -> Schiff-fahrt,
- Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrences!)
- Swedish: tillata -> till-lata.
- Using this extended library, you can define
- non-standard hyphenation patterns. For example:
- l·1l/l=l
- a1atje./a=t,1,3
- .schif1fahrt/ff=f,5,2
- .as3szon/sz=sz,2,3
- n1nyal./ny=ny,1,3
- .til1lata./ll=l,3,2
- or with narrow boundaries:
- l·1l/l=,1,2
- a1atje./a=,1,2
- .schif1fahrt/ff=,5,1
- .as3szon/sz=,2,1
- n1nyal./ny=,1,1
- .til1lata./ll=,3,1
- Note: Libhnj uses patterns preprocessed by substrings.pl.
- Unfortunately, this conversion step (non-standard -> standard pattern
- conversion) can currently generate bad non-standard patterns, so using
- narrow boundaries may be better for recent Libhnj. For example,
- substrings.pl generates a few bad patterns from the Hungarian hyphenation
- patterns, resulting in bad non-standard hyphenation in a few cases. Using
- narrow boundaries solves this problem. The Java HyFo module can check for it.
- Syntax of the non-standard hyphenation patterns
- ------------------------------------------------
- pat1tern/change[,start,cut]
- If this pattern matches the word, and this pattern wins (see README.hyphen)
- in the change region of the pattern, then the pattern[start, start + cut - 1]
- substring will be replaced with the "change".
- For example, a German ff -> ff-f hyphenation:
- f1f/ff=f
- or with expansion
- f1f/ff=f,1,2
- will replace every "ff" with "ff=f" at a hyphenation point.
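
The syntax above can be sketched in a few lines of Python. This is only an illustration, not the Libhnj implementation: the helper names parse_pattern() and apply_change() are hypothetical, the pattern is located with a plain substring search instead of Liang's algorithm, the change is assumed to contain no commas, and an omitted start,cut pair is assumed to mean 1 and the full pattern length, as the f1f/ff=f example suggests.

```python
import re


def parse_pattern(line):
    """Split 'pat1tern/change[,start,cut]' into (letters, change, start, cut)."""
    pattern, _, rest = line.partition("/")
    # Digits and boundary dots are not counted in character positions.
    letters = re.sub(r"[0-9.]", "", pattern)
    if not rest:
        return letters, None, None, None  # standard pattern, nothing to change
    fields = rest.split(",")
    if len(fields) == 3:
        start, cut = int(fields[1]), int(fields[2])
    else:
        # Assumption: the default region covers the whole pattern.
        start, cut = 1, len(letters)
    return letters, fields[0], start, cut


def apply_change(word, line):
    """Replace pattern[start, start + cut - 1] inside word with the change."""
    letters, change, start, cut = parse_pattern(line)
    at = word.find(letters)
    if change is None or at < 0:
        return word
    begin = at + start - 1  # start is 1-based
    return word[:begin] + change + word[begin + cut:]


for word, line in [
    ("schiffahrt", ".schif1fahrt/ff=f,5,2"),  # wide boundaries
    ("schiffahrt", ".schif1fahrt/ff=,5,1"),   # narrow boundaries
    ("paral·lel", "l·1l/l=l"),                # default start and cut
]:
    print(word, "->", apply_change(word, line))
```

The wide and narrow forms of the Schiffahrt pattern put the "=" at the same place; they differ only in how much of the word they rewrite.
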
- A more realistic example:
- % simple ff -> f-f hyphenation
- f1f
- % Schiffahrt -> Schiff-fahrt hyphenation
- %
- schif3fahrt/ff=f,5,2
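
To see why the longer pattern takes effect for Schiffahrt in this example, the competition between patterns can be sketched as follows. This is a simplified reading of Liang's algorithm in Python, not the Libhnj code: split_pattern() and winners() are hypothetical names, and the sketch assumes that a later pattern replaces an earlier one only when its value is strictly greater.

```python
def split_pattern(pattern):
    """Return the letters of a pattern and the value in each gap around them."""
    letters, values = [], [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)  # value in the gap before the next letter
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


def winners(word, lines):
    """Map each hyphenation point (characters before it) to the winning line."""
    dotted = "." + word + "."
    best = [0] * (len(dotted) + 1)      # best[i]: value in the gap before dotted[i]
    owner = [None] * (len(dotted) + 1)  # owner[i]: line that supplied best[i]
    for line in lines:
        letters, values = split_pattern(line.partition("/")[0])
        at = dotted.find(letters)
        while at >= 0:
            for k, value in enumerate(values):
                if value > best[at + k]:  # assumption: ties keep the earlier line
                    best[at + k], owner[at + k] = value, line
            at = dotted.find(letters, at + 1)
    # Odd values allow hyphenation; subtract 1 for the leading dot.
    return {i - 1: owner[i] for i, value in enumerate(best) if value % 2 == 1}


patterns = ["f1f", "schif3fahrt/ff=f,5,2"]
print(winners("schiffahrt", patterns))  # the non-standard pattern wins
print(winners("affe", patterns))        # only the simple f1f applies
```

With equal values, as in f3f and schif3fahrt/ff=f,5,2, this sketch keeps whichever line was seen first, which is why the non-standard pattern should carry the greater value.
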
- Specification
- - Pattern: matching patterns of the original Liang's algorithm
- - patterns must contain only one hyphenation point in the change region,
- marked with a one-digit odd number (1, 3, 5, 7 or 9).
- This point may be at a subregion boundary: schif3fahrt/ff=,5,1
- - only the greater value guarantees the win (don't mix standard and
- non-standard patterns with the same value; for example,
- instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2)
- - Change: new characters.
- Arbitrary character sequence. The equal sign (=) marks hyphenation points
- for OpenOffice.org (as in the example). (In a possible German LaTeX
- preprocessor, ff could be replaced with "ff, and in a Hungarian one, ssz
- with `ssz, according to the German and Hungarian Babel settings.)
- - Start: starting position of the change region.
- - begins with 1 (not 0): schif3fahrt/ff=f,5,2
- - start dot doesn't matter: .schif3fahrt/ff=f,5,2
- - numbers don't matter: .s2c2h2i2f3f2ahrt/ff=f,5,2
- - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3
- ("össze" looks like "Ã¶ssze" in an ISO 8859-1 8-bit editor).
- - Cut: length of the removed character sequence in the original word.
- - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3
- ("paral·lel" looks like "paralÂ·lel" in an ISO 8859-1 8-bit editor).
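
The difference between character and byte positions can be checked directly. The snippet below is only an illustration in Python, with variable names made up for it: it applies össze/sz=sz,2,3 by character position and shows where the same region begins in the UTF-8 bytes.

```python
word = "össze"
change, start, cut = "sz=sz", 2, 3  # from össze/sz=sz,2,3

# Character positions: "ö" counts as one character, so the region is "ssz".
begin = start - 1
region = word[begin:begin + cut]
result = word[:begin] + change + word[begin + cut:]

# Byte positions: "ö" takes two bytes in UTF-8, so the region starts one later.
byte_start = len(word[:begin].encode("utf-8")) + 1

# What an ISO 8859-1 editor shows for the UTF-8 bytes.
seen_in_latin1 = word.encode("utf-8").decode("iso-8859-1")

print(region, result, byte_start, seen_in_latin1)
```
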
- Dictionary developing
- ---------------------
- There is no extended PatGen pattern generator for non-standard
- hyphenation patterns yet.
- Fortunately, non-standard hyphenation points are forbidden in
- PatGen-generated hyphenation patterns, so non-standard hyphenation
- patterns can also be developed in this case with a small patch.
- Warning: If you use UTF-8 Unicode encoding in your patterns, call
- substrings.pl with the UTF-8 parameter to calculate the right
- character positions for non-standard hyphenation:
- ./substrings.pl input output UTF-8
- Programming
- -----------
- Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
- See hyphen.h for the documentation of the hyphenate*() functions.
- See example.c for processing the output of the hyphenate*() functions.
- Warning: change characters are lowercase in the pattern source, so you may
- need to convert their case based on the detected case of the input word.
- For example, see OpenOffice.org source
- (lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).
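
A minimal sketch of such case handling, written in Python for brevity. It is not the OpenOffice.org implementation: match_case() is a hypothetical helper that only distinguishes all-uppercase words from words with an initial capital, and start is taken here as the 1-based position of the change region in the word.

```python
def match_case(word, change, start):
    """Adapt a lowercase change string to the case of the input word."""
    if len(word) > 1 and word.isupper():
        return change.upper()  # whole word is uppercase
    if start == 1 and word[:1].isupper():
        # The change region begins at the capitalized first letter.
        return change[:1].upper() + change[1:]
    return change


print(match_case("SCHIFFAHRT", "ff=f", 5))
print(match_case("Schiffahrt", "ff=f", 5))
```
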
- László Németh
- <nemeth (at) openoffice.org>