hyphenation char might not be a dash¶
Hyphenation is splitting a long word across lines in a correct place. The split is shown by a hyphenation char at the end of the first part:
kolen => ko= len
(The = is used deliberately as the hyphenation char here. In practice it is normally shown as a dash or, graphically, a shortened dash, but the two are not the same character.)
hyphenation and meaning¶
The point where a word is split can be relevant for the interpretation of the word:
tafeltennisartiest (table tennis artist) is better split as tafeltennis=artiest (table tennis) (artist) than as tafel=tennisartiest (table) (tennis artist). The first is an artist in table tennis, the second a tennis artist on a table.
However relevant this is for the intended meaning of the word, it is of little use when splitting words over lines: break points like these are so far apart that the layout will dictate which of the two is used.
In compound words, hyphenating on a word boundary is better than hyphenating within a component.
So when two hyphenation points are both usable, the one on the word boundary is best.
Hyphenation is fundamentally ambiguous, because the same letters can result from affixing different words. Examples: dor => dor=ste and dorsen => dors=te; gei=tje and geit=je; ballet=je and balle=tje.
ambiguity in word boundaries¶
Since the compounding s can be added more or less freely, some words are hard to hyphenate.
Geweldstelling can have two meanings: geweld+s=telling (counting violence) and geweld=stelling (declaring violence).
Generally, words ending in -ings can be broken after the s. There is one very important exception: belasting generally has no following s, so it has to be broken after the ing. (Detection patterns will have to be quite long!)
Apart from the compounding s, there are ambiguous compounds, of which valk+uil / val+kuil is the best known.
When hyphenated, words can change their spelling:
aa-bbb => aa=bbb : the - is replaced by the hyphenation char
autootje => auto=tje : one o is dropped to show the correct base form
sms'en => sms=en
To encode all possible hyphenations of a word in one string, we invented this notation:
ge=weld@s#au=to[o|=]tjes
= means a hyphenation position
[a|b] is a hyphenation, replacing a with b when hyphenating
@ means word boundary
# means concatenation boundary
Leading to the possible hyphenations:
ge=weldsautootjes
gewelds=autootjes
geweldsau=tootjes
geweldsauto=tjes
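As a rough illustration, the notation can be expanded mechanically. The Python sketch below is not part of any existing tool, and the function name expand is made up for this page. It assumes, judging from the example above, that = and # both allow a break, while @ only marks a boundary and does not allow one.

```python
import re

# One token is either [a|b], a single marker (= @ #), or a run of plain letters.
TOKEN = re.compile(r"\[([^|\]]*)\|([^\]]*)\]|([=@#])|([^=@#\[\]]+)")

def expand(notation):
    """Return (plain word, list of single-break hyphenations)."""
    # Each piece is (text when not broken here, text when broken here or None).
    pieces = []
    for match in TOKEN.finditer(notation):
        unbroken, broken, marker, letters = match.groups()
        if letters is not None:
            pieces.append((letters, None))
        elif marker is not None:
            # Assumption: '=' and '#' allow a break, '@' does not.
            pieces.append(("", "=" if marker in "=#" else None))
        else:
            # [a|b]: write a normally, b when the word is broken here.
            pieces.append((unbroken, broken))
    plain = "".join(text for text, _ in pieces)
    hyphenations = []
    for i, (_, broken) in enumerate(pieces):
        if broken is None:
            continue
        hyphenations.append("".join(
            broken if j == i else text for j, (text, _) in enumerate(pieces)
        ))
    return plain, hyphenations

word, options = expand("ge=weld@s#au=to[o|=]tjes")
print(word)     # geweldsautootjes
print(options)  # ['ge=weldsautootjes', 'gewelds=autootjes', 'geweldsau=tootjes', 'geweldsauto=tjes']
```

Each result contains exactly one break, which matches how a line break is used in practice: a word is only split once per line end.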
It is generally better not to hyphenate than to hyphenate wrongly.
All words in the approved words list should hyphenate only correctly (not necessarily in all hyphenation positions), possibly with the exception of proper names.
Hyphenating wrongly in ambiguous situations could be prevented by asking the user: present both valk=uil and val=kuil and let the user choose the best hyphenation for the case at hand.
Correct Dutch hyphenation is a challenge because of compounding and the natural ambiguity of words.
In this wiki page I want to document how I am trying to create better hyphenation results for Dutch using existing software.
The first requirement is a word list for which hyphenation should be supported. I decided to take the current OpenTaal word list as the baseline.
It helps to remove proper names from this list initially, since they tend to contain irregular hyphenations. In fact, I removed all capitalized words.
Luckily, hyphenation now (2.6) supports some decompounding. So the first step is to identify word borders and make reliable and safe splitting patterns for them.
Strictly speaking, this is not decompounding but word-level splitting. Decompounding would require both parts to be checked after splitting, whereas this is just splitting by pattern.
(For Dutch, since 'fiets' is a correct splitter, the pattern fiets1 will split fietsenrek into fiets=enrek, where enrek is not a word. To prevent this, an exception fiets2en1 has to be added. This is true for a lot of compounds.)
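To make the interplay of fiets1 and fiets2en1 concrete, here is a minimal Python sketch of how Liang-style patterns combine: digits sit between letters, the highest digit wins at each position, and an odd result allows a break. The function names are made up for this page. The sketch ignores all minimum-length settings, so it is an illustration of the principle rather than a reimplementation of the hyphenation library.

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as 'fiets2en1' into its letters and digit values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)  # values[k] sits before letter k
    position = 0
    for char in pattern:
        if char.isdigit():
            values[position] = int(char)
        else:
            position += 1
    return letters, values

def hyphenate(word, patterns):
    """Apply the patterns to one word and mark allowed breaks with '='."""
    dotted = "." + word + "."          # dots mark the word edges
    scores = [0] * (len(dotted) + 1)   # scores[i] sits before dotted[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = dotted.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                # The highest value at a position wins.
                scores[start + offset] = max(scores[start + offset], value)
            start = dotted.find(letters, start + 1)
    result = []
    for i, char in enumerate(word):
        # Odd values allow a break, even values forbid it.
        if i > 0 and scores[i + 1] % 2 == 1:
            result.append("=")
        result.append(char)
    return "".join(result)

print(hyphenate("fietsenrek", ["fiets1"]))               # fiets=enrek
print(hyphenate("fietsenrek", ["fiets1", "fiets2en1"]))  # fietsen=rek
```

The 2 in fiets2en1 overrules the 1 of fiets1 after the s, and its own 1 moves the break to after fietsen.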
From a list containing only word boundaries, patgen might be used to generate patterns. However, pattern length is limited, so belasting is not treated well. So I chose to stick to manual control: more work, but more understandable results, I hope.
Example: lots of compounds split after 'ings', 'heids' and 'eits', with some exceptions, like 'belasting', which does not have a compounding s.
To prevent splits at non-word boundaries (which result in mistakes), a minimum of 4 characters has been chosen.
Generally, splitting works best when nouns are used to split at the start, since noun affixes are then not in the way.
Nouns derived from verbs and adjectives tend to contain prefixes, and require exceptions at the start.
Complete plural nouns can be used to split at both ends (1verzekeringen1).
ISO8859-1
COMPOUNDLEFTHYPHENMIN 4
COMPOUNDRIGHTHYPHENMIN 4
% 1st (compound) level
heids1
c2h
eits1
ings1
1belasting1
1verzekering
(etc)
% 2nd (non compound) level
NEXTLEVEL
Generally, it is best to place a breakpoint in front of a common part when it is a noun that can take affixes within compounds: 1belasting .
For parts that take a compounding s, putting a breakpoint after the s is productive.
Problems arise with (be/ver)werking, where the 3 forms behave differently.
This is the step I am at right now. The rest is a best guess at how to proceed:
To test this, take all words of at least 8 characters and hyphenate them. Then use the generated hyphenation (word boundary) to split the words. Test whether all of the resulting word parts are actually correct words.
Check the longest remaining words to see whether there is a useful pattern to split them.
Repeat this process until satisfied with the results.
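The test loop could look roughly like the Python sketch below. It is only a guess at a workable setup: check_splits and is_known are made-up names, the hyphenate argument stands for whatever produces the compound-level result with = as the break marker, and accepting a non-final part that ends in a compounding s is my own assumption.

```python
def is_known(part, known, is_last):
    """A part is fine if it is a listed word, optionally followed by a compounding s."""
    if part in known:
        return True
    # Assumption: only non-final parts may carry a compounding s.
    return not is_last and part.endswith("s") and part[:-1] in known

def check_splits(words, hyphenate, min_length=8):
    """Return {word: [unknown parts]} for every word that is split into a non-word."""
    known = set(words)
    failures = {}
    for word in words:
        if len(word) < min_length:
            continue
        parts = hyphenate(word).split("=")
        if len(parts) == 1:
            continue  # not split at all, nothing to check
        unknown = [
            part for i, part in enumerate(parts)
            if not is_known(part, known, i == len(parts) - 1)
        ]
        if unknown:
            failures[word] = unknown
    return failures

# Toy data standing in for the real word list and the real hyphenator.
words = ["fiets", "fietsen", "rek", "fietsenrek",
         "verzekering", "agent", "verzekeringsagent"]
toy_results = {"fietsenrek": "fiets=enrek",
               "verzekeringsagent": "verzekerings=agent"}

def toy_hyphenate(word):
    return toy_results.get(word, word)

print(check_splits(words, toy_hyphenate))  # {'fietsenrek': ['enrek']}
```

Every word reported this way points at a pattern that needs an exception, such as fiets2en1 for the toy result shown here.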
Words of 7 characters or fewer are not hyphenated by the compound level. They are handled on this level.
(Which is the best option: pure pattern generation, or listing all parts, like .aan=een. ? This has to be tested. Listing parts is the most readable.)
Compound parts that need special hyphenation are best added as a complete compounding part:
When all more or less regular words have been added, it is time to add the strange ones, like incorrectly hyphenated proper names.
(The section below should get its own page.)
|open standard||words in a UTF-8 text file, one word per line|
|reference software||Hunspell (command-line, dynamic library with API, web service with API)|
|open standard||format documented in PDF, words and affixes in UTF-8 text files|
|particulars||supports compounds, words may not contain spaces|
|reference software||LanguageTool (GUI in Java, command-line, dynamic library with API, web service with API)|
|open standard||format defined in XSD, rules and patterns with regex in UTF-8 XML files|
|particulars||supports false friends, words may contain spaces|
|reference software||"PatGen" (command-line) and libhyphen (dynamic library with API)|
|open standard||Hyphenation Definitions, format defined in EBNF, data in UTF-8 text files and processable with regex|
|particulars||supports special cases for Dutch|
|open standard||Lexical Markup Framework, format defined in DTD but also in RDF/OWL, data in UTF-8 XML files|
|open standard||Part-of-Speech (POS) tagging|