Hyphenation

The hyphenation char might not be a dash

Hyphenation is splitting a long word across lines at a correct place. The split is shown by a hyphenation char at the end of the first part:

 kolen =>
 ko=
 len

(The = is used deliberately as the hyphenation char here; in practice it is normally a dash, or graphically a shortened dash. The two are nevertheless different characters.)

hyphenation and meaning

The point where the word is split can be relevant for the interpretation of the word:

tafeltennisartiest (table tennis artist) is better off being split as tafeltennis=artiest (table tennis) (artist) than as tafel=tennisartiest (table) (tennis artist). The first is an artist in table tennis, the second a tennis artist on a table.

Although this matters for the intended meaning of the word, it is hardly useful when splitting words over lines: break points like these are so far apart that the layout will dictate which of the two is used.

word boundaries

In compound words, hyphenating on a word boundary is better than hyphenating within one of the parts.
So when 2 hyphenation points are both usable, the one on the word boundary is best.

Ambiguity

Hyphenation is fundamentally ambiguous, because the same letters can result from affixing different words. Examples: dor => dor=ste and dorsen => dors=te; gei=tje and geit=je; ballet=je and balle=tje.

ambiguity in word boundaries

Since the combining s can be added more or less freely in compounding, problems arise when hyphenating some words.
Geweldstelling can have 2 meanings: geweld+s=telling (counting violence) and geweld=stelling (declaring violence).

Generally, compound parts ending in -ings can be broken after the s. There is one very important exception to this: belasting generally has no following s, so it has to be broken after the ing. (Detection patterns will have to be quite long!)

Apart from the concatenating s, there are ambiguous compounds, of which valk+uil/val+kuil is the best known.

special hyphenations

When hyphenated, words can change their spelling:
aa-bbb => aa=bbb : the - is replaced by the hyphenation char
autootje => auto=tje : one o is dropped to show the correct base form
sms'en => sms=en

hyphenation notation

To encode all possible hyphenations of a word in one string, we invented this notation:

 ge=weld@s#au=to[o|=]tjes
 = means a hyphenation position
 [a|b] is a hyphenation, replacing a with b when hyphenating
 @ means word boundary
 # means concatenation boundary

 Leading to the possible hyphenations:

 ge=weldsautootjes
 gewelds=autootjes
 geweldsau=tootjes
 geweldsauto=tjes
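
The expansion above can be sketched in code. This is a minimal illustration, not an existing tool: the function name `expand` is hypothetical, and the reading of the markers is inferred from the example only (a break is allowed at `=`, at `#` and at `[a|b]`, while `@` marks a boundary without allowing a break).

```python
import re

# One token is either [a|b], a single marker (=, @, #) or a run of letters.
TOKEN = re.compile(r"\[([^|\]]*)\|([^\]]*)\]|([=@#])|([^=@#\[\]]+)")

def expand(notation):
    """Return every single-break hyphenation encoded in `notation`."""
    # Each part is (text when not broken here, text when broken here).
    # The second item is None when no break is allowed at this part.
    parts = []
    for m in TOKEN.finditer(notation):
        if m.group(4) is not None:        # plain letters
            parts.append((m.group(4), None))
        elif m.group(3) in ("=", "#"):    # assumption: # allows a break
            parts.append(("", "="))
        elif m.group(3) == "@":           # assumption: @ allows no break
            parts.append(("", None))
        else:                             # [a|b]: a normally, b when broken
            parts.append((m.group(1), m.group(2)))

    results = []
    for i, (_, broken) in enumerate(parts):
        if broken is None:
            continue
        results.append("".join(
            b if j == i else p for j, (p, b) in enumerate(parts)
        ))
    return results

for form in expand("ge=weld@s#au=to[o|=]tjes"):
    print(form)
```

Under these assumptions the loop prints the four forms listed above, in the same order.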

Preventing errors

It is generally better not to hyphenate than to hyphenate wrongly.

All words in the approved words list should hyphenate only correctly (not necessarily in all hyphenation positions), possibly with the exception of proper names.

Manual intervention

Hyphenating wrongly in ambiguous situations could be prevented by asking the user to intervene: present both valk=uil and val=kuil and let the user choose the best hyphenation for this case.
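
Finding the candidates to present can be sketched as follows. This is only an illustration under simple assumptions: the function name `two_part_splits` and the tiny word set are hypothetical, only splits into two known words are considered, and the concatenating s is ignored.

```python
def two_part_splits(word, known_words):
    """Return every split of `word` into two parts that are both known words."""
    candidates = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in known_words and right in known_words:
            candidates.append(left + "=" + right)
    return candidates

# Toy word list; a real run would use the full approved words list.
known = {"valk", "uil", "val", "kuil"}
options = two_part_splits("valkuil", known)
if len(options) > 1:
    # More than one reading: leave the choice to the user.
    print("ambiguous:", ", ".join(options))
```

A word with exactly one candidate can be hyphenated without asking; a word with none is simply not split at this level.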

Improving hyphenation

Intro

Correct Dutch hyphenation is a challenge because of compounding and the natural ambiguity of words.
In this wiki page I want to document how I am trying to create better hyphenation results for Dutch using existing software.

Preparations

The first requirement is a word list for which hyphenation should be supported. I decided to focus on supporting the current OpenTaal word list as the baseline.

It helps to initially remove proper names from this list, since they tend to contain irregular hyphenations. In fact, I removed all capitalized words.

Compound level

Luckily, hyphenation now (2.6) supports some decompounding. So the first step is to identify word borders and make reliable and safe splitting patterns for them.

Strictly speaking, this is not decompounding but word-level splitting. Decompounding would require both parts to be checked after splitting, whereas this is just splitting by pattern.
(For Dutch, since 'fiets' is a correct splitter, the pattern fiets1 will split fietsenrek into fiets=enrek, where enrek is not a word. To prevent this, an exception fiets2en1 has to be added. This is true for a lot of compounds.)
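
The effect of the exception can be shown with a small pattern matcher. This is a simplified sketch of how TeX-style (Liang) patterns are applied, where a digit sits between letters, the highest digit at a position wins, and an odd digit allows a break while an even digit forbids it. It is not the hyphenation library itself: the function names are hypothetical and settings such as COMPOUNDLEFTHYPHENMIN are reduced to two plain parameters.

```python
import re

def parse_pattern(pattern):
    """Split e.g. 'fiets2en1' into its letters and the digit at each gap."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)   # digit belongs to the gap before letters[pos]
        else:
            pos += 1
    return letters, values

def hyphenate(word, patterns, left_min=1, right_min=1):
    """Apply the patterns to `word` and mark the allowed breaks with '='."""
    dotted = "." + word + "."           # dots mark the word edges
    gaps = [0] * (len(dotted) + 1)      # gaps[k] is the gap before dotted[k]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = dotted.find(letters)
        while start != -1:
            for k, value in enumerate(values):
                gaps[start + k] = max(gaps[start + k], value)
            start = dotted.find(letters, start + 1)
    out = []
    for i, ch in enumerate(word):
        allowed = i >= left_min and len(word) - i >= right_min
        if allowed and gaps[i + 1] % 2 == 1:   # odd value: break before word[i]
            out.append("=")
        out.append(ch)
    return "".join(out)

print(hyphenate("fietsenrek", ["fiets1"]))
print(hyphenate("fietsenrek", ["fiets1", "fiets2en1"]))
```

With only fiets1 the sketch gives fiets=enrek; adding fiets2en1 raises the value after fiets to 2, which forbids that break, and allows one after fietsen instead.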

From a list containing only word boundaries, patgen might be used to detect patterns. However, pattern length is limited, so belasting is not treated well. So I chose to stick to manual control. That is more work, but I hope it gives more understandable results.

Example: lots of compounds split after 'ings', 'heids' and 'eits', with some exceptions, like 'belasting', which does not have a compounding s.
To prevent splitting at non-word boundaries (resulting in mistakes), a minimum of 4 characters on each side of a split has been chosen.

Generally, splitting works best when nouns are used to split at the start, since noun affixes are not in the way then.
Nouns derived from verbs and adjectives tend to have prefixes in them, and require exceptions at the start.
Complete plural nouns can be used to split at both ends (1verzekeringen1).

ISO8859-1
COMPOUNDLEFTHYPHENMIN 4
COMPOUNDRIGHTHYPHENMIN 4
% 1st (compound) level
heids1
c2h
eits1
ings1
1belasting1
1verzekering
(etc)
% 2nd (non compound) level
NEXTLEVEL

Generally, it is best to place a breakpoint in front of a common part when it is a noun that can take affixes within compounds: 1belasting.
For parts having a compounding s, putting a breakpoint after the s is productive.
Problems arise with (be/ver)werking, where the 3 forms behave differently.

This is the step I am currently at. The rest is a best guess on how to progress:

To test this, take all words of at least 8 characters and hyphenate them. Then use the generated hyphenation (word boundary) to split the words. Test whether all of the resulting word parts are actually correct.
Check the longest remaining words to see whether there is a useful pattern to split them.

Repeat this process until satisfied with the results.
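
One round of this test loop could look like the sketch below. It is only an outline under stated assumptions: `check_splits` and `toy_split` are hypothetical names, the splitter is passed in as a function that returns the word with = at the detected boundaries, and a part counts as correct only when it is itself in the word list (so parts carrying a compounding s would need extra handling).

```python
def check_splits(words, split, min_length=8):
    """Return {word: unknown parts} for words whose split gives unknown parts."""
    known = set(words)
    failures = {}
    for word in words:
        if len(word) < min_length:
            continue                      # too short for the compound level
        parts = split(word).split("=")
        if len(parts) == 1:
            continue                      # no boundary found, nothing to check
        unknown = [part for part in parts if part not in known]
        if unknown:
            failures[word] = unknown
    return failures

def toy_split(word):
    """Stand-in for the real hyphenator: always split after a leading 'fiets'."""
    if word.startswith("fiets") and len(word) > 5:
        return "fiets=" + word[5:]
    return word

words = ["fiets", "pomp", "rek", "fietspomp", "fietsenrek"]
print(check_splits(words, toy_split))
```

Each reported word points at a pattern that needs an exception, as with fiets=enrek here; after adding it, the check is run again.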

Part level

Words of 7 characters or fewer are not hyphenated at the compound level. They are handled at this level.

(What is the best option: pure pattern generation, or a list of all parts, like .aan=een.? This has to be tested. A list of parts is the most readable.)

Special hyphenations

Compound parts that need special hyphenation are best added as complete compounding parts:
.au1to1otje./o=t,6,3 ?????
ï/ï=i
ë/ë=e

Exceptions

When all more or less regular words have been added, it is time to add the strange ones, like incorrectly hyphenated proper names.

See also

(The part below should get its own page.)

Standards

category: word list
reference software:
open standard: words in a UTF-8 text file, one word per line
particulars:

category: spell checking
reference software: Hunspell (command-line, dynamic library with API, web service with API)
open standard: format documented in PDF, words and affixes in UTF-8 text files
particulars: supports compounds, words may not contain spaces

category: grammar checking
reference software: LanguageTool (GUI in Java, command-line, dynamic library with API, web service with API)
open standard: format defined in XSD, rules and patterns with regex in UTF-8 XML files
particulars: supports false friends, words may contain spaces

category: hyphenation patterns
reference software: "PatGen" (command-line) and libhyphen (dynamic library with API)
open standard: Hyphenation Definitions, format defined in EBNF, data in UTF-8 text files and processable with regex
particulars: supports special cases for Dutch

category: thesaurus
reference software:
open standard: Lexical Markup Framework, format defined in DTD but also in RDF/OWL, data in UTF-8 XML files
particulars: ISO standard 24613

category: word classification
reference software:
open standard: Part-Of-Speech (POS) tagging
particulars: