A multilingual gold standard for translation spotting of German compounds and their corresponding multiword units
in English, French, Italian and Spanish
This article describes a new word alignment gold standard for German nominal compounds and their multiword translation
equivalents in English, French, Italian, and Spanish. The gold standard contains alignments for each of the ten
language pairs, resulting in a total of 8,229 bidirectional alignments. It covers 362 occurrences of 137 different
German compounds randomly selected from the corpus of European Parliament plenary sessions, sampled according to the
criteria of frequency and morphological complexity. The standard serves for the evaluation and optimisation of
automatic word alignments in the context of spotting translations of German compounds. The study also shows that in
this text genre, around 80% of German noun types are morphological compounds, indicating potential multiword units in
their parallel equivalents.
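
As a rough illustration of how a gold standard of this kind can be used, the sketch below scores an automatic alignment against gold links and combines two directed alignments by symmetrisation. It is a minimal sketch under stated assumptions: the abstract does not say which metrics or symmetrisation heuristics the article uses, so precision, recall, F1, intersection and union are stand-ins, and the function names and toy links are hypothetical.

```python
from typing import Set, Tuple

# A link pairs a German token index with a target-language token index.
Link = Tuple[int, int]


def precision_recall_f1(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Score predicted alignment links against gold-standard links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def symmetrise(src_to_tgt: Set[Link], tgt_to_src: Set[Link], method: str = "intersection") -> Set[Link]:
    """Combine two directed alignments (both stored as (German, target) pairs)."""
    if method == "intersection":
        return src_to_tgt & tgt_to_src
    if method == "union":
        return src_to_tgt | tgt_to_src
    raise ValueError(f"unknown symmetrisation method: {method}")


# Toy data, invented for illustration: German compound at index 0,
# gold-aligned to a three-token English unit at indices 0..2.
gold = {(0, 0), (0, 1), (0, 2)}
de_en = {(0, 2)}                  # hypothetical directed alignment German -> English
en_de = {(0, 0), (0, 2), (1, 3)}  # hypothetical directed alignment English -> German

for method in ("intersection", "union"):
    links = symmetrise(de_en, en_de, method)
    p, r, f = precision_recall_f1(links, gold)
    print(f"{method}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On this toy input the intersection keeps only links both directions agree on, which favours precision, while the union keeps every proposed link, which favours recall.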
Article outline
- 1. Introduction
- 2. Resources
- 2.1 Selection and preprocessing of the gold standard material
- 2.2 Annotation guidelines and annotation process
- 3. Evaluation and discussion
- 3.1 Quality of universal part-of-speech (UPOS) tagging
- 3.2 Aligned UPOS tags across languages
- 3.3 Complexity of the compounds and aligned MWUs
- 3.4 Evaluation of the quality of the automatic GIZA++ word alignment
- 3.5 Optimisation of the directed word alignments through symmetrisation
- 3.6 Frequency effects
- 3.7 Effects of morphological complexity
- 3.8 Lexicalisation and variability
- 4. Conclusion
- Acknowledgements
- Notes
- References
- Appendix