Dutch compound splitting for bilingual terminology extraction
As compounds pose a problem for applications that rely on precise word alignments, we developed a state-of-the-art
compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation
techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists.
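As a rough illustration of how corpus frequencies from two lists might drive a splitting decision, the sketch below scores each candidate binary split by the geometric mean of its parts' frequencies and interpolates in-domain and out-of-domain relative frequencies linearly. This is not the splitter described in the article: the scoring scheme, the interpolation weight `lam`, the linking elements in `LINKERS`, and all counts are assumptions chosen for the example.

```python
# Toy frequency lists; every count here is invented for illustration.
OUT_DOMAIN = {"deur": 900, "klink": 40, "deurklink": 2, "stad": 800, "bestuur": 500}
IN_DOMAIN = {"deur": 30, "klink": 25, "deurklink": 1}

# Assumed Dutch linking elements that may follow the first part ("" = none).
LINKERS = ("", "s", "en", "e")


def rel_freq(word, counts):
    """Relative frequency of `word` in one frequency list (0.0 if unseen)."""
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0


def score(word, lam=0.5):
    """Linear interpolation of in-domain and out-of-domain relative frequency."""
    return lam * rel_freq(word, IN_DOMAIN) + (1 - lam) * rel_freq(word, OUT_DOMAIN)


def split_compound(word, min_len=3, lam=0.5):
    """Return the best binary split of `word`, or [word] if no split scores higher."""
    best_parts, best_score = [word], score(word, lam)
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        for linker in LINKERS:
            if linker and not head.endswith(linker):
                continue
            # Strip the linking element before looking up the first part.
            stem = head[: len(head) - len(linker)] if linker else head
            if len(stem) < min_len:
                continue
            # Geometric mean of the two parts' interpolated frequencies.
            candidate = (score(stem, lam) * score(tail, lam)) ** 0.5
            if candidate > best_score:
                best_parts, best_score = [stem, tail], candidate
    return best_parts


print(split_compound("deurklink"))
print(split_compound("stadsbestuur"))
print(split_compound("bestuur"))
```

The sketch handles only two-part compounds and has no linguistic filtering beyond the linker list; a word is left unsplit whenever its own interpolated frequency is at least as high as that of every candidate split.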
As compounds are not always translated compositionally, we developed a novel methodology for word alignment. We train
the word alignment models twice: a first time on the original data set and a second time on the data set in which the
compounds are split into their component parts.
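The data preparation for the two training runs can be sketched as follows. The abstract does not say how the two sets of alignments are combined, so the code only builds the second (split) version of the corpus and records, for each new token, the position of the original token it came from; that index map and the helper names `split_corpus` and `toy_splitter` are hypothetical additions for illustration.

```python
def split_corpus(sentences, splitter):
    """Build the compound-split version of a tokenised corpus.

    Returns the split sentences plus, per sentence, a list mapping each
    new token position to the position of its source token (an assumed
    bookkeeping step, useful for relating both alignment runs).
    """
    split_sentences, index_maps = [], []
    for tokens in sentences:
        new_tokens, index_map = [], []
        for pos, token in enumerate(tokens):
            parts = splitter(token)
            new_tokens.extend(parts)
            index_map.extend([pos] * len(parts))
        split_sentences.append(new_tokens)
        index_maps.append(index_map)
    return split_sentences, index_maps


def toy_splitter(token):
    """Stand-in for a real compound splitter: a one-entry lookup table."""
    table = {"deurklink": ["deur", "klink"]}
    return table.get(token, [token])


# First training run would use `original`, the second run `split`.
original = [["de", "deurklink", "is", "kapot"]]
split, maps = split_corpus(original, toy_splitter)
print(split)
print(maps)
```

Keeping the original corpus untouched means the first run can still align a compound as a single unit, which matters when its translation is not compositional.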
Experiments show that the compound splitter combined with the novel word alignment technique considerably improves
bilingual terminology extraction results.
Article outline
- 1. Introduction
- 2. Dutch compound splitter
- 2.1 Domain adaptation
- 2.2 Data Sets and Experiments
- 3. Impact on word alignment
- 3.1 Data sets and experiments
- 4. Impact on terminology extraction
- 5. Conclusion
- Notes
- References