Dutch compound splitting for bilingual terminology extraction

Macken, Lieve; Tezcan, Arda

doi:10.1075/cilt.341.07mac

Part of

Multiword Units in Machine Translation and Translation Technology
Edited by Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan
[Current Issues in Linguistic Theory 341] 2018
pp. 147–162

As compounds pose a problem for applications that rely on precise word alignments, we developed a state-of-the-art compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists.
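The idea of scoring candidate splits with frequencies drawn from both an in-domain and an out-of-domain list can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not taken from the chapter: the toy frequency lists, the interpolation weight, the set of linking elements, the restriction to two-part splits, and the geometric-mean scoring.

```python
from math import prod

# Hypothetical toy frequency lists (illustrative values only).
OUT_OF_DOMAIN = {"fiets": 900, "band": 700, "fietsband": 5,
                 "stad": 800, "bestuur": 400, "stadsbestuur": 3}
IN_DOMAIN = {"fiets": 30, "band": 40, "fietsband": 2}

LINKERS = ("", "s", "en", "e")  # assumed linking elements between parts
MIN_PART_LEN = 3                # assumed minimum length of a component


def relative_freq(word, freq_list):
    """Relative frequency of a word in one list (0.0 if unseen)."""
    total = sum(freq_list.values())
    return freq_list.get(word, 0) / total if total else 0.0


def combined_freq(word, in_weight=0.7):
    """Linear interpolation of in-domain and out-of-domain relative
    frequencies. The weight 0.7 is an arbitrary assumption."""
    return (in_weight * relative_freq(word, IN_DOMAIN)
            + (1 - in_weight) * relative_freq(word, OUT_OF_DOMAIN))


def candidate_splits(word):
    """Yield the unsplit word first, then every two-part split,
    optionally dropping a linking element from the first part."""
    yield (word,)
    for i in range(MIN_PART_LEN, len(word) - MIN_PART_LEN + 1):
        left, head = word[:i], word[i:]
        for linker in LINKERS:
            if linker and not left.endswith(linker):
                continue
            modifier = left[:len(left) - len(linker)] if linker else left
            if len(modifier) >= MIN_PART_LEN:
                yield (modifier, head)


def score(parts):
    """Geometric mean of the combined frequencies of the parts."""
    freqs = [combined_freq(p) for p in parts]
    return prod(freqs) ** (1 / len(freqs))


def split_compound(word):
    """Return the highest-scoring candidate; ties favour the unsplit word."""
    return max(candidate_splits(word), key=score)


print(split_compound("fietsband"))
print(split_compound("stadsbestuur"))
```

In this sketch a word is left unsplit whenever no candidate split scores higher than the word itself, so unknown words pass through unchanged.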

Because compounds are not always translated compositionally, we also developed a novel methodology for word alignment. We train the word alignment models twice: once on the original data set and once on a version of the data set in which the compounds are split into their component parts.
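The two training passes need two versions of the Dutch side of the corpus, and links found on the split version have to be related back to the original tokens. The sketch below shows one possible way to do that bookkeeping. It is an assumption for illustration only: the abstract does not say how the two alignments are combined, the function names are hypothetical, no aligner is actually run, and the alignment links are hand-written placeholders.

```python
def split_sentence(tokens, splitter):
    """Return the sentence with compounds split, plus a list mapping each
    split-token position to the index of the original token it came from."""
    split_tokens, origin = [], []
    for idx, token in enumerate(tokens):
        for part in splitter(token):
            split_tokens.append(part)
            origin.append(idx)
    return split_tokens, origin


def project_links(links, origin):
    """Map (split_source_index, target_index) links back to
    (original_source_index, target_index) links."""
    return {(origin[src], tgt) for src, tgt in links}


# Hypothetical splitter: a lookup table standing in for a real compound splitter.
TOY_SPLITS = {"fietsband": ("fiets", "band")}


def toy_splitter(token):
    return TOY_SPLITS.get(token, (token,))


dutch = ["de", "fietsband", "lekt"]
english = ["the", "bicycle", "tyre", "leaks"]

split_dutch, origin = split_sentence(dutch, toy_splitter)

# Placeholder links, as an aligner might produce them in each pass.
pass1_links = {(0, 0), (1, 2), (2, 3)}          # trained on the original data
pass2_links = {(0, 0), (1, 1), (2, 2), (3, 3)}  # trained on the split data

# One simple (assumed) combination: the union of the first-pass links and the
# second-pass links projected back onto the original tokens.
combined = pass1_links | project_links(pass2_links, origin)
print(split_dutch)
print(sorted(combined))
```

Here the second pass contributes the link between "fietsband" and "bicycle" that the placeholder first pass lacks, while the compound itself stays intact on the Dutch side.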

Experiments show that the compound splitter combined with the novel word alignment technique considerably improves bilingual terminology extraction results.

Keywords: compound splitting, bilingual terminology extraction, word alignment, multiword units, translation, Dutch

Article outline

1. Introduction
2. Dutch compound splitter
- 2.1 Domain adaptation
- 2.2 Data Sets and Experiments
3. Impact on word alignment
- 3.1 Data sets and experiments
4. Impact on terminology extraction
- 4.1 Experiments
5. Conclusion
Notes
References

Published online: 20 July 2018
