Chapter published in:
Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 79–90
Innovations in parallel corpus alignment and retrieval
Martin Volk | University of Zurich
In this chapter, we give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as part-of-speech tagging, lemmatization and dependency parsing, and we introduce language-specific methods, for example for dealing with split verbs or truncated compounds in German. Our corpus annotation includes the identification of code-switching within sentences as a special case of language identification. We argue for careful sentence and word alignment of parallel corpora, and we explain how word alignment serves as the basis for a wide range of applications, from translation variant ranking to lemma disambiguation.
Article outline
- 1. Introduction
- 2. Corpus annotations
- 2.1 General corpus annotation
- 2.2 Exploiting parallel corpora for annotation
- 2.3 Language-specific corpus annotation
- 3. Aligning parallel corpora
- 4. Retrieval from parallel corpora
- 5. Conclusion
- Acknowledgments
- Note
- References
Published online: 20 March 2019
https://doi.org/10.1075/scl.90.05vol