Innovations in parallel corpus alignment and retrieval
In this chapter, we give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as part-of-speech tagging, lemmatization and dependency parsing, and we introduce language-specific methods, for example for handling split verbs and truncated compounds in German. Our corpus annotation also identifies code-switching within sentences, which we treat as a special case of language identification. We argue for careful sentence and word alignment of parallel corpora, and we explain how word alignment serves as the basis for a wide range of applications, from translation variant ranking to lemma disambiguation.
Article outline
- 1. Introduction
- 2. Corpus annotations
- 2.1 General corpus annotation
- 2.2 Exploiting parallel corpora for annotation
- 2.3 Language-specific corpus annotation
- 3. Aligning parallel corpora
- 4. Retrieval from parallel corpora
- 5. Conclusion
- Acknowledgments
- Note
- References