Chapter 2
Mining parallel corpora from Wikipedia
In this article, we address the use of Wikipedia as a multilingual resource from which to extract parallel corpora, which are useful for multilingual terminology extraction and machine translation. While most previous work in this field assumes that Wikipedia is suitable for mining comparable corpora, we focus on the actual place of translation in Wikipedia's editorial process to examine the possibility of extracting parallel corpora, that is, texts in which source segments can be linked to their translations. After identifying the various projects, tools, and recommendations that allow contributors to enrich Wikipedia by exercising their skills as translators, we conduct an experiment in which we download pairs of articles containing translations. We show the importance of temporally aligning the versions to be downloaded before launching the actual sentence-level alignment. This strategy allows us to obtain a large volume of parallel text with good-quality sentence-to-sentence alignment.
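The temporal-alignment step mentioned above can be sketched as follows: for a given version of an article in one language, select the version of its counterpart article whose revision timestamp is closest in time, so that both downloaded versions reflect the same editorial state. The revision timestamps and IDs below are illustrative placeholders, assuming revision metadata has already been retrieved (e.g., via the MediaWiki API's revisions query):

```python
from datetime import datetime

def closest_revision(revisions, reference_time):
    """Return the (timestamp, revision_id) pair closest to reference_time."""
    return min(revisions, key=lambda rev: abs(rev[0] - reference_time))

# Hypothetical revision history of a French article: (timestamp, revision_id).
fr_revisions = [
    (datetime(2021, 3, 1, 10, 0), 1001),
    (datetime(2021, 3, 5, 14, 30), 1002),
    (datetime(2021, 6, 20, 9, 15), 1003),
]

# Timestamp of the English version we want to pair with a French version.
en_time = datetime(2021, 3, 5, 15, 0)

ts, rev_id = closest_revision(fr_revisions, en_time)
# rev_id identifies the French revision nearest in time to the English one.
```

Pairing the temporally closest revisions makes it more likely that the two texts stand in a direct translation relationship, rather than having diverged through later independent edits.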
Article outline
- 1. Introduction
- 2. Wikipedia as a comparable corpus for NLP and contrastive studies
  - 2.1 Aligning documents according to domain or content
  - 2.2 Bilingual lexicon extraction
  - 2.3 Aligning sentences and chunks using machine translation
  - 2.4 Sentence alignment using a monotonic algorithm
  - 2.5 Parallel sentence extraction
- 3. The translation process in Wikipedia
  - 3.1 Translation projects
  - 3.2 Translation guidelines
  - 3.3 Review process
  - 3.4 Translation tools
  - 3.5 Content Translation tool statistics
  - 3.6 Translation into languages other than English
- 4. Experiments
  - 4.1 Preliminary observations
  - 4.2 Downloading potentially alignable items
  - 4.3 First experiment: Sentence alignment of articles
  - 4.4 Second experiment: Filtering using dotplot
  - 4.5 Third experiment: Using Content Translation application markup
- 5. Conclusion and future perspectives
- Notes
- References
- Appendix