Chapter 2
Mining parallel corpora from Wikipedia
In this article, we examine Wikipedia as a multilingual resource from which to extract parallel corpora, which are useful for multilingual terminology extraction and machine translation. While most previous work in this field assumes that Wikipedia is suitable for mining comparable corpora, we concentrate on the actual place of translation in Wikipedia's editorial process to examine the possibility of extracting parallel corpora, that is, texts in which source segments can be linked to their translations. After identifying the various projects, tools, and recommendations that allow contributors to enrich Wikipedia by exercising their skills as translators, we conduct an experiment in which we download pairs of articles containing translations. We show the importance of performing a temporal alignment of the versions to be downloaded before launching the actual sentence-level alignment. This strategy allows us to obtain a large volume of parallel texts with good-quality sentence-to-sentence alignment.
Article outline
- 1. Introduction
- 2. Wikipedia as a comparable corpus for NLP and contrastive studies
- 2.1 Aligning documents according to domain or content
- 2.2 Bilingual lexicon extraction
- 2.3 Aligning sentences and chunks using machine translation
- 2.4 Sentence alignment using a monotonic algorithm
- 2.5 Parallel sentence extraction
- 3. The translation process in Wikipedia
- 3.1 Translation projects
- 3.2 Translation guidelines
- 3.3 Review process
- 3.4 Translation tools
- 3.5 Content translation tool statistics
- 3.6 Translation into languages other than English
- 4. Experiments
- 4.1 Preliminary observations
- 4.2 Downloading potentially alignable items
- 4.3 First experiment: Sentence alignment of articles
- 4.4 Second experiment: Filtering using dotplot
- 4.5 Third experiment: Using Content Translation application markup
- 5. Conclusion and future perspectives
- Notes
- References
- Appendix