Further Experiments in Bilingual Text Alignment

Somers, Harold

doi:10.1075/ijcl.3.1.06som

Article published in:

International Journal of Corpus Linguistics
Vol. 3:1 (1998), pp. 115–150

Harold Somers | UMIST

We describe and experimentally evaluate an alternative algorithm for aligning and extracting vocabulary from parallel texts using recency vectors and a similarity measure based on Levenshtein distance. The work is largely inspired by Fung and McKeown's DK-vec, though we use a simpler algorithm. The technique is tested on two sets of parallel corpora involving English, French, German, Dutch, Spanish, and Japanese. We attempt to evaluate the importance of parameters such as frequency of words chosen as candidates, the effect of different language pairings, and differences between the two corpora.
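The general idea of comparing recency vectors with an edit distance can be illustrated with a minimal sketch. It is not the article's implementation: the function names (`recency_vector`, `levenshtein`, `similarity`), the unit edit costs, the exact-match test between gaps, and the length normalisation are all assumptions made here for illustration, and the measure actually used in the article may differ. A recency vector is taken here to be the list of gaps between successive occurrences of a word in a tokenised text.

```python
from typing import List, Sequence


def recency_vector(tokens: Sequence[str], word: str) -> List[int]:
    """Return the gaps between successive occurrences of `word` in `tokens`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Plain Levenshtein distance between two integer sequences.

    Assumption: insertions, deletions and substitutions all cost 1,
    and two gaps match only if they are exactly equal.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Hypothetical normalisation: 1.0 for identical vectors, 0.0 for maximally different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under these assumptions, a candidate word pair would be scored by computing `similarity(recency_vector(source_tokens, w1), recency_vector(target_tokens, w2))`. Exact matching of gaps is a deliberately crude choice; a practical measure would probably need to tolerate small differences between gaps, since translations rarely preserve token offsets exactly.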

Keywords: Text Alignment, Vocabulary Estimation, Word Alignment, Levenshtein Distance, Parallel Corpora

Published online: 1 January 1998
