Publication details [#13342]

Finch, Andrew, Yasuhiro Akiba and Eiichiro Sumita. 2004. How does automatic Machine Translation evaluation correlate with human scoring as the number of reference translations increases? In Lino, Maria Teresa, Maria Francisca Xavier, Fátima Ferreira, Rute Costa and Raquel Silva, eds. LREC-2004: fourth international conference on language resources and evaluation. pp. 2019–2022.
Publication type
Article in jnl/bk
Publication language


Automatic machine translation evaluation is a very difficult task due to the wide diversity of valid output translations that may result from translating a single source sentence or textual segment. Recently a number of competing methods of automatic machine translation evaluation have been adopted by the research community; among the most widely used are BLEU, NIST, mWER and the F-measure. This work extends previous work in the field examining how closely these evaluation techniques match human performance at ranking translation output. The authors focus on investigating how these methods scale with increasing numbers of human-produced references. They measure the correlation of the automatic ranking of the output of nine different machine translation systems with the ranking derived from scores assigned by nine human evaluators, using up to sixteen references per sentence. Their results show that evaluation performance improves with increasing numbers of references for all of the scoring methods except NIST, which shows improvements only with small numbers of references.
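The correlation measurement described in the abstract can be sketched as follows. This is not the authors' code, and all scores below are hypothetical illustrations: it ranks nine fictitious MT systems once by an automatic metric score (e.g. BLEU) and once by a human score, then computes the Spearman rank correlation between the two rankings.

```python
def ranks(scores):
    """Map each score to its rank (1 = best, i.e. highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation between two score lists (no ties)."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for nine MT systems (illustrative only).
bleu_scores  = [0.31, 0.28, 0.35, 0.22, 0.30, 0.27, 0.33, 0.25, 0.29]
human_scores = [3.9, 3.5, 4.2, 2.8, 3.8, 3.4, 4.0, 3.0, 3.6]

print(round(spearman(bleu_scores, human_scores), 3))  # 1.0 means identical rankings
```

In the study itself, such a correlation would be recomputed as the number of reference translations grows, showing how each metric's agreement with the human ranking changes.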
Source: Based on abstract in book