Chapter 4. Semantic textual similarity based on deep learning: Can it improve matching and retrieval for Translation Memory tools?

Ranasinghe, Tharindu; Mitkov, Ruslan; Orăsan, Constantin; Quintana, Rocío Caro

doi:10.1075/btl.158.04ran

Part of

Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations
Edited by Julia Lavid-López, Carmen Maíz-Arévalo and Juan Rafael Zamorano-Mansilla
[Benjamins Translation Library 158] 2021
pp. 101–124

Chapter 4
Semantic textual similarity based on deep learning

Can it improve matching and retrieval for Translation Memory tools?

Tharindu Ranasinghe

Ruslan Mitkov

Constantin Orăsan

Rocío Caro Quintana

This study proposes an original methodology to underpin new-generation Translation Memory (TM) systems, in which the translations retrieved from the TM database are matched not by Levenshtein (edit) distance but by Natural Language Processing (NLP) and Deep Learning (DL) techniques. We experimented with three DL sentence encoders to retrieve TM matches for English-Spanish sentence pairs from the DGT-TM dataset. Each sentence encoder was compared with Okapi, which uses edit distance to retrieve the best match. The automatic evaluation shows the benefit of DL technology for TM matching and holds promise for the implementation of the TM tool itself, which is our next project.
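The contrast drawn in the abstract, edit-distance matching versus matching on sentence embeddings, can be illustrated with a minimal Python sketch. It is not the chapter's implementation or Okapi's exact scoring formula: the word-level normalisation in `fuzzy_score`, the helper `best_match`, and the idea of passing in precomputed vectors are all assumptions made for illustration. In practice the vectors would come from a sentence encoder such as those listed in the outline.

```python
from math import sqrt


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete x
                            curr[j - 1] + 1,            # insert y
                            prev[j - 1] + (x != y)))    # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_score(a, b):
    """Word-level fuzzy match in [0, 1]; an assumed normalisation, not Okapi's."""
    ta, tb = a.lower().split(), b.lower().split()
    longest = max(len(ta), len(tb))
    return 1.0 if longest == 0 else 1.0 - levenshtein(ta, tb) / longest


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def best_match(query, candidates, score):
    """Index of the candidate with the highest score against the query."""
    return max(range(len(candidates)), key=lambda i: score(query, candidates[i]))
```

With `fuzzy_score`, `best_match` ranks TM segments by surface overlap of words. With `cosine` and encoder vectors in place of strings, the same function ranks them by closeness in embedding space, which is the kind of matching the study evaluates.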

Keywords: machine translation, translation memory, deep learning, Okapi, textual similarity, semantic similarity

Article outline

1. Introduction
2. Methodology
- 2.1 InferSent
- 2.2 Universal sentence encoder
- 2.3 Sentence BERT
3. Dataset and experiments
4. Evaluation and results
5. Analysis of typical errors
6. Conclusion
Acknowledgements
Notes
References

Published online: 8 December 2021

https://doi.org/10.1075/btl.158.04ran

Cited by (1)

Wang, Qiang, Hongfeng Wang & Mohammad Farukh Hashmi. 2022. Deep Learning Model-Based Machine Learning for Chinese and Japanese Translation. Wireless Communications and Mobile Computing 2022, pp. 1 ff.

This list is based on CrossRef data as of 4 July 2024 and may not be complete.
