Chapter 4
Semantic textual similarity based on deep learning
Can it improve matching and retrieval for Translation Memory tools?
This study proposes an original methodology to underpin new-generation Translation Memory (TM) systems, in which the translations to be retrieved from the TM database are matched not on the basis of Levenshtein (edit) distance but by means of Natural Language Processing (NLP) and Deep Learning (DL) techniques. We experimented with three DL sentence encoders to retrieve TM matches for English-Spanish sentence pairs from the DGT-TM dataset. Each sentence encoder was compared with Okapi, which uses edit distance to retrieve the best match. The automatic evaluation shows the benefit of DL technology for TM matching and holds promise for the implementation of the TM tool itself, which is our next project.
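The contrast between the two matching strategies can be illustrated with a minimal sketch. It is not the chapter's implementation: the word-level tokenisation, the length normalisation in `fuzzy_score`, and the helper names `retrieve` and `embedding_score` are assumptions made for illustration, and the way Okapi actually computes its match scores is not described here. The `encode` argument stands for any sentence encoder that maps a string to a vector.

```python
import math


def levenshtein(a, b):
    """Edit distance between two token sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_score(query, candidate):
    """Assumed normalisation: 1 - distance / length of the longer segment."""
    q, c = query.lower().split(), candidate.lower().split()
    longest = max(len(q), len(c))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(q, c) / longest


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def embedding_score(query, candidate, encode):
    """Similarity of two segments under a caller-supplied sentence encoder."""
    return cosine(encode(query), encode(candidate))


def retrieve(query, tm_sources, score):
    """Return the TM source segment with the highest score, and that score."""
    best = max(tm_sources, key=lambda s: score(query, s))
    return best, score(query, best)
```

With this shape, the edit-distance baseline is `retrieve(query, tm_sources, fuzzy_score)`, while a semantic matcher is `retrieve(query, tm_sources, lambda q, s: embedding_score(q, s, encode))`. Only the scoring function changes, which is why a paraphrase that shares few surface words with a stored segment can still be retrieved by the second approach.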
Article outline
- 1. Introduction
- 2. Methodology
- 2.1 InferSent
- 2.2 Universal sentence encoder
- 2.3 Sentence BERT
- 3. Dataset and experiments
- 4. Evaluation and results
- 5. Analysis of typical errors
- 6. Conclusion
- Acknowledgements
- Notes
- References