Chapter published in:
Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 281–298
Normalization of shorthand forms in French text messages using word embedding and machine translation
Parijat Ghoshal | Neue Zürcher Zeitung, KOF Swiss Economic Institute
Xi Rao | Neue Zürcher Zeitung, KOF Swiss Economic Institute
This chapter focuses on the normalization of abbreviations and shorthand forms used in French text messages. These forms are difficult to normalize because most of them cannot be resolved by typical spell checkers or dictionary lookups. We first aligned normalized and non-normalized French text messages to build a parallel corpus. We then applied two popular approaches to text normalization: multilingual word embeddings and character-based machine translation. We compare the results and examine how effectively our models normalize deletions, substitutions, repetitions, swaps, and insertions made to canonical forms. This is the first paper that uses Multivec and the Belgian SMS corpus collected under the SMS4Science project. The unsupervised machine learning approach makes the system highly flexible and easily adaptable, and it provides a domain-independent method of text normalization.
Keywords: SMS, parallel corpus, French, abbreviation/shorthand form normalization, unsupervised learning, character-based machine translation, distributional semantics, Multivec, neural networks, deep learning, word embeddings
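As a rough illustration of what "character-based machine translation" implies for the data, the sketch below rewrites an aligned message pair so that every character becomes its own token, which is the usual way to feed character-level text to a phrase-based system such as Moses. The boundary marker `_`, the function names, and the example pair are assumptions made for illustration; they are not taken from the chapter or from the SMS4Science corpus.

```python
# Hypothetical word-boundary marker; assumes "_" never occurs in the messages.
BOUNDARY = "_"


def to_char_sequence(text: str) -> str:
    """Split text into space-separated characters, marking word boundaries."""
    words = text.split()
    return f" {BOUNDARY} ".join(" ".join(word) for word in words)


def from_char_sequence(sequence: str) -> str:
    """Invert to_char_sequence: join characters and restore word spaces."""
    return "".join(" " if tok == BOUNDARY else tok for tok in sequence.split())


# Invented example pair (shorthand form, normalized form), not corpus data.
pairs = [("slt cv", "salut ça va")]

for raw, normalized in pairs:
    # Each side of the pair would go into its own file of the parallel corpus.
    print(to_char_sequence(raw), "|||", to_char_sequence(normalized))
```

With this representation, a translation model learns character-level correspondences (for instance, inserting the vowels missing from a shorthand form), and `from_char_sequence` turns the system's output back into ordinary text.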
Article outline
- 1. Introduction
- 2. Previous work
- 3. Corpus and preprocessing
- 3.1 Corpus
- 3.2 Preprocessing
- 4. Methodologies, tools and experiments
- 4.1 Methodologies
- 4.2 Tools and experiments
- multivec
- moses
- 5. Results analysis
- 6. Conclusion
- 7. Future work
- Acknowledgment
- Notes
- References
Published online: 20 March 2019
https://doi.org/10.1075/scl.90.17gho