Detecting loan words computationally

Zhang, Liqin; Manni, Franz; Fabri, Ray; Nerbonne, John

doi:10.1075/coll.59.11zha

Part of

Variation Rolls the Dice: A worldwide collage in honour of Salikoko S. Mufwene
Edited by Enoch O. Aboh and Cécile B. Vigouroux
[Contact Language Library 59] 2021
pp. 269–288

Liqin Zhang | Open University, Netherlands

Franz Manni | National Museum of Natural History, Paris

Ray Fabri | L-Università ta’ Malta

John Nerbonne | University of Groningen & University of Freiburg

A loan word is a word borrowed from one language and adopted into another; examples are the English words toboggan, skunk, and hickory, all of which were borrowed from Algonquian languages. Among languages that are not (closely) related, loan words are recognizable because they are semantically related to their source words and more similar to them in pronunciation than one would expect by coincidence. This chapter applies techniques for measuring pronunciation similarity, focusing on edit-distance measures and a sound-class-based method. The novel issue in loan-word detection is that loan words are normally modified to fit the phonology of the borrowing language, so that high sensitivity in measuring pronunciation similarity may not be an advantage.

Keywords: loan words, automatic detection, edit distance, sound class alignment, language contact
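
The two families of measures named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: it uses plain, unweighted Levenshtein distance, and the sound-class table `TOY_CLASSES`, the vowel set, and the function names are hypothetical, chosen only to show how collapsing segments into broad classes makes a measure less sensitive to phonological adaptation. The input strings are simplified toy transcriptions.

```python
def levenshtein(a, b):
    """Unweighted edit distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical, deliberately coarse sound classes (an assumption for
# illustration, not the class inventory used in the chapter).
TOY_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K",
    "s": "S", "z": "S",
    "m": "M", "n": "N",
    "l": "R", "r": "R",
}
VOWELS = set("aeiou")


def to_classes(segments):
    """Map each segment to its toy class; all vowels collapse to 'V'."""
    return ["V" if s in VOWELS else TOY_CLASSES.get(s, s) for s in segments]


source, adapted = "kitab", "kitap"   # toy transcriptions
print(levenshtein(source, adapted))                          # segment level
print(normalized_distance(source, adapted))                  # length-normalized
print(levenshtein(to_classes(source), to_classes(adapted)))  # class level
```

At the segment level the two toy forms differ by one substitution, while at the class level they are identical, which is the sense in which a less sensitive measure can tolerate adaptation to the borrowing language's phonology.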

Article outline

1. Introduction
2. Mufwene’s perspective
3. Previous work
4. Data
- Example data
5. Measuring pronunciation similarity
6. General setup
7. Evaluation and results
- Examples of detected loans
8. Discussion and prospects
- Conclusions and a speculation
- Future work
- Prospects
Supplementary data
Acknowledgements
Notes
References

Published online: 12 October 2021


