Detecting loan words computationally
Franz Manni | National Museum of Natural History, Paris
A loan word is a word that is borrowed from one language and adopted into another; examples are the English words toboggan, skunk, and hickory, all of which were borrowed from Algonquian languages. Among languages that are not (closely) related, loan words are recognizable because they are semantically related to words in the source language and are more similar to them in pronunciation than one would expect by coincidence. This chapter applies techniques for measuring pronunciation similarity, focusing on edit-distance measures and a sound-class-based method. The novel issue in loan-word detection is that loan words are normally modified to fit the phonology of the borrowing language, so that high sensitivity in measuring pronunciation similarity may be less desirable.
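To make the two families of measures concrete, the following is a minimal sketch, not the chapter's implementation. It shows a plain unit-cost edit distance over segment sequences and a coarser comparison after mapping segments to sound classes. The class table `TOY_CLASSES`, the helper names, and the two example forms are invented for illustration; they are not the classes or data used in the chapter.

```python
from typing import List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance between two segment sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length (one common normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical, deliberately tiny sound-class table (assumption, illustration only).
TOY_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K",
    "s": "S", "z": "S",
    "m": "M", "n": "N",
    "l": "R", "r": "R",
}
VOWELS = set("aeiou")


def to_classes(segments: Sequence[str]) -> List[str]:
    """Map each segment to its toy class; vowels collapse to 'V', unknowns are kept."""
    return ["V" if s in VOWELS else TOY_CLASSES.get(s, s) for s in segments]


# Two invented forms that differ only in voicing.
source_form = list("tabak")
borrowed_form = list("dabag")

raw = levenshtein(source_form, borrowed_form)
by_class = levenshtein(to_classes(source_form), to_classes(borrowed_form))
```

For the invented pair above, the raw edit distance is 2 (0.4 after normalization), whereas the class-based distance is 0, because the voicing differences disappear once segments are reduced to classes. This is the sense in which a less sensitive measure can be more tolerant of the phonological adaptation that borrowed words undergo.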
Article outline
- 1. Introduction
- 2. Mufwene’s perspective
- 3. Previous work
- 4. Data
- 5. Measuring pronunciation similarity
- 6. General setup
- 7. Evaluation and results
- Examples of detected loans
- 8. Discussion and prospects
- Conclusions and a speculation
- Future work
- Prospects
- Supplementary data
- Acknowledgements
- Notes
- References