Detecting loan words computationally
Franz Manni | National Museum of Natural History, Paris
A loanword is a word that is borrowed from one language and adopted into another; examples are the English words toboggan, skunk, and hickory, all of which were borrowed from Algonquian languages. Among languages that are not (closely) related, loanwords are recognizable because they are semantically related and more similar in pronunciation than one would expect by coincidence. This chapter applies techniques for measuring pronunciation similarity, focusing on edit-distance measures and a sound-class-based method. The novel issue in loanword detection is that loanwords are normally modified to fit the phonology of the borrowing language, so that greater sensitivity in measuring pronunciation similarity may be of limited value.
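As a rough illustration of the edit-distance idea mentioned in the abstract, the following minimal Python sketch computes a length-normalized Levenshtein distance between two segment strings and flags a pair as a loan candidate when the distance falls below a cutoff. The function names, the unit costs, the normalization by the longer word, and the threshold of 0.4 are all assumptions made for illustration; they are not taken from the chapter, which may weight segments and set cutoffs differently.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two segment sequences (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, seg_a in enumerate(a, start=1):
        curr = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # delete seg_a
                            curr[j - 1] + 1,      # insert seg_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer word (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def is_loan_candidate(a, b, threshold=0.4):
    """Flag a same-meaning word pair as a loan candidate (hypothetical cutoff)."""
    return normalized_distance(a, b) <= threshold


# Toy segment strings for illustration only, not data from the chapter.
print(normalized_distance("kitab", "kitob"))
print(is_loan_candidate("kitab", "kitob"))
```

Because every substitution costs the same here, the sketch treats a vowel-for-vowel change and a vowel-for-consonant change alike; a more sensitive measure would distinguish them, which is exactly the kind of refinement whose usefulness for adapted loanwords the abstract calls into question.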
Article outline
- 1. Introduction
- 2. Mufwene’s perspective
- 3. Previous work
- 4. Data
- 5. Measuring pronunciation similarity
- 6. General setup
- 7. Evaluation and results
- Examples of detected loans
- 8. Discussion and prospects
- Conclusions and a speculation
- Future work
- Prospects
- Supplementary data
- Acknowledgements
- Notes
- References