A probabilistic assessment of the Indo-Aryan Inner–Outer Hypothesis
This paper uses a novel data-driven probabilistic approach to address the century-old Inner–Outer Hypothesis of
Indo-Aryan. I develop a Bayesian hierarchical mixed-membership model to assess the validity of this hypothesis using a large data
set of automatically extracted sound changes operating between Old Indo-Aryan and Modern Indo-Aryan speech varieties. I employ
different prior distributions in order to model sound change, one of which, the Logistic Normal distribution, has not received
much attention in linguistics outside of Natural Language Processing, despite its many attractive features. I find evidence for
cohesive dialect groups that have made their imprint on contemporary Indo-Aryan languages, and find that when a Logistic Normal
prior is used, the distribution of dialect components across languages is largely compatible with a core-periphery pattern similar
to that proposed under the Inner–Outer Hypothesis.
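As a rough illustration of the two kinds of prior mentioned above, the following sketch draws one probability vector from a Dirichlet distribution and one from a Logistic Normal distribution. It is not the paper's model: the number of components `K`, the concentration value, the Cholesky factor `chol`, and both function names are hypothetical choices made for this example. The Logistic Normal draw uses the softmax of a K-dimensional Gaussian, one common parameterization, which allows components to covary through the Gaussian covariance.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def softmax(xs):
    """Map a real-valued vector onto the probability simplex."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_logistic_normal(mean, chol, rng):
    """Draw x = mean + chol @ z with z ~ N(0, I), then apply softmax.

    `chol` is a lower-triangular Cholesky factor of the Gaussian covariance.
    """
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    x = [
        m + sum(chol[i][j] * z[j] for j in range(i + 1))
        for i, m in enumerate(mean)
    ]
    return softmax(x)


rng = random.Random(0)
K = 3  # hypothetical number of components

# Symmetric Dirichlet with concentration below 1 (favours sparse vectors).
dirichlet_draw = sample_dirichlet([0.5] * K, rng)

# Illustrative covariance: components 0 and 1 are positively correlated.
chol = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
logistic_normal_draw = sample_logistic_normal([0.0] * K, chol, rng)

print("Dirichlet draw:      ", dirichlet_draw)
print("Logistic Normal draw:", logistic_normal_draw)
```

Both draws are non-negative vectors that sum to one. The difference lies in the dependence structure: the covariance of the underlying Gaussian gives the Logistic Normal an explicit way to encode correlation between components.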
Article outline
- 1. Introduction
- 2. Background
- 2.1 Indo-Aryan dialectal variation
- 2.1.1 Pre-Old Indo-Aryan period
- 2.1.2 Old Indo-Aryan period
- 2.1.3 Middle Indo-Aryan period
- 2.1.4 New Indo-Aryan period
- 2.2 Proposed Indo-Aryan dialectal groupings
- 3. Rationale
- 3.1 Bayesian models in linguistics and related fields
- 3.2 Operationalizing the Inner-Outer Hypothesis
- 4. Data
- 5. Modeling sound change
- 5.1 Prior distributions over sound change probabilities
- 6. Generative model
- 7. Implementation and inference
- 8. Results
- 8.1 Sparsity of language-group distributions
- 8.2 Language-group distributions
- 8.3 Sound change distributions
- 8.4 Posterior predictive checks
- 8.4.1 Entropy
- 8.4.2 Accuracy
- 9. Discussion and outlook
- 10. Conclusion
- Acknowledgements
- Notes
- Appendix (supplementary material)
- Dirichlet model sound change probabilities
- Logistic normal model sound change probabilities
- Accuracy scores for sound change distributions for simulated data