Strategies in tracing linguistic variation in a corpus of Old Irish texts (CorPH)

Stifter, David; Qiu, Fangzhe; Aquino-López, Marco A.; Bauer, Bernhard; Lash, Elliott; White, Nora

doi:10.1075/ijcl.22018.sti

Article published in:

Corpus studies of language through time: Special issue of the International Journal of Corpus Linguistics 27:4 (2022)
Edited by Tony McEnery, Gavin Brookes and Isobelle Clarke
International Journal of Corpus Linguistics 27:4 (2022), pp. 529–553

David Stifter | Maynooth University

Fangzhe Qiu | University College Dublin

Marco A. Aquino-López | Centro de Investigación en Matemáticas

Bernhard Bauer | Karl-Franzens-Universität Graz

Elliott Lash | Georg-August-Universität Göttingen

Nora White | Maynooth University

This article introduces Corpus PalaeoHibernicum (CorPH), a corpus that currently comprises 78 texts in Early Irish (c. 7th–10th cent.). The ERC-funded Chronologicon Hibernicum (ChronHib) project created it by bringing together pre-existing lexical and syntactic databases and adding further crucial texts from the period. CorPH is annotated for POS, morphological and syntactic information, and a further layer of annotation has been developed for it: ‘Variation Tagging’, a tagset that numerically encodes synchronic language variation during the Early Irish period and thus supports much improved research on chronological variation in the material. A second new pillar for studying linguistic variation is Bayesian Language Variation Analysis (BLaVA), which addresses the challenge that “not-so-big data” poses to statistical corpus methods. Instead of reporting feature frequencies, BLaVA models language variation as probabilities of variation.
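The contrast between raw frequencies and probabilities of variation can be illustrated with a minimal sketch. It is not the BLaVA model itself, whose details are not given in this abstract; it assumes a simple Beta-Binomial model for one binary variable (an older versus a newer variant), a uniform Beta(1, 1) prior, and entirely hypothetical counts. The names `BetaPosterior` and `update` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    """Beta distribution over the probability of the newer variant."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total ** 2 * (total + 1))


def update(newer: int, older: int,
           prior_alpha: float = 1.0, prior_beta: float = 1.0) -> BetaPosterior:
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    if newer < 0 or older < 0:
        raise ValueError("counts must be non-negative")
    return BetaPosterior(prior_alpha + newer, prior_beta + older)


# Hypothetical counts: both texts show the same raw frequency (2/3)
# of the newer variant, but one text is very short.
short_text = update(newer=2, older=1)
long_text = update(newer=200, older=100)

for label, post in (("short", short_text), ("long", long_text)):
    print(f"{label}: mean={post.mean:.3f}, variance={post.variance:.5f}")
```

Under these assumptions the two texts share a raw frequency, yet the posterior for the short text is much wider. That uncertainty is the kind of information a frequency-only summary discards when the data are "not-so-big".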

Keywords: Old Irish, diachronic variation, Chronologicon Hibernicum, Bayesian statistics, variation tagging

Article outline

1. Introduction
2. Characteristics of Old Irish
3. The corpus
4. Corphusator
5. Variation tagging
6. Bayesian language variation analysis
7. Advantages and benefits of the methods
8. Challenges and desiderata
Acknowledgements
References

Available under the Creative Commons Attribution (CC BY) 4.0 license.

For any use beyond this license, please contact the publisher.

Published online: 20 September 2022

https://doi.org/10.1075/ijcl.22018.sti
