Strategies in tracing linguistic variation in a corpus of Old Irish texts (CorPH)
This article introduces Corpus PalaeoHibernicum (CorPH), a corpus currently consisting of 78 texts in Early Irish
(c. 7th–10th cent.) created by the ERC-funded Chronologicon Hibernicum (ChronHib) project by
bringing together pre-existing lexical and syntactic databases and adding further crucial texts from the period. In addition to
POS, morphological and syntactic annotation, CorPH carries a further layer of annotation developed specifically for it:
‘Variation Tagging’, i.e. a tagset that numerically encodes synchronic language variation during the Early Irish period, thus
allowing for much improved research on the chronological variation among the material. A second new pillar of the study of linguistic
variation is Bayesian Language Variation Analysis (BLaVA), which addresses the challenge that “not-so-big data” poses to
statistical corpus methods. Instead of reflecting feature frequencies, BLaVA models language variation as probabilities of
variation.
Article outline
- 1. Introduction
- 2. Characteristics of Old Irish
- 3. The corpus
- 4. Corphusator
- 5. Variation tagging
- 6. Bayesian language variation analysis
- 7. Advantages and benefits of the methods
- 8. Challenges and desiderata
- Acknowledgements
- References