Natural Language Processing for Ancient Greek: Design, advantages and challenges of language models

Stopponi, Silvia; Pedrazzini, Nilo; Peels-Matthey, Saskia; McGillivray, Barbara; Nissim, Malvina

doi:10.1075/dia.23013.sto

Article published in:

Demystifying New Methods in Historical Linguistics
Edited by Erich Round
[Diachronica 41:3] 2024
pp. 414–435

Silvia Stopponi | University of Groningen

Nilo Pedrazzini | The Alan Turing Institute

Saskia Peels-Matthey | University of Groningen

Barbara McGillivray | King’s College London

Malvina Nissim | University of Groningen

Computational methods have produced meaningful and usable results for the study of word semantics, including semantic change. These methods, which belong to the field of Natural Language Processing, have recently been applied to ancient languages; in particular, language modelling has been applied to Ancient Greek, the language on which we focus. In this contribution we explain how vector representations can be computed from word co-occurrences in a corpus and used to locate words in a semantic space, and what kind of semantic information can be extracted from language models. We compare three kinds of language models that can be used to study Ancient Greek semantics (a count-based model, a word embedding model and a syntactic embedding model), and we show examples of how the quality of their representations can be assessed. We highlight the advantages and potential of these methods, especially for the study of semantic change, together with their limitations.
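As a minimal illustration of the count-based case mentioned in the abstract, the sketch below builds co-occurrence vectors from a corpus and compares words by cosine similarity. It is not the authors' implementation: the toy corpus consists of invented English glosses rather than Ancient Greek data, and the helper names `cooccurrence_vectors` and `cosine`, as well as the window size, are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """For each word, count how often every other word occurs within `window` tokens."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy corpus (English glosses, not real Ancient Greek data).
toy_corpus = [
    ["the", "priest", "offers", "wine", "to", "the", "god"],
    ["the", "priest", "offers", "oil", "to", "the", "goddess"],
    ["the", "sailor", "steers", "the", "ship"],
]

vectors = cooccurrence_vectors(toy_corpus, window=2)

# Words that share contexts end up closer together in the semantic space.
sim_offerings = cosine(vectors["wine"], vectors["oil"])
sim_unrelated = cosine(vectors["wine"], vectors["ship"])
print(f"wine ~ oil:  {sim_offerings:.3f}")
print(f"wine ~ ship: {sim_unrelated:.3f}")
```

In this toy setting "wine" and "oil" occur in identical contexts and so receive a higher similarity than "wine" and "ship". Models trained on real corpora typically add weighting and dimensionality reduction, which this sketch leaves out.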

Keywords: Ancient Greek, semantic change, computational linguistics, language models, Natural Language Processing, word embeddings, semantic space

Article outline

1. Introduction: Language modelling for Ancient Greek
2. Annotation and existing annotated corpora of Ancient Greek
3. Distributional spaces
4. Count-based models and word embeddings: Potential and limitations
5. Computational studies on semantic change in Ancient Greek
6. Evaluation of the performance of language models
7. Syntactic word embeddings
8. Conclusions
Acknowledgements
Author contributions
Notes
Abbreviations
References

Available under the Creative Commons Attribution (CC BY) 4.0 license.

For any use beyond this license, please contact the publisher.

Published online: 2 July 2024

