An automatic part-of-speech tagger for Middle Low German

Koleva, Mariya; Farasyn, Melissa; Desmet, Bart; Breitbarth, Anne; Hoste, Véronique

doi:10.1075/ijcl.22.1.05kol

Article published in:

International Journal of Corpus Linguistics
Vol. 22:1 (2017), pp. 107–140

Mariya Koleva | Ghent University

Melissa Farasyn | Ghent University

Bart Desmet | Ghent University

Anne Breitbarth | Ghent University

Véronique Hoste | Ghent University

Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalized spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.

Keywords: historical linguistics, part-of-speech tagging, conditional random fields, feature selection, normalization

Article outline

1. Introduction
2. Related work on historical corpora of German
- 2.1 Other varieties of historical Low German
- 2.2 Related language varieties of the same period and geographical area
3. The Corpus of Historical Low German
- 3.1 Middle Low German
- 3.2 Purpose of the corpus
- 3.3 Corpus design
4. Methodology
- 4.1 Tagset
- 4.2 Standard experimental set-up
- 4.3 Experimental data
5. Results and discussion
- 5.1 In-domain experiments
- 5.2 Cross-city robustness
- 5.3 Feature informativeness
- 5.4 Cross-genre robustness
- 5.5 Error analysis
6. Improving tagging accuracy: The impact of spelling normalization and morphological information
- 6.1 Corpus subset and baseline
- 6.2 The effects of normalization on tagging accuracy
- 6.3 The effects of morphological information on POS tagging
7. Conclusions
Notes
References

Published online: 28 July 2017

https://doi.org/10.1075/ijcl.22.1.05kol

Cited by (2)

Barteld, Fabian, Chris Biemann & Heike Zinsmeister. 2019. Token-based spelling variant detection in Middle Low German texts. Language Resources and Evaluation 53:4, pp. 677 ff.

Farasyn, Melissa, George Walkden, Sheila Watts & Anne Breitbarth. 2018. The interplay between genre variation and syntax in a historical Low German corpus. In Diachronic Corpora, Genre, and Language Change [Studies in Corpus Linguistics, 85], pp. 281 ff.

This list is based on CrossRef data as of 5 August 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.