An automatic part-of-speech tagger for Middle Low German
Syntactically annotated corpora are highly important for large-scale diachronic and diatopic language research. Such corpora have recently been developed, or are still being developed, for a variety of historical languages. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which aims to facilitate research into the under-researched diachronic syntax of Low German. The present paper reports on a crucial step in building the corpus, namely the creation of a part-of-speech (POS) tagger for Middle Low German (MLG). Because MLG was transmitted in several non-standardised written varieties, it poses a challenge to standard POS taggers, which usually rely on normalized spelling. We outline the major issues faced in creating the tagger and present our solutions to them.
Article outline
- 1. Introduction
- 2. Related work on historical corpora of German
- 2.1 Other varieties of historical Low German
- 2.2 Related language varieties of the same period and geographical area
- 3. The Corpus of Historical Low German
- 3.1 Middle Low German
- 3.2 Purpose of the corpus
- 3.3 Corpus design
- 4. Methodology
- 4.1 Tagset
- 4.2 Standard experimental set-up
- 4.3 Experimental data
- 5. Results and discussion
- 5.1 In-domain experiments
- 5.2 Cross-city robustness
- 5.3 Feature informativeness
- 5.4 Cross-genre robustness
- 5.5 Error analysis
- 6. Improving tagging accuracy: The impact of spelling normalization and morphological information
- 6.1 Corpus subset and baseline
- 6.2 The effects of normalization on tagging accuracy
- 6.3 The effects of morphological information on POS tagging
- 7. Conclusions
- Notes
- References