Chapter published in:
Late Modern English: Novel encounters
Edited by Merja Kytö and Erik Smitterberg
[Studies in Language Companion Series 214] 2020
pp. 244–268
Spelling normalisation of Late Modern English
Comparison and combination of VARD and character-based statistical machine translation
Gerold Schneider | University of Zurich
Spelling normalisation is an important step in making natural language
processing (NLP) tools usable for the analysis of historical text. We
first compare and then combine two different approaches: on the one
hand VARD, a rule-based system built on dictionary lookup and rules
with non-probabilistic but trainable weights; on the other hand a
language-independent approach to spelling normalisation based on
statistical machine translation (SMT) techniques. The rule-based system
reaches the best accuracy, up to 94% precision at 74% recall, while the
SMT system yields improvements for every period tested. We obtain the
best system by combining both approaches. Re-training VARD on specific
time periods and domains is beneficial, and both systems benefit from a
language sequence model using collocation strength.
Published online: 18 March 2020
https://doi.org/10.1075/slcs.214.11sch