Comparison and combination of VARD and character-based statistical machine translation: Spelling normalisation of Late Modern English

Schneider, Gerold

doi:10.1075/slcs.214.11sch

Part of

Late Modern English: Novel encounters
Edited by Merja Kytö and Erik Smitterberg
[Studies in Language Companion Series 214] 2020
pp. 243–268

Gerold Schneider | University of Zurich

To be able to profit from natural language processing (NLP) tools for analysing historical text, an important step is spelling normalisation. We first compare and then combine two different approaches: on the one hand VARD, a rule-based system which is based on dictionary lookup and rules with non-probabilistic but trainable weights; on the other hand a language-independent approach to spelling normalisation based on statistical machine translation (SMT) techniques. The rule-based system reaches the best accuracy, up to 94% precision at 74% recall, while the SMT system improves each tested period. We obtain the best system by combining both approaches. Re-training VARD on specific time-periods and domains is beneficial, and both systems benefit from a language sequence model using collocation strength.
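The precision and recall figures quoted in the abstract can be made concrete with a small sketch. It assumes one common way of scoring normalisation, which may differ from the chapter's own evaluation: precision is the share of the system's changes that match the gold standard, and recall is the share of tokens needing a change that the system corrected. The function name `normalisation_scores` and the toy tokens are hypothetical and not taken from the chapter.

```python
def normalisation_scores(original, predicted, gold):
    """Return (precision, recall) for aligned token lists.

    Assumed definitions (not taken from the chapter):
      precision = correct changes / changes the system made
      recall    = correct changes / changes the gold standard requires
    """
    if not (len(original) == len(predicted) == len(gold)):
        raise ValueError("token lists must be aligned and of equal length")

    changed = 0  # tokens the system altered
    needed = 0   # tokens whose gold form differs from the original
    correct = 0  # altered tokens that match the gold form
    for orig, pred, ref in zip(original, predicted, gold):
        if pred != orig:
            changed += 1
            if pred == ref:
                correct += 1
        if ref != orig:
            needed += 1

    precision = correct / changed if changed else 0.0
    recall = correct / needed if needed else 0.0
    return precision, recall


# Invented toy data, for illustration only.
original = ["shew", "the", "publick", "musick", "to-day"]
gold = ["show", "the", "public", "music", "today"]
predicted = ["show", "the", "public", "musick", "to day"]

precision, recall = normalisation_scores(original, predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.50
```

In this toy case the system makes three changes, two of them correct, and four tokens need normalising, so a cautious system can score high precision while leaving recall lower, which is the kind of trade-off the abstract reports.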

Article outline

1. Introduction
2. Data and annotation
- 2.1 The ARCHER corpus
- 2.2 Training
3. Methods
- 3.1 VARD2
- 3.2 Statistical machine translation (SMT)
- 3.3 Ensemble systems
- 3.4 Collocations
4. Performance of the individual systems
- 4.1 Performance of VARD2
- 4.2 Performance of SMT
- 4.3 VARD2 and SMT in comparison
5. Majority voting ensemble system
6. Adding a language sequence model
7. Application: Data-driven historical linguistics
- 7.1 Concordancing: The impact of normalisation on recall
- 7.2 Overuse of lexis and POS tags
8. Conclusions
Acknowledgements
Note
References

Published online: 18 March 2020

https://doi.org/10.1075/slcs.214.11sch

References (31)

Aston, Guy & Burnard, Lou. 1998. The BNC Handbook. Exploring the British National Corpus with SARA. Edinburgh: EUP.

Baron, Alistair & Rayson, Paul. 2008. VARD 2: A tool for dealing with spelling variation in historical corpora. In Proceedings of the Postgraduate Conference in Corpus Linguistics, Birmingham. Aston University, 22 May. <[URL]> (29 April 2019).

Biber, Douglas, Finegan, Edward & Atkinson, Dwight. 1994. ARCHER and its challenges: Compiling and exploring a representative corpus of historical English registers. In Creating and Using English Language Corpora. Papers from the 14th International Conference on English Language Research on Computerized Corpora, Zurich 1993, Udo Fries, Peter Schneider & Gunnel Tottie (eds), 1–13. Amsterdam: Rodopi.

Dietterich, Thomas G. 1997. Machine learning research: Four current directions. AI Magazine 18(4): 97–136.

Elsness, Johan. 1997. The Perfect and the Preterite in Contemporary and Earlier English [Topics in English Linguistics 21]. Berlin: Mouton de Gruyter.

Evert, Stefan. 2008. Corpora and collocations. In Corpus Linguistics. An International Handbook, Anke Lüdeling & Merja Kytö (eds), 1212–1248. Berlin: Mouton de Gruyter.

Fries, Udo. 2010. Sentence length, sentence complexity and the noun phrase in the 18th-century news publication. In Language Change and Variation from Old English to Late Modern English: A Festschrift for Minoji Akimoto, Merja Kytö, John Scahill & Harumi Tanabe (eds), 21–34. Bern: Peter Lang.

van Halteren, Hans, Daelemans, Walter & Zavrel, Jakub. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics 27(2): 199–229.

Helgadóttir, Sigrún. 2004. Testing data-driven learning algorithms for PoS tagging of Icelandic. In Nordisk Sprogteknologi 2004, Henrik Holmboe (ed.), 257–265. Copenhagen: Museum Tusculanums Forlag.

Hilpert, Martin & Gries, Stefan T. 2016. Quantitative approaches to diachronic corpus linguistics. In The Cambridge Handbook of English Historical Linguistics, Merja Kytö & Päivi Pahta (eds), 36–53. Cambridge: CUP.

Hundt, Marianne, Denison, David & Schneider, Gerold. 2012. Retrieving relatives from historical data. Literary and Linguistic Computing 27(1): 3–16.

Jenset, Gard Buen. 2010. A Corpus-Based Study on the Evolution of There: Statistical Analysis and Cognitive Interpretation. PhD dissertation, University of Bergen.

Koehn, Philipp, Hoang, Hieu, Birch, Alexandra, Callison-Burch, Chris, Federico, Marcello, Bertoldi, Nicola, Cowan, Brooke, Shen, Wade, Moran, Christine, Zens, Richard, Dyer, Chris, Bojar, Ondrej, Constantin, Alexandra & Herbst, Evan. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, 177–180. Prague: ACL.

Labov, William. 1969. Contraction, deletion, and inherent variability of the English copula. Language 45(4): 715–762.

Leech, Geoffrey, Hundt, Marianne, Mair, Christian & Smith, Nicholas. 2009. Change in Contemporary English. A Grammatical Study. Cambridge: CUP.

Lehmann, Hans Martin & Schneider, Gerold. 2012. BNC Dependency Bank 1.0. In Aspects of Corpus Linguistics: Compilation, Annotation, Analysis, Signe Oksefjell Ebeling, Jarle Ebeling & Hilde Hasselgård (eds). Helsinki: VARIENG. <[URL]>

Loftsson, Hrafn. 2008. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics 31(1): 47–72.

López-Couso, Maria, Aarts, Bas & Méndez-Naya, Belén. 2012. Late Modern English syntax. In Historical Linguistics of English: An International Handbook, Vol. 1 [Handbooks of Linguistics and Communication Science [HSK] 34.1], Alexander Bergs & Laurel J. Brinton (eds), 869–887. Berlin: Mouton de Gruyter.

Marcus, Mitch, Kim, Grace, Marcinkiewicz, M. A., MacIntyre, Robert, Bies, Ann, Ferguson, Mark, Katz, Karen & Schasberger, Britta. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology (HLT ’94), 114–119.

Och, Franz Josef & Ney, Hermann. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1): 19–51.

Pettersson, Eva, Megyesi, Beáta & Tiedemann, Jörg. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the NoDaLiDa 2013 workshop on Computational Historical Linguistics. <[URL]> (29 April 2019).

Rayson, Paul, Archer, Dawn, Baron, Alistair, Culpeper, Jonathan & Smith, Nicholas. 2007. Tagging the bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of Corpus Linguistics 2007. University of Birmingham, UK. <[URL]> (29 April 2019).

Rissanen, Matti. 1999. Chapter 3: Syntax. In The Cambridge History of the English Language, Vol. III: 1476–1776, Roger Lass (ed.), 187–331. Cambridge: CUP.

Rosenbach, Anette. 2014. English genitive variation – The state of the art. English Language and Linguistics 18(2): 215–262.

Röthlisberger, Melanie & Schneider, Gerold. 2013. Of-genitive versus s-genitive: A corpus-based analysis of possessive constructions in 20th-century English. In Korpuslinguistik und Interdisziplinäre Perspektiven auf Sprache – Corpus Linguistics and Interdisciplinary Perspectives on Language (CLIP), Paul Bennett, Silke Scheible & Richard J. Whitt (eds), 163–180. Stuttgart: Narr Francke Attempto.

Samuelsson, Christer & Voutilainen, Atro. 1997. Comparing a linguistic and a stochastic tagger. In Proceedings of ACL/EACL Joint Conference, Madrid.

Scheible, Silke, Whitt, Richard J., Durrell, Martin & Bennett, Paul. 2011. Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text. In Proceedings of the ACL-HLT 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2011), Portland OR.

Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, 44–49. Manchester. <[URL]>

Schneider, Gerold, Hundt, Marianne & Oppliger, Rahel. 2016. Part-Of-Speech in historical corpora: Tagger evaluation and ensemble systems on ARCHER. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS), Bochum, Germany, September 19–21, 2016, Stefanie Dipper, Friedrich Neubarth & Heike Zinsmeister (eds), 256–264.

Schneider, Gerold, Pettersson, Eva & Percillier, Michael. 2017. Comparing rule-based and SMT-based spelling normalisation for English historical texts. Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language 133, 40–46.

Tiedemann, Jörg. 2009. Character-based PSMT for closely related languages. Proceedings of 13th Annual Conference of the European Association for Machine Translation (EAMT’09), 12–19.

Cited by (1)

Schneider, Gerold. 2022. Systematically Detecting Patterns of Social, Historical and Linguistic Change: The Framing of Poverty in Times of Poverty. Transactions of the Philological Society 120(3): 447 ff.

This list is based on CrossRef data as of 28 June 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.