Spelling normalisation of Late Modern English
Comparison and combination of VARD and character-based statistical machine translation
Spelling normalisation is an important step towards applying natural
language processing (NLP) tools to historical text. We first compare
and then combine two approaches: VARD, a rule-based system that
relies on dictionary lookup and rules with non-probabilistic but
trainable weights, and a language-independent approach to spelling
normalisation based on statistical machine translation (SMT)
techniques. The rule-based system reaches the best accuracy, up to
94% precision at 74% recall, while the SMT system yields improvements
in each tested period. We obtain the best system by combining both
approaches. Re-training VARD on specific time periods and domains is
beneficial, and both systems benefit from a language sequence model
using collocation strength.
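The precision and recall figures quoted above can be made concrete with a small sketch. It assumes one common convention for scoring normalisation: precision is the share of system changes that match the gold form, and recall is the share of tokens needing a change that the system corrected. The article's own definitions may differ. The function name `normalisation_scores` and the example tokens are hypothetical and do not come from the ARCHER data.

```python
def normalisation_scores(original, predicted, gold):
    """Return (precision, recall) for token-aligned spelling normalisation.

    Assumed convention (not necessarily the article's):
      precision = correct changes / changes made by the system
      recall    = correct changes / changes required by the gold standard
    """
    if not (len(original) == len(predicted) == len(gold)):
        raise ValueError("token lists must be aligned and of equal length")

    changed = 0  # tokens the system altered
    needed = 0   # tokens whose gold form differs from the original
    correct = 0  # altered tokens that match the gold form

    for orig_tok, pred_tok, gold_tok in zip(original, predicted, gold):
        if orig_tok != gold_tok:
            needed += 1
        if pred_tok != orig_tok:
            changed += 1
            if pred_tok == gold_tok:
                correct += 1

    precision = correct / changed if changed else 0.0
    recall = correct / needed if needed else 0.0
    return precision, recall


# Invented toy example: the system fixes two variants, misses one,
# and wrongly alters a token that needed no change.
original = ["the", "publick", "musick", "shew", "to-day"]
predicted = ["the", "public", "music", "shew", "today"]
gold = ["the", "public", "music", "show", "to-day"]

precision, recall = normalisation_scores(original, predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this convention a cautious system that changes few tokens can score high precision with lower recall, which is the kind of trade-off the reported figures describe.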
Article outline
- 1. Introduction
- 2. Data and annotation
- 2.1 The ARCHER corpus
- 2.2 Training
- 3. Methods
- 3.1 VARD2
- 3.2 Statistical machine translation (SMT)
- 3.3 Ensemble systems
- 3.4 Collocations
- 4. Performance of the individual systems
- 4.1 Performance of VARD2
- 4.2 Performance of SMT
- 4.3 VARD2 and SMT in comparison
- 5. Majority voting ensemble system
- 6. Adding a language sequence model
- 7. Application: Data-driven historical linguistics
- 7.1 Concordancing: The impact of normalisation on recall
- 7.2 Overuse of lexis and POS tags
- 8. Conclusions
- Acknowledgements
- Note
- References