Spelling normalisation of Late Modern English
Comparison and combination of VARD and character-based statistical machine translation
Spelling normalisation is an important step in making natural language
processing (NLP) tools usable for the analysis of historical text. We
first compare and then combine two approaches: VARD, a rule-based
system that relies on dictionary lookup and on rules with
non-probabilistic but trainable weights, and a language-independent
approach to spelling normalisation based on statistical machine
translation (SMT) techniques. The rule-based system reaches the best
accuracy, up to 94% precision at 74% recall, while the SMT system
improves results for each tested period. Combining both approaches
yields the best system. Re-training VARD on specific time periods and
domains is beneficial, and both systems benefit from a language
sequence model that uses collocation strength.
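To make the precision and recall figures concrete, the following is a minimal sketch of how token-level normalisation output might be scored and how two systems might be combined. It rests on assumptions not stated in the abstract: precision is taken as the share of changed tokens that match the gold form, recall as the share of tokens needing normalisation that were corrected, and the `combine` function is a hypothetical backoff rule, not the ensemble used in the article. The example tokens are invented for illustration.

```python
from typing import List, Tuple


def precision_recall(original: List[str], predicted: List[str],
                     gold: List[str]) -> Tuple[float, float]:
    """Score token-level spelling normalisation (assumed definitions).

    precision: share of changed tokens whose new form equals the gold form
    recall:    share of tokens needing a change that were correctly changed
    """
    changed = correct = needed = 0
    for orig, pred, ref in zip(original, predicted, gold):
        if orig != ref:
            needed += 1          # the token requires normalisation
        if pred != orig:
            changed += 1         # the system altered the token
            if pred == ref:
                correct += 1     # and the altered form is the gold form
    precision = correct / changed if changed else 0.0
    recall = correct / needed if needed else 0.0
    return precision, recall


def combine(original: List[str], rule_based: List[str],
            smt: List[str]) -> List[str]:
    """Hypothetical backoff: keep the rule-based form where that system
    changed the token, otherwise fall back to the SMT form."""
    return [rb if rb != o else s
            for o, rb, s in zip(original, rule_based, smt)]


# Invented toy data, not taken from ARCHER.
original = ["publick", "musick", "the", "shew", "onely"]
gold = ["public", "music", "the", "show", "only"]
rule_based = ["public", "musick", "the", "show", "onely"]
smt = ["public", "music", "thee", "shew", "only"]
combined = combine(original, rule_based, smt)

for name, output in [("rule-based", rule_based), ("SMT", smt),
                     ("combined", combined)]:
    p, r = precision_recall(original, output, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy setting the cautious system changes few tokens but changes them correctly, the more aggressive system covers more variants at the cost of a false change, and the backoff combination inherits the coverage of both.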
Article outline
- 1. Introduction
- 2. Data and annotation
- 2.1 The ARCHER corpus
- 2.2 Training
- 3. Methods
- 3.1 VARD2
- 3.2 Statistical machine translation (SMT)
- 3.3 Ensemble systems
- 3.4 Collocations
- 4. Performance of the individual systems
- 4.1 Performance of VARD2
- 4.2 Performance of SMT
- 4.3 VARD2 and SMT in comparison
- 5. Majority voting ensemble system
- 6. Adding a language sequence model
- 7. Application: Data-driven historical linguistics
- 7.1 Concordancing: The impact of normalisation on recall
- 7.2 Overuse of lexis and POS tags
- 8. Conclusions
- Acknowledgements
- Note
- References