Lima or cima?: Structure recognition and OCR in building the corpus of the Austrian Alpine Club Journal

Posch, Claudia; Rampl, Gerhard

doi:10.1075/ijcl.19094.pos

Article published in:

International Journal of Corpus Linguistics
Vol. 25:4 (2020), pp. 489–503

Short paper

Claudia Posch | University of Innsbruck

Gerhard Rampl | University of Innsbruck

This paper outlines the construction of Alpenwort, a large, genre-based corpus of German texts on alpinism, and reports on issues encountered in building it from the Austrian Alpine Club Journal (1869–2010). We first describe our data and the project phases, from digitization and annotation to publication. We then focus on the most interesting challenges that the diverse layouts and the extensive use of the Fraktur typeface posed for optical layout recognition, optical character recognition (OCR), and OCR post-correction. The corrected data was lemmatized, annotated with part-of-speech information, including named entities, and supplied with TEI-conformant metadata. The resulting 19.9-million-word corpus is designed to be queried using CQPweb and Hyperbase and is freely accessible online. Lastly, we give a short roadmap of current and future expansions and improvements, as the corpus data has been, and is being, enhanced in follow-up projects.

Keywords: German Fraktur typeface, document structure recognition, OCR, alpinism, specialized corpora
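
To make the idea of OCR post-correction concrete, the following is a minimal sketch of one common approach, lexicon lookup combined with character-confusion substitution. It is not the procedure used in the Alpenwort project: the confusion table, the function names `candidates` and `correct`, and the toy lexicon are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical confusion table (an assumption, not taken from the paper):
# maps a character as output by OCR to the characters it may stand for.
CONFUSIONS = {
    "f": ["f", "s"],  # e.g. a long s misread as f
    "n": ["n", "u"],
    "c": ["c", "e"],
}


def candidates(token, confusions=CONFUSIONS):
    """Yield every string obtainable by swapping confusable characters."""
    options = [confusions.get(ch, [ch]) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def correct(token, lexicon, confusions=CONFUSIONS):
    """Return a corrected token only if exactly one lexicon word matches."""
    if token in lexicon:
        return token  # already a known word, leave it alone
    matches = {c for c in candidates(token, confusions) if c in lexicon}
    # Ambiguous or unmatched tokens are left unchanged for manual review.
    return matches.pop() if len(matches) == 1 else token


if __name__ == "__main__":
    lexicon = {"Haus", "Gipfel", "See"}  # toy lexicon for illustration
    for word in ["Hanf", "Gipfel", "Sce", "Xyz"]:
        print(word, "->", correct(word, lexicon))
```

The sketch deliberately refuses to guess when several lexicon words match, since an automatic choice between two valid words can introduce new errors. The number of candidates grows exponentially with the number of confusable characters in a token, so a practical system would need to cap or rank candidates.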

Article outline

1. Introduction
2. Data and project phases
3. Scanning, structure recognition, and OCR
4. XML export
5. OCR post-correction
6. Linguistic mark-up and annotation
7. Conclusions and prospects
Acknowledgements
References

Published online: 27 October 2020

Cited by two other publications

Posch, Claudia

2023. Half-Witted or Hard-Working-Fun-Loving Women? – A Corpus-Assisted Study of Gendered Collocation in the New Zealand Alpine Club Journal Corpus. Zeitschrift für Anglistik und Amerikanistik 71:3, pp. 241 ff.

Posch, Claudia

2023. Women, Who Climb - A Corpus Linguistic Tour Description with Potential Danger Zones. Gender a výzkum / Gender and Research 23:2, pp. 82 ff.

This list is based on CrossRef data as of 4 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.