Lima or cima?
Structure recognition and OCR in building the corpus of the Austrian Alpine Club Journal
Claudia Posch | University of Innsbruck
Gerhard Rampl | University of Innsbruck
This paper outlines the construction of the corpus Alpenwort, a large, genre-based corpus of German texts on alpinism. We
report on issues related to building the corpus from the Austrian Alpine Club Journal (1869–2010). First, a general
description of our data and the project phases, from digitization and annotation to publication, is given. We then focus on the
challenges that the diverse layouts and the extensive use of the Fraktur typeface posed for optical layout recognition, optical character
recognition (OCR), and post-correction. The corrected data was lemmatized and annotated with part-of-speech information, including
named entities, and with TEI-conformant metadata. The resulting 19.9-million-word corpus is designed to be queried using
CQPweb and Hyperbase and is freely accessible online. Lastly, we give a short roadmap of current and
future expansions and improvements, as the corpus data has been and is still being enhanced in follow-up projects.
Keywords: German Fraktur typeface, document structure recognition, OCR, alpinism, specialized corpora
Published online: 27 October 2020
https://doi.org/10.1075/ijcl.19094.pos