Vol. 25:4 (2020), pp. 489–503
Lima or cima?
Structure recognition and OCR in building the corpus of the Austrian Alpine Club Journal
This paper outlines the construction of the corpus Alpenwort, a large, genre-based corpus of German texts on alpinism. We report on issues related to building the corpus from the Austrian Alpine Club Journal (1869–2010). First, we give a general description of our data and of the project phases, from digitization and annotation to publication. We then focus on the most interesting challenges that the diverse layouts and the extensive use of Fraktur typeface posed for optical layout recognition and optical character recognition (OCR), as well as for post correction. The corrected data was lemmatized and annotated with part-of-speech information, including named entities, as well as with TEI-conformant metadata. The resulting 19.9-million-word corpus is designed to be queried using CQPweb and Hyperbase and can be accessed freely online. Lastly, we give a short roadmap of current and future expansions and improvements, as the corpus data has been, and is still being, enhanced in follow-up projects.
Article outline
- 1. Introduction
- 2. Data and project phases
- 3. Scanning, structure recognition, and OCR
- 4. XML export
- 5. OCR post correction
- 6. Linguistic mark-up and annotation
- 7. Conclusions and prospects
- Acknowledgements
- References
https://doi.org/10.1075/ijcl.19094.pos