Chapter 2. The Eurolect Observatory Multilingual Corpus: Construction and query tools

Tomatis, Marco Stefano

doi:10.1075/scl.86.02tom

Part of

Observing Eurolects: Corpus analysis of linguistic variation in EU law
Edited by Laura Mori
[Studies in Corpus Linguistics 86] 2018
pp. 27–45

This chapter explains the design of the Eurolect Observatory Multilingual Corpus and the steps required to build the monolingual corpora that the project needed to meet its research objectives. The first two sections after the general introduction point out the differences and overlaps among the corpora, which the author produced as a member of the UNINT research team and which were used for text mining in the Eurolect Observatory Project. After defining the data collection and corpus building strategies adopted, the chapter describes the corpus search tool that was developed to help scholars find and save text samples from the whole corpus conveniently.

Keywords: natural language processing, corpus linguistics, AWK, corpus search tool, regular expressions, markup

Article outline

1. Introduction
2. Corpus collection
- 2.1 Corpus A
- 2.2 Corpus B
3. Corpus search tools
- 3.1 Overview of the SearchIt tools
- 3.2 Main functions of the SearchIt tools
Notes
References

Published online: 6 December 2018

https://doi.org/10.1075/scl.86.02tom
