The advantage of using relational databases for large corpora: Speed, advanced queries, and unlimited annotation

Davies, Mark

doi:10.1075/ijcl.10.3.02dav

Article published in:

International Journal of Corpus Linguistics
Vol. 10:3 (2005), pp. 307–334

Mark Davies | Brigham Young University

Relational databases can be used to create large corpora that offer both fast searches and a wide range of query types. This paper outlines how this approach has been used to create the Corpus del Español, which contains 100 million words of Spanish text from the 1200s–1900s. The main databases are composed of n-gram tables (all unique 1-, 2-, 3-, and 4-word sequences) together with the frequency of each n-gram in each century (historical Spanish) and register (Modern Spanish). These tables are then joined to other tables containing part of speech, lemma, synonyms, and user-defined lists of words and lemmas. There is essentially no limit to the amount of annotation that can be added in additional tables (with little or no impact on performance), and the SQL-based queries allow a wide range of searches that are not available with traditional corpora.
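The n-gram-plus-annotation design described in the abstract can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the table names (`lexicon`, `bigrams`), the column names, the two century columns, and the toy frequencies are hypothetical and are not taken from the Corpus del Español itself. The sketch only shows the general idea of joining a frequency table of word sequences to a separate table of part-of-speech and lemma annotation.

```python
import sqlite3

# Hypothetical schema for illustration only; the abstract does not give
# the actual table or column names used in the Corpus del Español.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexicon (            -- one row per word form
        word  TEXT PRIMARY KEY,
        lemma TEXT NOT NULL,
        pos   TEXT NOT NULL
    );
    CREATE TABLE bigrams (            -- one row per unique 2-word sequence
        w1         TEXT NOT NULL,
        w2         TEXT NOT NULL,
        freq_1200s INTEGER NOT NULL,  -- frequency in 1200s texts
        freq_1900s INTEGER NOT NULL,  -- frequency in 1900s texts
        PRIMARY KEY (w1, w2)
    );
""")

# Invented toy data, not real corpus counts.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
    ("casa", "casa", "noun"), ("casas", "casa", "noun"),
    ("grande", "grande", "adj"), ("grandes", "grande", "adj"),
    ("blanca", "blanco", "adj"), ("la", "el", "det"),
])
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?, ?)", [
    ("casa", "grande", 3, 12), ("casas", "grandes", 1, 7),
    ("casa", "blanca", 0, 9), ("la", "casa", 20, 45),
])


def adjectives_after(conn, noun_lemma):
    """Return (adjective lemma, 1200s freq, 1900s freq) for every adjective
    that directly follows any form of the given noun lemma."""
    rows = conn.execute("""
        SELECT l2.lemma, SUM(b.freq_1200s), SUM(b.freq_1900s)
        FROM bigrams AS b
        JOIN lexicon AS l1 ON l1.word = b.w1
        JOIN lexicon AS l2 ON l2.word = b.w2
        WHERE l1.lemma = ? AND l2.pos = 'adj'
        GROUP BY l2.lemma
        ORDER BY SUM(b.freq_1900s) DESC
    """, (noun_lemma,))
    return rows.fetchall()


for lemma, f1200, f1900 in adjectives_after(conn, "casa"):
    print(lemma, f1200, f1900)
```

In a layout like this, further annotation (synonyms, user-defined word lists) would be added as additional tables keyed on `word` or `lemma` and joined in the same way, leaving the n-gram table itself unchanged.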

Keywords: n-grams, Spanish, historical, relational databases, SQL

Published online: 1 September 2005

https://doi.org/10.1075/ijcl.10.3.02dav

Cited by 11 other publications

File-Muriel, Richard J.

2023. Phonetics, Phonology, and Usage-Based Approaches. In The Handbook of Usage-Based Linguistics, pp. 107 ff.

Haas, Timothy C.

2021. The First Political-Ecological Database and Its Use in Episode Analysis. Frontiers in Conservation Science 2

Lavid-López, Julia

2021. Corpus resources and tools. In Corpora in Translation and Contrastive Research in the Digital Age [Benjamins Translation Library, 158], pp. 1 ff.

Zięba, Anna

2018. Google Books Ngram Viewer in Socio-Cultural Research. Research in Language 16:3, pp. 357 ff.

Arkhangel’skii, T. A. & O. A. Sozinova

2015. A multimedia corpus of the Yiddish language. Automatic Documentation and Mathematical Linguistics 49:2, pp. 47 ff.

Upeksha, Dimuthu, Chamila Wijayarathna, Maduranga Siriwardena, Lahiru Lasandun, Chinthana Wimalasuriya, N. H. N. D. de Silva & Gihan Dias

2015. Comparison Between Performance of Various Database Systems for Implementing a Language Corpus. In Beyond Databases, Architectures and Structures [Communications in Computer and Information Science, 521], pp. 82 ff.

Huo, Yan Juan

2014. Computer Aided Design of Chinese College English Teaching Materials Based on COCA Corpus. Applied Mechanics and Materials 590, pp. 916 ff.

Duchon, Andrew, Manuel Perea, Nuria Sebastián-Gallés, Antonia Martí & Manuel Carreiras

2013. EsPal: One-stop shopping for Spanish word properties. Behavior Research Methods 45:4, pp. 1246 ff.

Kratky, Michal, Radim Baca, David Bednar, Jiri Walder, Jiri Dvorsky & Peter Chovanec

2011. 2011 Sixth International Conference on Digital Information Management, pp. 73 ff.

Gries, Stefan Th.

2009. What is Corpus Linguistics? Language and Linguistics Compass 3:5, pp. 1225 ff.

Newman, John, Jingxia Lin, Terry Butler & Eric Zhang

2007. The Wenzhou Spoken Corpus. Corpora 2:1, pp. 97 ff.

This list is based on CrossRef data as of 11 September 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.