Open-source Corpora
Using the net to fish for linguistic data
Serge Sharoff | University of Leeds
The paper proposes a methodology for collecting “open-source” corpora, i.e. corpora that are automatically collected from the Internet and distributed as a list of links together with open-source software for recreating their full text. The result is a random snapshot of Internet pages that contain stretches of connected text in a given language. The paper discusses the methodology for acquiring such corpora, two ways of documenting them (by a set of metatextual categories and by comparison to frequency lists from existing corpora), and their function as benchmarks for comparing the results of linguistic inquiry. Experiments with a variety of languages show that Internet-derived corpora can be used successfully in the absence of large representative corpora, which are rare and expensive to build.
Keywords: representative corpora, frequency lists, Internet, corpus composition
Published online: 08 December 2006
https://doi.org/10.1075/ijcl.11.4.05sha