Using the net to fish for linguistic data
The paper proposes a methodology for collecting “open-source” corpora, i.e. corpora that are automatically collected from the Internet and distributed as a list of links, together with open-source software for recreating their full text. The result is a random snapshot of Internet pages that contain stretches of connected text in a given language. The paper discusses a methodology for acquiring such corpora, two ways of documenting them (by a set of metatextual categories and by comparison with frequency lists from existing corpora), and their function as benchmarks for comparing the results of linguistic inquiry. Experiments with a variety of languages show that Internet-derived corpora can be used successfully in the absence of large representative corpora, which are rare and expensive to build.
Cited by 30 other publications
This list is based on CrossRef data as of 14 January 2023. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.