Roughly speaking, the data available for linguistic research come from one of two sources: intuitions about language or observations of linguistic events. Collections of data of the latter kind are called corpora. Although corpus data have been used throughout the history of linguistic research, the real breakthrough in their use came in the course of the 20th century, when it became possible to store and search large quantities of text electronically. In the second half of that century, the use of corpus data in their new form was stimulated by the dissatisfaction some linguists felt with the mainstream preference for intuitive data. Positions on the appropriateness of either corpus data or intuitive data for linguistic research have occasionally been quite extreme, but the best policy for any linguist is probably to regard the two as complementary rather than opposed. It must be borne in mind, however, that corpus data reflect what people actually say and write, and as such provide the most appropriate data for linguists who want to investigate the use of language rather than linguistic competence or linguistic universals. The study of language use is concerned not only with describing what people actually say and write, but also with the question of why, in a given verbal or situational context, they use one linguistic construct rather than another. It follows that a collection of linguistic events must meet at least two conditions to count as a corpus. The first is that it should present a faithful record of the utterances contained in running texts (rather than, say, a collection of examples of a particular linguistic phenomenon); the second is that it should give information about by whom, where, when and why the texts were produced.
In other words, apart from a record of utterances, a corpus should contain the fullest possible information about the verbal and situational contexts in which the utterances were produced. The fact that corpora are repositories of language use entails that corpus-based studies are naturally biased towards the study of specific languages, genres and language varieties.