Corpus analysis

Jan Aarts

Roughly, the data available for linguistic research stem from one of two sources: intuitions about language or observations of linguistic events. Collections of data of the latter kind are called corpora. Although corpus data have been used throughout the history of linguistic research, the real breakthrough in their use came in the course of the 20th century, when it became possible to store and search large quantities of text electronically. In the second half of that century, the use of corpus data in their new form was stimulated by the dissatisfaction some linguists felt with the mainstream preference for intuitive data. Positions on the appropriateness of corpus data or intuitive data for linguistic research have occasionally been quite extreme, but the best policy for any linguist is probably to regard the two as complementary rather than opposed. It must be borne in mind, however, that corpus data reflect what people actually say and write, and as such provide the most appropriate data for linguists who want to investigate the use of language rather than linguistic competence or linguistic universals. Since the study of language use is concerned not only with describing what people actually say and write, but also with the question of why, in a given verbal or situational context, they use one linguistic construct rather than another, a collection of linguistic events has to meet at least two conditions to count as a corpus. The first is that it should present a faithful record of the utterances contained in running texts (rather than, say, a collection of examples of a particular linguistic phenomenon); the second is that it should record by whom, where, when and why the texts were produced.
In other words, apart from a record of utterances, a corpus should contain the fullest possible information about the verbal and situational contexts in which the utterances were produced. The fact that corpora are repositories of language use entails that corpus-based studies are naturally biased towards the study of specific languages, genres and language varieties.


