Corpus analysis

Jan Aarts

Roughly, the data available for linguistic research stem from one of two sources: intuitions about language or observations of linguistic events. Collections of data of the latter kind are called corpora. Although corpus data have been used throughout the history of linguistic research, a real breakthrough in their use came in the course of the 20th century, when it became possible to store and search large quantities of text electronically. In the second half of that century, the use of corpus data in their new form was stimulated by the dissatisfaction felt by some with the preference of the linguistic mainstream for intuitive data. Positions taken on the appropriateness of either corpus data or intuitive data for linguistic research have occasionally been quite extreme, but the best policy for any linguist is probably to regard the two as complementary rather than in opposition to each other.

It must be borne in mind, however, that corpus data reflect what people actually say and write, and as such provide the most appropriate data for linguists who want to investigate the use of language rather than linguistic competence or linguistic universals. The study of language use is concerned not only with the description of what people actually say and write, but also with the question of why, in a given verbal or situational context, they use one linguistic construct rather than another. It follows that for a collection of linguistic events to be a corpus, it has to meet at least two conditions. The first is that it should present a faithful record of the utterances contained in running texts (rather than, say, a collection of examples of a particular linguistic phenomenon); the second is that it should give information about by whom, where, when and why the texts were produced. In other words, apart from a record of utterances, a corpus should contain the fullest possible information about the verbal and situational contexts in which the utterances were produced. The fact that corpora are repositories of language use entails that corpus-based studies are naturally biased towards the study of specific languages, genres and language varieties.
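
The two conditions can be illustrated with a minimal sketch in Python of how an electronically stored corpus might pair each running text with a record of its situational context. It is not a description of any existing corpus format: the names CorpusText and find_word, the metadata fields and the sample texts are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CorpusText:
    """One running text plus the situational context of its production.

    All field names are illustrative assumptions, not a standard schema.
    """
    utterances: List[str]  # condition 1: the running text, in original order
    producer: str          # condition 2: by whom
    place: str             # where
    date: str              # when
    purpose: str           # why


def find_word(corpus: List[CorpusText], word: str) -> List[Tuple[str, str, str]]:
    """Return (producer, date, utterance) for every utterance containing `word`.

    Matching is case-insensitive and ignores punctuation attached to a token.
    """
    target = word.lower()
    hits = []
    for text in corpus:
        for utterance in text.utterances:
            tokens = [t.strip(".,;:!?\"'").lower() for t in utterance.split()]
            if target in tokens:
                hits.append((text.producer, text.date, utterance))
    return hits


# Invented sample data, for illustration only.
corpus = [
    CorpusText(
        utterances=["Well, I might come tomorrow.", "It depends on the weather."],
        producer="speaker A",
        place="London",
        date="1990-05-01",
        purpose="casual conversation",
    ),
    CorpusText(
        utterances=["The committee might reconsider its decision."],
        producer="writer B",
        place="Nijmegen",
        date="1991-03-12",
        purpose="newspaper report",
    ),
]

hits = find_word(corpus, "might")
for producer, date, utterance in hits:
    print(producer, date, utterance)
```

Because every hit is returned together with its producer and date, a search result stays tied to the context in which the utterance was produced, which is what distinguishes a corpus in the sense above from a bare collection of examples.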
