The TV and Movies corpora
Design, construction, and use
This paper discusses the creation and use of the TV Corpus (subtitles from 75,000 episodes, 325 million words, 6 English-speaking
countries, 1950s–2010s) and the Movies Corpus (subtitles from 25,000 movies, 200 million words, 6 English-speaking countries,
1930s–2010s), both of which are available at English-Corpora.org. The corpora compare well to the BNC-Conversation data in terms of
informality, lexis, phraseology, and syntax. At 525 million words in total, however, they are more than 30 times as large as
BNC-Conversation (BNC1994 and BNC2014 combined), which means that they can be used to examine a wide range of linguistic phenomena.
The TV and Movies corpora also allow useful comparisons of very informal language across time (with texts from the 1930s onwards for
movies and from the 1950s onwards for TV shows) and between dialects of English (such as British and American English).
Article outline
- 1. Introduction
- 2. Rationale for the TV and Movies corpora
- 3. Creating the TV and Movies corpora
- 4. Using metadata to create “Virtual Corpora”
- 5. Informal nature of the language in the TV and Movies corpora
- 6. Dialectal and historical variation in English
  - 6.1 Dialectal differences
  - 6.2 Change over time
- 7. Conclusion
- Note
- References