The TV and Movies corpora: Design, construction, and use

Davies, Mark

doi:10.1075/ijcl.00035.dav

Article published in:

Corpus approaches to telecinematic language
Edited by Monika Bednarek, Valentin Werner and Marcia Veirano Pinto
[International Journal of Corpus Linguistics 26:1] 2021
pp. 10–37

Mark Davies | Brigham Young University

This paper discusses the creation and use of the TV Corpus (subtitles from 75,000 episodes, 325 million words, 6 English-speaking countries, 1950s–2010s) and the Movies Corpus (subtitles from 25,000 movies, 200 million words, 6 English-speaking countries, 1930s–2010s), both of which are available at English-Corpora.org. The corpora compare well to the BNC-Conversation data in terms of informality, lexis, phraseology, and syntax. At 525 million words in total, however, they are more than 30 times as large as BNC-Conversation (BNC1994 and BNC2014 combined), which means that they can be used to examine a wide range of linguistic phenomena. The TV and Movies corpora also allow useful comparisons of very informal language across time (with texts from the 1930s onwards for movies and from the 1950s onwards for TV shows) and between dialects of English (such as British and American English).

Keywords: TV, movies, diachronic, dialects, speech

Article outline

1. Introduction
2. Rationale for the TV and Movies corpora
3. Creating the TV and Movies corpora
4. Using metadata to create “Virtual Corpora”
5. Informal nature of the language in the TV and Movies corpora
6. Dialectal and historical variation in English
- 6.1 Dialectal differences
- 6.2 Change over time
7. Conclusion
Note
References

Published online: 17 November 2020

