This paper discusses the creation and use of the TV Corpus (subtitles from 75,000 episodes, 325 million words, 6
English-speaking countries, 1950s–2010s) and the Movies Corpus (subtitles from 25,000 movies, 200 million words, 6 English-speaking
countries, 1930s–2010s), both of which are available at English-Corpora.org. The corpora compare
well to the BNC-Conversation data in terms of informality, lexis, phraseology, and syntax. At 525 million words combined, however, they are
more than 30 times as large as BNC-Conversation (BNC1994 and BNC2014 combined), which means that they can be used to investigate a wide
range of linguistic phenomena. The TV and Movies corpora also allow useful comparisons of very informal language across time (the
movies date from the 1930s onwards and the TV shows from the 1950s onwards) and between dialects of English (such as British and
American English).