Comparing Corpora

Kilgarriff, Adam

doi:10.1075/ijcl.6.1.05kil

Article published in:

International Journal of Corpus Linguistics
Vol. 6:1 (2001) ► pp. 97–133

Adam Kilgarriff | ITRI, University of Brighton

Corpus linguistics lacks strategies for describing and comparing corpora. Currently most descriptions of corpora are textual, and questions such as ‘what sort of a corpus is this?’, or ‘how does this corpus compare to that?’ can only be answered impressionistically. This paper considers various ways in which different corpora can be compared more objectively. First we address the issue, ‘which words are particularly characteristic of a corpus?’, reviewing and critiquing the statistical methods which have been applied to the question and proposing the use of the Mann-Whitney ranks test. Results of two corpus comparisons using the ranks test are presented. Then, we consider measures for corpus similarity. After discussing limitations of the idea of corpus similarity, we present a method for evaluating corpus similarity measures. We consider several measures and establish that a χ²-based one performs best. All methods considered in this paper are based on word and n-gram frequencies; the strategy is defended.
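
To make the idea of a χ²-based measure over word frequencies concrete, here is a minimal sketch. It is not taken from the paper, and the abstract does not give the exact procedure. The function name `chi_square_statistic`, the `top_n` cutoff, and the choice of expected counts proportional to corpus size are all assumptions made for illustration.

```python
from collections import Counter


def chi_square_statistic(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora.

    Hypothetical helper for illustration only. Lower values suggest more
    similar word-frequency profiles; 0.0 means the profiles are proportional.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")

    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    joint = freq_a + freq_b  # combined frequency list of the two corpora

    stat = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if the word were spread in proportion to corpus size
        # (an assumption of this sketch).
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        stat += (freq_a[word] - expected_a) ** 2 / expected_a
        stat += (freq_b[word] - expected_b) ** 2 / expected_b
    return stat


if __name__ == "__main__":
    corpus_a = "the cat sat on the mat".split()
    corpus_b = "the dog sat on the log".split()
    print(chi_square_statistic(corpus_a, corpus_a))
    print(chi_square_statistic(corpus_a, corpus_b))
```

In this sketch a corpus compared with itself scores zero, and the score grows as the two frequency profiles diverge. In practice the result depends heavily on tokenisation and on the cutoff `top_n`.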

Keywords: similarity, homogeneity, word frequency

Published online: 17 December 2001

Cited by

Cited by 169 other publications

Adamov, Abzetdin Z.

2015. 2015 9th International Conference on Application of Information and Communication Technologies (AICT), ► pp. 76 ff.

Adamov, Abzetdin Z.

2018. 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT), ► pp. 1 ff.

Ahmed, Saifuddin, Kokil Jaidka & Jaeho Cho

2016. The 2014 Indian elections on Twitter: A comparison of campaign strategies of political parties. Telematics and Informatics 33:4 ► pp. 1071 ff.

Ancarno, Clyde

2015. When are public apologies ‘successful’? Focus on British and French apology press uptakes. Journal of Pragmatics 84 ► pp. 139 ff.

Anthony, L.

2005. IPCC 2005. Proceedings. International Professional Communication Conference, 2005., ► pp. 729 ff.

Anthony, L.

2006. Developing a Freeware, Multiplatform Corpus Analysis Toolkit for the Technical Writing Classroom. IEEE Transactions on Professional Communication 49:3 ► pp. 275 ff.

Babych, Bogdan, Fangzhong Su, Anthony Hartley, Ahmet Aker, Monica Lestari Paramita, Paul Clough & Robert Gaizauskas

2019. Cross-Language Comparability and Its Applications for MT. In Using Comparable Corpora for Under-Resourced Areas of Machine Translation [Theory and Applications of Natural Language Processing, ], ► pp. 13 ff.

Baker, Paul

2010. Will Ms ever be as frequent as Mr?. Gender and Language 4:1 ► pp. 125 ff.

Baker, Paul

2011. Times May Change, But We Will Always Have Money: Diachronic Variation in Recent British English. Journal of English Linguistics 39:1 ► pp. 65 ff.

Bashlovkina, Vasilisa, Riley Matthews, Zhaobin Kuang, Simon Baumgartner & Michael Bendersky

2023. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ► pp. 3737 ff.

Benko, Vladimír

2014. Aranea: Yet Another Family of (Comparable) Web Corpora. In Text, Speech and Dialogue [Lecture Notes in Computer Science, 8655], ► pp. 247 ff.

Bentum, M., L. ten Bosch, A. van den Bosch & M. Ernestus

2022. Speech register influences listeners’ word expectations. Brain and Language 235 ► pp. 105197 ff.

Bentum, Martijn, Louis ten Bosch, Antal van den Bosch & Mirjam Ernestus

2019. Do speech registers differ in the predictability of words?. International Journal of Corpus Linguistics 24:1 ► pp. 98 ff.

Bernardini, Silvia & Adriano Ferraresi

2011. Practice, Description and Theory Come Together – Normalization or Interference in Italian Technical Translation?. Meta 56:2 ► pp. 226 ff.

Biemann, Chris, Gerhard Heyer & Uwe Quasthoff

2022. Sprachstatistik. In Wissensrohstoff Text, ► pp. 177 ff.

Billot, Jennie & Virginia King

2017. The missing measure? Academic identity and the induction process. Higher Education Research & Development 36:3 ► pp. 612 ff.

Bittar, André, Sumithra Velupillai, Angus Roberts & Rina Dutta

2021. Using General-purpose Sentiment Lexicons for Suicide Risk Assessment in Electronic Health Records: Corpus-Based Analysis. JMIR Medical Informatics 9:4 ► pp. e22397 ff.

Bradlow, Eric

2010. Automated Marketing Research Using Online Customer Reviews. SSRN Electronic Journal

Brezina, Vaclav & Miriam Meyerhoff

2014. Significant or random?. International Journal of Corpus Linguistics 19:1 ► pp. 1 ff.

Brooks, Penelope A. & Atif M. Memon

2009. 2009 IEEE International Conference on Software Maintenance, ► pp. 243 ff.

Brown, David West

2018. English and Empire,

Brunzel, Marko & Myra Spiliopoulou

2007. Domain Relevance on Term Weighting. In Natural Language Processing and Information Systems [Lecture Notes in Computer Science, 4592], ► pp. 427 ff.

Buckley, Kevin & Carl Vogel

2019. Using character N-grams to explore diachronic change in medieval English. Folia Linguistica 53:s40-s2 ► pp. 249 ff.

Buhler, Thomas & Richard Stephenson

2021. Is Local Planning Really ‘Local’? National Planning Context as a Determining Factor for Local Discourse. Planning Theory & Practice 22:2 ► pp. 227 ff.

Buzila, Eduard

2018. Kritische Diskursanalyse: Was ist das und warum ist sie kritisch? (Critical Discourse Analysis: What Is It Useful for and Why Is It Critical?). SSRN Electronic Journal

Carpuat, Marine, Pascale Fung & Grace Ngai

2006. Aligning word senses using bilingual corpora. ACM Transactions on Asian Language Information Processing 5:2 ► pp. 89 ff.

Carter, Pelham, Matt Gee, Hollie McIlhone, Harkeeret Lally & Robert Lawson

2021. Comparing manual and computational approaches to theme identification in online forums: A case study of a sex work special interest community. Methods in Psychology 5 ► pp. 100065 ff.

Chatzitheodorou, Konstantinos & Vassilios Kappatos

2020. Terminology study on vibration-based condition monitoring technique. Vibroengineering PROCEDIA 34 ► pp. 20 ff.

Chatzitheodorou, Konstantinos & Vassilios Kappatos

2021. Hybrid extraction of multi-word terms: an application on vibration-based condition monitoring technique. Mathematical Models in Engineering 7:1 ► pp. 1 ff.

Chen, Xiao Xiao, Shi Li Ge & Min Lin

2011. An Analysis of Statistical Techniques Applying to Multi-Feature Similarity Comparison between Corpora. Applied Mechanics and Materials 66-68 ► pp. 2323 ff.

Cislaru, Georgeta & Frédérique Sitri

2012. De l'émergence à l'impact social des discours : hétérogénéités d'un corpus. Langages n° 187:3 ► pp. 59 ff.

Conrad, Susan M. & Kimberly R. LeVelle

2008. Corpus Linguistics and Second Language Instruction. In The Handbook of Educational Linguistics, ► pp. 539 ff.

Cook, Paul & Laurel J. Brinton

2017. Building and evaluating web corpora representing national varieties of English. Language Resources and Evaluation 51:3 ► pp. 643 ff.

Crawford, Lynn, Julien Pollack & David England

2006. Uncovering the trends in project management: Journal emphases over the last 10 years. International Journal of Project Management 24:2 ► pp. 175 ff.

Crawford, Lynn, Julien Pollack & David England

2007. How Standard Are Standards: An Examination of Language Emphasis in Project Management Standards. Project Management Journal 38:3 ► pp. 6 ff.

Curtotti, Michael & Eric McCreath

2011. A Corpus of Australian Contract Language: Description, Profiling and Analysis. SSRN Electronic Journal

Cvrček, Václav, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina & Vladimír Benko

2020. Comparing web-crawled and traditional corpora. Language Resources and Evaluation 54:3 ► pp. 713 ff.

Dag, J.N., V. Gervasi, S. Brinkkemper & B. Regnell

2004. Proceedings. 12th IEEE International Requirements Engineering Conference, 2004., ► pp. 265 ff.

Davalos, Sergio & Ehsan H. Feroz

2022. A textual analysis of the US Securities and Exchange Commission's accounting and auditing enforcement releases relating to the Sarbanes–Oxley Act. Intelligent Systems in Accounting, Finance and Management 29:1 ► pp. 19 ff.

De Groot, Elizabeth, Catherine Nickerson, Hubert Korzilius & Marinel Gerritsen

2016. Picture This. Journal of Business and Technical Communication 30:2 ► pp. 165 ff.

Desagulier, Guillaume

2014. Visualizing distances in a set of near-synonyms. In Corpus Methods for Semantics [Human Cognitive Processing, 43], ► pp. 145 ff.

Desagulier, Guillaume

2016. A lesson from associative learning: asymmetry and productivity in multiple-slot constructions. Corpus Linguistics and Linguistic Theory 12:2

Desagulier, Guillaume

2017. Association and Productivity. In Corpus Linguistics and Statistics with R [Quantitative Methods in the Humanities and Social Sciences, ], ► pp. 197 ff.

Dias, Gaël & Špela Vintar

2005. Unsupervised Learning of Multiword Units from Part-of-Speech Tagged Corpora: Does Quantity Mean Quality?. In Progress in Artificial Intelligence [Lecture Notes in Computer Science, 3808], ► pp. 669 ff.

Diaz, Brett A.

2022. Finding social (mis)alignment in older adult and opioid health policy implementation with corpus-assisted discourse analysis. Applied Corpus Linguistics 2:2 ► pp. 100020 ff.

Dunn, Jonathan

2020. Mapping languages: the Corpus of Global Language Use. Language Resources and Evaluation 54:4 ► pp. 999 ff.

Dunn, Jonathan

2022. Natural Language Processing for Corpus Linguistics,

Dyevre, Arthur & Nicolas Lampach

2021. Issue attention on international courts: Evidence from the European Court of Justice. The Review of International Organizations 16:4 ► pp. 793 ff.

Eiteljoerge, Sarah F. V., Nausicaa Pouscoulous & Elena V. M. Lieven

2018. Some Pieces Are Missing: Implicature Production in Children. Frontiers in Psychology 9

Evans, Roger, Alexander Gelbukh, Gregory Grefenstette, Patrick Hanks, Miloš Jakubíček, Diana McCarthy, Martha Palmer, Ted Pedersen, Michael Rundell, Pavel Rychlý, Serge Sharoff & David Tugwell

2018. Adam Kilgarriff’s Legacy to Computational Linguistics and Beyond. In Computational Linguistics and Intelligent Text Processing [Lecture Notes in Computer Science, 9623], ► pp. 3 ff.

Ferrario, Beatrice & Stefanie Stantcheva

2022. Eliciting People's First-Order Concerns: Text Analysis of Open-Ended Survey Questions. SSRN Electronic Journal

Fitz, Hartmut & Franklin Chang

2017. Meaningful questions: The acquisition of auxiliary inversion in a connectionist model of sentence production. Cognition 166 ► pp. 225 ff.

Fredner, Erik

2022. A Meaning Apart from Its Indistinguishable Words. Nathaniel Hawthorne Review 48:1 ► pp. 82 ff.

Gablasova, Dana, Vaclav Brezina & Tony McEnery

2017. Exploring Learner Language Through Corpora: Comparing and Interpreting Corpus Frequency Information. Language Learning 67:S1 ► pp. 130 ff.

Gabrilovich, Evgeniy, Susan Dumais & Eric Horvitz

2004. Proceedings of the 13th international conference on World Wide Web, ► pp. 482 ff.

García Salido, Marcos & Marcos Garcia

2018. Comparing learners’ and native speakers’ use of collocations in written Spanish. International Review of Applied Linguistics in Language Teaching 56:4 ► pp. 401 ff.

Ghazzawi, Nizar, Benoît Robichaud, Patrick Drouin & Fatiha Sadat

2017. Automatic extraction of specialized verbal units. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 23:2 ► pp. 207 ff.

Gries, Stefan Th.

2006. Exploring variability within and between corpora: some methodological considerations. Corpora 1:2 ► pp. 109 ff.

Grimm, Stephan, Andreas Abecker, Johanna Völker & Rudi Studer

2011. Ontologies and the Semantic Web. In Handbook of Semantic Web Technologies, ► pp. 507 ff.

Hafsa, Fatima, Nicole Darnall & Stuart Bretschneider

2022. Social Public Purchasing: Addressing a Critical Void in Public Purchasing Research. Public Administration Review 82:5 ► pp. 818 ff.

Hahn, Udo & Joachim Wermter

2004. Tagging Medical Documents with High Accuracy. In PRICAI 2004: Trends in Artificial Intelligence [Lecture Notes in Computer Science, 3157], ► pp. 852 ff.

Hayles, Nathalie K.

2016. Œuvres citées. In Lire et penser en milieux numériques, ► pp. 391 ff.

Holtz, Peter, Emanuel Deutschmann & Henrik Dobewall

2017. Cross-Cultural Psychology and the Rise of Academic Capitalism: Linguistic Changes in CCR and JCCP Articles, 1970-2014. Journal of Cross-Cultural Psychology 48:9 ► pp. 1410 ff.

Horák, Aleš, Vít Baisa, Adam Rambousek & Vít Suchomel

2019. A New Approach for Semi-Automatic Building and Extending a Multilingual Terminology Thesaurus. International Journal on Artificial Intelligence Tools 28:02 ► pp. 1950008 ff.

Izbassarov, Tleusher & Cemil Turan

2022. 2022 International Conference on Smart Information Systems and Technologies (SIST), ► pp. 1 ff.

Jahić, Sead & Jernej Vičič

2023. Annotated Lexicon for Sentiment Analysis in the Bosnian Language. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave 11:2 ► pp. 59 ff.

Jamatia, Anupam, Björn Gambäck & Amitava Das

2018. Collecting and Annotating Indian Social Media Code-Mixed Corpora. In Computational Linguistics and Intelligent Text Processing [Lecture Notes in Computer Science, 9624], ► pp. 406 ff.

Jockers, Matthew L. & Ted Underwood

2015. Text‐Mining the Humanities. In A New Companion to Digital Humanities, ► pp. 291 ff.

Jordanous, Anna

2012. A Standardised Procedure for Evaluating Creative Systems: Computational Creativity Evaluation Based on What it is to be Creative. Cognitive Computation 4:3 ► pp. 246 ff.

Jordanous, Anna, Bill Keller & Peter Csermely

2016. Modelling Creativity: Identifying Key Components through a Corpus-Based Approach. PLOS ONE 11:10 ► pp. e0162959 ff.

Jung, Boo Kyung

2022. The nature of L2 input. Korean Linguistics 18:2 ► pp. 182 ff.

Jung, Boo Kyung & Gyu-Ho Shin

2023. Use of locative postposition-verb construction in Korean: analysis of L1-Korean corpora and L2-Korean textbooks. Corpora 18:1 ► pp. 15 ff.

Justen, Lennart, Kilian Muller, Marco Niemann & Jorg Becker

2022. 2022 IEEE 24th Conference on Business Informatics (CBI), ► pp. 40 ff.

Kettunen, Kimmo

2020. How to Do Lexical Quality Estimation of a Large OCRed Historical Finnish Newspaper Collection with Scarce Resources. Digital Studies/Le champ numérique 10:1

Kettunen, Kimmo & Matti La Mela

2021. Semantic tagging and the Nordic tradition of everyman’s rights. Digital Scholarship in the Humanities

Kilgarriff, Adam

2012. Review of Paquot (2010): Academic Vocabulary in Learner Writing: From Extraction to Analysis. International Journal of Corpus Linguistics 17:1 ► pp. 125 ff.

Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý & Vít Suchomel

2014. The Sketch Engine: ten years on. Lexicography 1:1 ► pp. 7 ff.

Kilgarriff, Adam & Gregory Grefenstette

2003. Introduction to the Special Issue on the Web as Corpus. Computational Linguistics 29:3 ► pp. 333 ff.

Koplenig, Alexander

2015. The Impact of Lacking Metadata for the Measurement of Cultural and Linguistic Change Using the Google Ngram Data Sets—Reconstructing the Composition of the German Corpus in Times of WWII. Digital Scholarship in the Humanities ► pp. fqv037 ff.

Koplenig, Alexander

2015. Why the quantitative analysis of diachronic corpora that does not consider the temporal aspect of time-series can lead to wrong conclusions. Digital Scholarship in the Humanities ► pp. fqv030 ff.

Koplenig, Alexander

2017. A Data-Driven Method to Identify (Correlated) Changes in Chronological Corpora. Journal of Quantitative Linguistics 24:4 ► pp. 289 ff.

Koplenig, Alexander

2018. Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes – a large-scale corpus analysis. Corpus Linguistics and Linguistic Theory 14:1 ► pp. 1 ff.

Kuosmanen, Sonja

2021. Terms of reference and objectivity in US press reports in the Gulf War in 1990. Journalism 22:8 ► pp. 2053 ff.

La, Hanh Luong & Rudi Bekkers

2021. Science and Technology Relatedness: The Case of DNA Nanoscience and DNA Nanotechnology. In Innovation, Catch-up and Sustainable Development [Economic Complexity and Evolution, ], ► pp. 29 ff.

Lam, Phoenix

2009. The making of a BNC customised spoken corpus for comparative purposes. Corpora 4:2 ► pp. 167 ff.

Lebedeva, Maria, Tatyana Veselovskaya, Olga Kupreshchenko & Antonina Laposhina

2021. Corpus-Based Evaluation of Textbook Content: A Case of Russian Language Primary School Textbooks for Migrants. In Facing Diversity in Child Foreign Language Education [Second Language Learning and Teaching, ], ► pp. 215 ff.

Lee, Thomas

2007. Constraint-based Ontology Induction from Online Customer Reviews. Group Decision and Negotiation 16:3 ► pp. 255 ff.

Lefer, Marie-Aude

2012. Word-formation in translated language: The impact of language-pair specific features and genre variation. Across Languages and Cultures 13:2 ► pp. 145 ff.

Li, Bo, Eric Gaussier & Dan Yang

2018. Measuring bilingual corpus comparability. Natural Language Engineering 24:4 ► pp. 523 ff.

Li, Haipeng & Jonathan Dunn

2022. Corpus similarity measures remain robust across diverse languages. Lingua 275 ► pp. 103377 ff.

Li, Haipeng, Jonathan Dunn & Andrea Nini

2023. Register variation remains stable across 60 languages. Corpus Linguistics and Linguistic Theory 19:3 ► pp. 397 ff.

Li, Paul Luo, Connie Yang, Sophia Liu & Mary Hu

2019. 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), ► pp. 347 ff.

Li, Yi

2022. Review of Ji & Oakes (2021): Corpus Exploration of Lexis and Discourse in Translation. Translation and Translanguaging in Multilingual Contexts 8:2 ► pp. 206 ff.

Lien, Hsin-Yi

2022. Revisiting Keyword Analysis in a Specialized Corpus: Religious Terminology Extraction. Journal of Quantitative Linguistics 29:3 ► pp. 269 ff.

Lindemann, David & Iñaki San Vicente

2015. Building Corpus-based Frequency Lemma Lists. Procedia - Social and Behavioral Sciences 198 ► pp. 266 ff.

Lippincott, Thomas, Diarmuid Ó Séaghdha & Anna Korhonen

2011. Exploring subdomain variation in biomedical language. BMC Bioinformatics 12:1

Liu, Xiao-Yue & Chunyu Kit

2009. 2009 International Conference on Machine Learning and Cybernetics, ► pp. 3499 ff.

Liu, Yong, Pavel Dmitriev, Yifei Huang, Andrew Brooks, Li Dong, Mengyue Liang, Zvi Boshernitzan, Jiwei Cao & Bobby Nguy

2022. Transfer learning meets sales engagement email classification: Evaluation, analysis, and strategies. Concurrency and Computation: Practice and Experience 34:8

Messerli, Thomas C. & Miriam A. Locher

2024. Responding to subtitled K-drama: Artefact-orientation in timed comments. Discourse, Context & Media 58 ► pp. 100756 ff.

Mishne, Gilad, David Carmel, Ron Hoory, Alexey Roytman & Aya Soffer

2005. Proceedings of the 14th ACM international conference on Information and knowledge management, ► pp. 453 ff.

Mitchell, Andrew S.

2020. Mode-2 Knowledge Production within Community-Based Sustainability Projects: Applying Textual and Thematic Analytics to Action Research Conversations. Administrative Sciences 10:4 ► pp. 90 ff.

Moadel‐Attie, Roxanne, Sheri R. Levy, Bonita London & Rami Al‐Rfou

2018. Evolution of Social Identity Terms in Lay and Academic Sources: Implications for Research and Public Policy. Analyses of Social Issues and Public Policy 18:1 ► pp. 323 ff.

Mori, Laura

2018. Chapter 1. Introduction. In Observing Eurolects [Studies in Corpus Linguistics, 86], ► pp. 1 ff.

Nakamata, Naoki

2019. Vocabulary depends on topic, and so does grammar. Journal of Japanese Linguistics 35:2 ► pp. 213 ff.

Natt och Dag, Johan & Vincenzo Gervasi

2005. Managing Large Repositories of Natural Language Requirements. In Engineering and Managing Software Requirements, ► pp. 219 ff.

Neumann, Stella & Silvia Hansen-Schirra

2013. Exploiting the Incomparability of Comparable Corpora for Contrastive Linguistics and Translation Studies. In Building and Using Comparable Corpora, ► pp. 321 ff.

Nishino, Ryutaro & Kayoko Nohara

2013. Characteristics of UI English: From Non-native’s Viewpoint. In Cross-Cultural Design. Methods, Practice, and Case Studies [Lecture Notes in Computer Science, 8023], ► pp. 323 ff.

O'Boyle, Aisling

2014. ‘You’ and ‘I’ in university seminars and spoken learner discourse. Journal of English for Academic Purposes 16 ► pp. 40 ff.

Padó, Sebastian & Mirella Lapata

2007. Dependency-Based Construction of Semantic Space Models. Computational Linguistics 33:2 ► pp. 161 ff.

Pal, Samiran, Avinash Singh, Soham Datta, Sangameshwar Patil, Indrajit Bhattacharya & Girish Palshikar

2021. Semantic Templates for Generating Long-Form Technical Questions. In Text, Speech, and Dialogue [Lecture Notes in Computer Science, 12848], ► pp. 235 ff.

Palander-Collin, Minna & Minna Nevala

2020. Person reference and democratization in British English. Language Sciences 79 ► pp. 101265 ff.

Panunzi, A., M. Fabbri & M. Moneglia

2005. First International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution (AXMEDIS'05), ► pp. 253 ff.

Paquot, Magali & Luke Plonsky

2017. Quantitative research methods and study quality in learner corpus research. International Journal of Learner Corpus Research 3:1 ► pp. 61 ff.

Paulett, John M. & Curtis P. Langlotz

2009. Improving language models for radiology speech recognition. Journal of Biomedical Informatics 42:1 ► pp. 53 ff.

Peirsman, Yves, Dirk Geeraerts & Dirk Speelman

2010. The automatic identification of lexical variation between language varieties. Natural Language Engineering 16:4 ► pp. 469 ff.

Pinnis, Mārcis, Nikola Ljubešić, Dan Ştefănescu, Inguna Skadiņa, Marko Tadić, Tatjana Gornostaja, Špela Vintar & Darja Fišer

2019. Extracting Data from Comparable Corpora. In Using Comparable Corpora for Under-Resourced Areas of Machine Translation [Theory and Applications of Natural Language Processing, ], ► pp. 89 ff.

Pojanapunya, Punjaporn & Richard Watson Todd

2018. Log-likelihood and odds ratio: Keyness statistics for different purposes of keyword analysis. Corpus Linguistics and Linguistic Theory 14:1 ► pp. 133 ff.

Ponomareva, Natalia & Mike Thelwall

2012. Biographies or Blenders: Which Resource Is Best for Cross-Domain Sentiment Analysis?. In Computational Linguistics and Intelligent Text Processing [Lecture Notes in Computer Science, 7181], ► pp. 488 ff.

Rastelli, Stefano & Akira Murakami

2022. Apparently identical verbs can be represented differently: comparing L1–L2 inflection with contingency-based measure ΔP. Corpora 17:1 ► pp. 97 ff.

Rawson, Caleb, Brady J. Twedt & Jessica C. Watkins

2023. Managers’ Strategic Use of Concurrent Disclosure: Evidence from 8-K Filings and Press Releases. The Accounting Review 98:4 ► pp. 345 ff.

Remus, Robert

2012. 2012 IEEE 12th International Conference on Data Mining Workshops, ► pp. 717 ff.

Rietveld, Toni, Roeland Van hout & Mirjam Ernestus

2004. Pitfalls in Corpus Research. Computers and the Humanities 38:4 ► pp. 343 ff.

Rodríguez-Puente, Paula

2019. The English Phrasal Verb, 1650–Present,

Rodríguez-Puente, Paula

2019. Chapter 8. Interpersonality in legal written discourse. In Corpus-based Research on Variation in English Legal Discourse [Studies in Corpus Linguistics, 91], ► pp. 171 ff.

Roy, Jean-Hugues

2021. Kittens and Jesus : What would remain in a newsless Facebook?. SSRN Electronic Journal

Salido, Marcos García, Paula Lorente & Almudena Basanta

2019. Las construcciones con verbos de apoyo del español en la producción escrita de aprendices francófonos. Journal of Spanish Language Teaching 6:1 ► pp. 32 ff.

Santini, Marina, Arne Jönsson, Wiktor Strandqvist, Gustav Cederblad, Mikael Nyström, Marjan Alirezaie, Leili Lind, Eva Blomqvist, Maria Lindén & Annica Kristoffersson

2019. Designing an Extensible Domain-Specific Web Corpus for “Layfication”. In Cyber-Physical Systems for Social Applications [Advances in Systems Analysis, Software Engineering, and High Performance Computing, ], ► pp. 98 ff.

Santini, Marina, Wiktor Strandqvist, Mikael Nyström, Marjan Alirezai & Arne Jönsson

2018. Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora. In Database and Expert Systems Applications [Communications in Computer and Information Science, 903], ► pp. 207 ff.

Savage, Saiph, Andres Monroy-Hernandez & Tobias Höllerer

2016. Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, ► pp. 813 ff.

Savoy, Jacques

2010. Lexical Analysis of US Political Speeches. Journal of Quantitative Linguistics 17:2 ► pp. 123 ff.

Schröter, Julian, Keli Du, Julia Dudar, Cora Rok & Christof Schöch

2021. From Keyness to Distinctiveness – Triangulation and Evaluation in Computational Literary Studies. Journal of Literary Theory 15:1-2 ► pp. 81 ff.

Shaikina, Alevtina A. & Anastasia A. Funkner

2020. Medical Corpora Comparison Using Topic Modeling. Procedia Computer Science 178 ► pp. 244 ff.

Sharma, Raksha, Dibyendu Mondal & Pushpak Bhattacharyya

2018. A Comparison Among Significance Tests and Other Feature Building Methods for Sentiment Analysis: A First Study. In Computational Linguistics and Intelligent Text Processing [Lecture Notes in Computer Science, 10762], ► pp. 3 ff.

Sharon, Aviv J. & Ayelet Baram-Tsabari

2014. Measuring mumbo jumbo: A preliminary quantification of the use of jargon in science communication. Public Understanding of Science 23:5 ► pp. 528 ff.

Siegert, Ingo, Ronald Böck & Andreas Wendemuth

2018. Using a PCA-based dataset similarity measure to improve cross-corpus emotion recognition. Computer Speech & Language 51 ► pp. 1 ff.

Sierra, Gerardo, Tonatiuh Hernández-García, Helena Gómez-Adorno, Gemma Bel-Enguix, David Pinto, Beatriz Beltrán & Vivek Singh

2022. A case study in authorship attribution: The Mondrigo1. Journal of Intelligent & Fuzzy Systems 42:5 ► pp. 4473 ff.

Siino, Marco, Elisa Di Nuovo, Ilenia Tinnirello & Marco La Cascia

2022. Fake News Spreaders Detection: Sometimes Attention Is Not All You Need. Information 13:9 ► pp. 426 ff.

Skadiņa, Inguna, Robert Gaizauskas, Andrejs Vasiļjevs & Monica Lestari Paramita

2019. Introduction. In Using Comparable Corpora for Under-Resourced Areas of Machine Translation [Theory and Applications of Natural Language Processing, ], ► pp. 1 ff.

Smetanin, Sergey

2022. Pulse of the Nation: Observable Subjective Well-Being in Russia Inferred from Social Network Odnoklassniki. Mathematics 10:16 ► pp. 2947 ff.

Smith, Catherine, Svenja Adolphs, Kevin Harvey & Louise Mullany

2014. Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus. Corpora 9:2 ► pp. 137 ff.

Sokolova, Marina, Mohak Shah & Stan Szpakowicz

2006. Comparative Analysis of Text Data in Successful Face-to-Face and Electronic Negotiations. Group Decision and Negotiation 15:2 ► pp. 127 ff.

Sokolova, Marina & Stan Szpakowicz

2005. Analysis and Classification of Strategies in Electronic Negotiations. In Advances in Artificial Intelligence [Lecture Notes in Computer Science, 3501], ► pp. 145 ff.

Staab, S.

2001. Human language technologies for knowledge management. IEEE Intelligent Systems 16:6 ► pp. 84 ff.

Stanislav, Petr, Jan Švec & Luboš Šmídl

2012. Unsupervised Synchronization of Hidden Subtitles with Audio Track Using Keyword Spotting Algorithm. In Text, Speech and Dialogue [Lecture Notes in Computer Science, 7499], ► pp. 422 ff.

Subašić, Ilija & Bettina Berendt

2011. Peddling or Creating? Investigating the Role of Twitter in News Reporting. In Advances in Information Retrieval [Lecture Notes in Computer Science, 6611], ► pp. 207 ff.

Supran, Geoffrey & Naomi Oreskes

2021. Rhetoric and frame analysis of ExxonMobil's climate change communications. One Earth 4:5 ► pp. 696 ff.

Swanson, Zane L., Edward Walker, Louise Miller & Richard Green

2018. Using the CPA Examination Blueprints to Enhance Learning Objectives for Accounting Courses. SSRN Electronic Journal

Sönning, Lukas

2023. Evaluation of keyness metrics: performance and reliability. Corpus Linguistics and Linguistic Theory 0:0

Tay, Dennis

2015. Metaphor in case study articles on Chinese university counseling service websites. Chinese Language and Discourse. An International and Interdisciplinary Journal 6:1 ► pp. 28 ff.

Taylor, Charlotte

2013. Searching for similarity using corpus-assisted discourse studies. Corpora 8:1 ► pp. 81 ff.

Tellez, Fernando Perez, David Pinto, John Cardiff & Paolo Rosso

2009. 2009 Eighth Mexican International Conference on Artificial Intelligence, ► pp. 97 ff.

Gries, Stefan Th. & Martin Hilpert

2008. The identification of stages in diachronic data: variability-based neighbour clustering. Corpora 3:1 ► pp. 59 ff.

Underwood, Ted, Michael L. Black, Loretta Auvil & Boris Capitanu

2013. 2013 IEEE International Conference on Big Data, ► pp. 95 ff.

Uribe, Diego

2014. LSA Based Approach to Domain Detection. In Human-Inspired Computing and Its Applications [Lecture Notes in Computer Science, 8856], ► pp. 62 ff.

Verhagen, Véronique, Maria Mos, Ad Backus & Joost Schilperoord

2018. Predictive language processing revealing usage-based variation. Language and Cognition 10:2 ► pp. 329 ff.

Voelkel, Svenja & Franziska Kretzschmar

2021. Introducing Linguistic Research,

Vogel, Carl & Gerard Lynch

2008. Computational Stylometry: Who’s in a Play?. In Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction [Lecture Notes in Computer Science, 5042], ► pp. 169 ff.

Vogel, Carl, Gerard Lynch & Jerom Janssen

2009. Universum Inference and Corpus Homogeneity. In Research and Development in Intelligent Systems XXV, ► pp. 367 ff.

Wilson Black, Joshua

2023. Creating specialized corpora from digitized historical newspaper archives. Digital Scholarship in the Humanities 38:2 ► pp. 779 ff.

Wirén, Mats

2014. Roland Schäfer & Felix Bildhauer, Web Corpus Construction (Synthesis Lectures on Human Language Technologies 22). Morgan & Claypool, 2013. Pp. xv + 129.. Nordic Journal of Linguistics 37:3 ► pp. 457 ff.

Woźniak, Michał, Agnieszka Wołos, Urszula Modrzyk, Rafał L. Górski, Jan Winkowski, Michał Bajczyk, Sara Szymkuć, Bartosz A. Grzybowski & Maciej Eder

2018. Linguistic measures of chemical diversity and the “keywords” of molecular collections. Scientific Reports 8:1

Ädel, Annelie

2020. Corpus Compilation. In A Practical Handbook of Corpus Linguistics, ► pp. 3 ff.

Šemelík, Martin

2016. Zu neuen Möglichkeiten der lexikographischen Erfassung von Wortbildungskonkurrenzen. Ge- vs. -werk korpuslinguistisch betrachtet. International Journal of Lexicography ► pp. ecw016 ff.

Švec, Jan, Jan Hoidekr, Daniel Soutner & Jan Vavruška

2011. Web Text Data Mining for Building Large Scale Language Modelling Corpus. In Text, Speech and Dialogue [Lecture Notes in Computer Science, 6836], ► pp. 356 ff.

Švec, Jan, Jan Lehečka, Pavel Ircing, Lucie Skorkovská, Aleš Pražák, Jan Vavruška, Petr Stanislav & Jan Hoidekr

2014. General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes. Language Resources and Evaluation 48:2 ► pp. 227 ff.

[no author supplied]

2013. Web Corpus Construction [Synthesis Lectures on Human Language Technologies, ],

[no author supplied]

2014. Educated Fiji English [Varieties of English Around the World, G47],

[no author supplied]

2015. Crime and Corpus [Linguistic Approaches to Literature, 20],

[no author supplied]

2018. Patterns of Change in 18th-century English [Advances in Historical Sociolinguistics, 8],

This list is based on CrossRef data as of 31 March 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.