Challenges in Corpus Linguistics

Rethinking corpus compilation and analysis

Editors

Mark Kaunisto | Tampere University

Marco Schilk | University of Hildesheim

Hardbound – Available

ISBN 9789027215888 | EUR 115.00 | USD 149.00

e-Book –

ISBN 9789027246530 | EUR 115.00 | USD 149.00

This book contributes to the discussion of challenges faced in different areas of corpus linguistics, namely the compilation, annotation, and analysis of linguistic corpora. In a field of growing corpus sizes and expanding possibilities for gathering data, some old issues persist while new problems have emerged. As the compilation and study of language corpora become increasingly sophisticated and complex, continuous attention is needed to ways of dealing with the data in question and to challenges in text selection and interpretation. The contributions to this volume address problems relating to a variety of areas in corpus linguistic study, including corpus annotation, data variability, learner language, social media texts, and database utilization. The authors provide critical overviews and research-based analyses, discuss the nature of some of the common pitfalls, and offer solutions to existing problems.

[Studies in Corpus Linguistics, 118] 2024. vii, 172 pp.

Publishing status: Available
Published online on 6 September 2024

© John Benjamins

https://doi.org/10.1075/scl.118

Table of Contents

Acknowledgements | pp. vii–viii

From fallacies and pitfalls to solutions and future directions: Navigating the evolving terrain of corpus linguistics

Mark Kaunisto | pp. 1–8

Engaging with bad (meta)data in historical corpus linguistics

Turo Vartiainen and Tanja Säily | pp. 9–34

Named entities as potentially problematic items in corpora

Mark Kaunisto | pp. 35–54

Challenges in the compilation, annotation, and analysis of learner corpus data

Marcus Callies | pp. 55–67

Early newspapers as data for corpus linguistics (and Digital Humanities): Issues in using the British Library Newspapers database as a corpus

Turo Hiltunen | pp. 68–88

Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices

Stefan Hartmann | pp. 89–105

Text length and short texts: An overview of the problem

Aatu Liimatta | pp. 106–125

Corpus genre categories: Issues at the intersection of linguistics and literature

Daniel Ocic Ihrmark | pp. 126–141

Modeling fine-grained sociolinguistic variation: The promises and pitfalls of Twitter corpora and neural word embeddings

Filip Miletić, Anne Przewozny-Desriaux and Ludovic Tanguy | pp. 142–170

Subject index | pp. 171–172

Subjects

Linguistics

Theoretical linguistics

Corpus linguistics

Applied linguistics

Computational & corpus linguistics

Main BIC Subject

CFX: Computational linguistics

Main BISAC Subject

LAN009000: LANGUAGE ARTS & DISCIPLINES / Linguistics / General

U.S. Library of Congress Control Number: 2024029608