Creating and using Web corpora

Thelwall, Mike

doi:10.1075/ijcl.10.4.07the

Article published in:

International Journal of Corpus Linguistics
Vol. 10:4 (2005), pp. 517–541

Mike Thelwall | University of Wolverhampton

The Web has recently been used as a corpus for linguistic investigations, often with the help of a commercial search engine. We discuss some potential problems with collecting data from commercial search engines and with using the Web as a corpus. We outline an alternative strategy for data collection, using a personal Web crawler. As a case study, the university Web sites of three nations (Australia, New Zealand and the UK) were crawled. The most frequent words were broadly consistent with non-Web written English, but some academic-related words appeared among the 50 most frequent. The university Web sites also contained a significant amount of non-English text, and academic Web English appears to be more future-oriented than the written English of the British National Corpus.
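The word-frequency step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the crawled pages are already available as HTML strings and uses only the Python standard library. The names `TextExtractor`, `page_text` and `top_words`, and the simple letters-only tokenizer, are hypothetical choices made for illustration. They are not the crawler or the counting procedure used in the article.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from a page, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_text(html):
    """Return the text content of one HTML page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def top_words(pages, n=50):
    """Return the n most frequent lower-cased word forms across all pages.

    Assumption: a word is a run of ASCII letters, optionally followed by an
    apostrophe and more letters. Real corpus work would need a tokenizer
    suited to the languages present in the crawl.
    """
    counts = Counter()
    for html in pages:
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", page_text(html).lower())
        counts.update(tokens)
    return counts.most_common(n)
```

A call such as `top_words(pages, 50)` returns (word, count) pairs that could then be set against a reference frequency list, in the spirit of the comparison with written English described above. Because the tokenizer only matches ASCII letters, non-English text on the crawled sites would be split or dropped, so a language filter would be needed before drawing conclusions from the counts.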

Keywords: academic language, web corpus, web

Published online: 7 November 2005

