Publications

Publication details [#42343]

Thelwall, Mike. 2005. Creating and using Web corpora. International Journal of Corpus Linguistics 10(4): 517–541.
Publication type
Article in journal
Publication language
English
Language as a subject
Place, Publisher
John Benjamins
Journal DOI
10.1075/ijcl

Annotation

The Web has recently been used as a corpus for linguistic investigations, often with the help of a commercial search engine. We discuss some potential problems with collecting data from commercial search engines and with using the Web as a corpus. We outline an alternative strategy for data collection, using a personal Web crawler. As a case study, the university Web sites of three nations (Australia, New Zealand and the UK) were crawled. The most frequent words were broadly consistent with non-Web written English, but with some academic-related words amongst the top 50 most frequent. It was also evident that the university Web sites contained a significant amount of non-English text, and academic Web English seems to be more future-oriented than British National Corpus written English.