Creating and using Web corpora
The Web has recently been used as a corpus for linguistic investigations, often with the help of a commercial search engine. We discuss some potential problems with collecting data from commercial search engines and with using the Web as a corpus. We outline an alternative strategy for data collection, using a personal Web crawler. As a case study, the university Web sites of three nations (Australia, New Zealand and the UK) were crawled. The most frequent words were broadly consistent with non-Web written English, but with some academic-related words among the 50 most frequent. It was also evident that the university Web sites contained a significant amount of non-English text, and that academic Web English seems to be more future-oriented than British National Corpus written English.
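The word-frequency comparison mentioned above can be illustrated with a minimal sketch. It assumes the crawled pages are already available as plain-text strings; the function name `top_words`, the tokenizing pattern and the two sample sentences are hypothetical and are not taken from the study, which may have tokenized and counted differently.

```python
import re
from collections import Counter

# Assumed tokenizer (not the study's): runs of lowercase letters,
# allowing internal apostrophes as in "don't".
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def top_words(page_texts, n=50):
    """Return the n most frequent word tokens across an iterable of page texts."""
    counts = Counter()
    for text in page_texts:
        # Lowercase first so "The" and "the" are counted as one word.
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts.most_common(n)


# Made-up sample pages standing in for crawled university Web pages.
pages = [
    "The university will offer new courses.",
    "Students of the university will enrol online.",
]
print(top_words(pages, n=3))
```

A ranked list produced this way could then be compared against a reference word list, for example one derived from the written part of the British National Corpus.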
Keywords: academic language, web corpus, web
Published online: 07 November 2005