Chapter 1
Building a comparable corpus of online discussions on Wikipedia
The EFG WikiCorpus
This chapter presents the EFG WikiCorpus, a corpus composed of all the talk pages dedicated to (co)writing an article in the English, French and German Wikipedias. It explains the place of talk pages in Wikipedia and describes the basic structure of a talk page, then details how the EFG WikiCorpus was built, from the Wikipedia archives to a TEI resource encoded according to the TEI CMC-core schema. It concludes with a quantitative overview of the EFG WikiCorpus and the EFG WikiDemoCorpus, a derived subcorpus used for qualitative analyses in various contributions of this volume.
Article outline
- 1. Introduction
- 2. Wikipedia talk pages: Wikipedia’s backstage
- 2.1 The main characteristics of Wikipedia talk pages
- 2.2 The basic structure of a Wikipedia talk page (tp)
- 2.3 Talk page encoding: The TEI CMC-core schema
- 3. Building the EFG WikiCorpus
- 3.1 Searching for relevant content in the Wikipedia archives and the wikiCode
- 3.2 Extracting talk pages and TEI encoding metadata
- 3.3 Parsing the wikiCode and TEI CMC-core encoding
- 3.3.1 The global content structure of a talk page
- 3.3.2 Structuring and encoding the threads into posts
- 3.3.3 Templates and special features
- 4. The resulting EFG WikiCorpus
- 4.1 Quantitative overview of the talk page content
- 4.2 Metadata overview and multilingual alignments
- 4.3 Brief linguistic overview
- 4.4 The EFG WikiDemoCorpus (WDC): A derived subcorpus for more qualitative analyses
- 5. Conclusion
- Notes
- References