A particular challenge in corpus compilation is how to transcribe orthographically those units that are not part of any standardised vocabulary. The problematic categories include voiced pauses, minimal response signals, interjections, certain discourse markers, phonologically reduced forms, colloquialisms and dialect forms. Such semi-lexical features are usually represented by regular phonemic-graphemic correspondences but are nevertheless often handled inconsistently. This paper reviews a number of existing transcription guidelines and assesses whether their recommendations are sufficient and detailed enough to secure a consistent transcription of the categories mentioned. Further, the paper assesses to what extent the transcription of semi-lexical features is consistent within and across two spoken corpora. On the basis of a cross-corpus comparison of the Bergen Corpus of London Teenage Language (COLT) and the London English Corpus (LEC), the paper provides specific recommendations for corpus transcription.