Flexible multi-layer spoken dialogue corpora

Sauer, Simon; Lüdeling, Anke

doi:10.1075/ijcl.21.3.06sau

Article published in:

Compilation, transcription, markup and annotation of spoken corpora
Edited by John M. Kirk and Gisle Andersen
[International Journal of Corpus Linguistics 21:3] 2016
pp. 419–438

Simon Sauer | Humboldt-Universität zu Berlin

Anke Lüdeling

This paper describes the construction of deeply annotated spoken dialogue corpora. To ensure a maximum of flexibility — in the degree of normalization, the types and formats of annotations, the possibilities for modifying and extending the corpus, or the use for research questions not originally anticipated — we propose a flexible multi-layer standoff architecture. We also take a closer look at the interoperability of tools and formats compatible with such an architecture. Free access to the corpus data through corpus queries, visualizations, and downloads — including documentation, metadata, and the original recordings — enables transparency, verifiability, and reproducibility of every step of interpretation throughout corpus construction and of any research findings obtained from this data.
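The idea of a multi-layer standoff architecture can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the layer names (`dipl`, `norm`, `pos`), the file name, and the sample values are all hypothetical and chosen only for illustration. The sketch assumes that each annotation layer is stored separately from the recording and from every other layer, and that layers are aligned only through shared time offsets.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """One annotation value anchored to a time interval (in seconds)."""
    start: float
    end: float
    value: str


@dataclass
class Corpus:
    """Primary data (a recording) plus independent standoff layers.

    All names here are hypothetical; this is a sketch, not a real tool's API.
    """
    recording: str
    layers: dict[str, list[Span]] = field(default_factory=dict)

    def add_layer(self, name: str, spans: list[Span]) -> None:
        # Adding a layer never touches the recording or the existing layers.
        if name in self.layers:
            raise ValueError(f"layer {name!r} already exists")
        self.layers[name] = sorted(spans, key=lambda s: s.start)

    def overlapping(self, name: str, start: float, end: float) -> list[Span]:
        # Layers are related only through the shared timeline.
        return [s for s in self.layers.get(name, [])
                if s.start < end and s.end > start]


corpus = Corpus("dialogue_01.wav")  # hypothetical file name
corpus.add_layer("dipl", [Span(0.0, 0.4, "äh"), Span(0.4, 0.9, "links")])
corpus.add_layer("norm", [Span(0.4, 0.9, "links")])  # filler left unnormalized
corpus.add_layer("pos", [Span(0.4, 0.9, "ADV")])     # added later, independently
```

Because each layer only points into the timeline, a further layer (for example a different normalization or a new tagset) could be added or replaced later without rewriting the existing ones, which is the kind of flexibility the abstract refers to.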

Keywords: spoken corpora, annotation tools, annotation, standoff, multi-layer architecture

Published online: 29 September 2016
