This paper presents practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German, a large collection of spontaneous verbal interaction from diverse discourse domains. After introducing the aims and organisational circumstances of the construction of FOLK, the paper discusses the general idea that good practices cannot be developed without considering methodological, technological and organisational aspects on an equal footing. Starting from this idea, it inspects more closely some actual practices in FOLK, namely the handling of legal issues (especially privacy protection), the decisions taken for the transcription and annotation workflow, and the question of how best to disseminate a corpus like FOLK. The final section sketches some possible future improvements for practices in FOLK.