Multi-word discourse markers and their corpus-driven identification: The case of MWDM extraction from the reference corpus of spoken Slovene

Dobrovoljc, Kaja

doi:10.1075/ijcl.16127.dob

Article published in:

International Journal of Corpus Linguistics
Vol. 22:4 (2017), pp. 551–582

Kaja Dobrovoljc | University of Ljubljana | Trojina, Institute for Applied Slovene Studies

With expanding evidence on the formulaic nature of human communication, there is a growing need to extend discourse marker research to functionally analogous multi-word expressions (MWEs). In contrast to the common qualitative approaches to discourse marker identification in corpora, this paper presents a corpus-driven, semi-automatic approach to the identification of multi-word discourse markers (MWDMs) in the reference corpus of spoken Slovene. Using eight statistical measures, we identified 173 structurally fixed discourse-marking MWEs, distinguished by a high number of tokens, a large proportion of grammatical words and semantic heterogeneity. This is a significantly longer list than would have been gained by manual inspection of smaller corpus samples. Although frequency-based methods produced satisfactory results, the best precision in MWDM identification was achieved using the t-score association measure, while the overall poor performance of the mutual information measure suggests its inadequacy for the extraction of MWDMs and other MWEs with similar lexical and distributional features.

Keywords: discourse markers, multi-word units, collocation extraction, association measures, spoken corpora
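
The contrast between the t-score and mutual information reported in the abstract can be illustrated with a minimal Python sketch. It assumes the common textbook bigram formulations (expected frequency E = f(w1)·f(w2)/N, t-score = (O − E)/√O, MI = log2(O/E)); the exact formulas and n-gram generalisations used in the paper are not given in this summary, and the function name `bigram_scores` and the toy English token stream are hypothetical, not drawn from the GOS corpus.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (frequency, t_score, mi)} for adjacent word pairs.

    Assumed textbook formulas (not necessarily those of the paper):
      expected = f(w1) * f(w2) / N
      t_score  = (observed - expected) / sqrt(observed)
      mi       = log2(observed / expected)
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        expected = unigrams[w1] * unigrams[w2] / n
        t_score = (observed - expected) / math.sqrt(observed)
        mi = math.log2(observed / expected)
        scores[(w1, w2)] = (observed, t_score, mi)
    return scores


# Toy token stream (hypothetical data): a frequent marker-like bigram
# ("you know") next to a one-off pair of rare words ("rare gizmo").
tokens = ("you know it is fine you know we can go you know "
          "the rare gizmo broke").split()
scores = bigram_scores(tokens)

for pair in [("you", "know"), ("rare", "gizmo")]:
    freq, t, mi = scores[pair]
    print(pair, freq, round(t, 3), round(mi, 3))
```

On this toy input the t-score ranks the recurrent "you know" above the one-off "rare gizmo", whereas mutual information ranks them the other way round, because MI rewards pairs of rare words regardless of how often the pair itself recurs. This is one plausible reading of why MI would suit frequent, grammatical-word-heavy sequences such as MWDMs poorly.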

Article outline

1. Introduction
2. Multi-word discourse markers
- 2.1 Related research on discourse-marking multi-word expressions
- 2.2 Multi-word discourse markers in this study
3. Statistical methods for MWE identification in corpora
4. Aims, data and methodology
- 4.1 The GOS corpus
- 4.2 N-gram extraction
- 4.3 N-gram ranking
  - 4.3.1 Selected frequency-based measures
  - 4.3.2 Selected association-based measures
  - 4.3.3 Comparability of selected measures
- 4.4 MWDM identification
- 4.5 MWDM classification
5. Results
- 5.1 Features of identified MWDMs
- 5.2 Comparison of statistical methods
- 5.3 Comparison of statistical and manual methods
6. Discussion and conclusions
Acknowledgements
Notes
References

Published online: 1 December 2017


