Multi-word discourse markers and their corpus-driven identification
The case of MWDM extraction from the reference corpus of spoken Slovene
With expanding evidence on the formulaic nature of human communication, there is a growing need to extend discourse marker research to functionally analogous multi-word expressions. In contrast to the common qualitative approaches to discourse marker identification in corpora, this paper presents a corpus-driven semi-automatic approach to the identification of multi-word discourse markers (MWDMs) in the reference corpus of spoken Slovene. Using eight statistical measures, we identified 173 structurally fixed discourse-marking multi-word expressions (MWEs), distinguished by a high number of tokens, a large proportion of grammatical words and semantic heterogeneity. This is a significantly longer list than would have been obtained by manual inspection of smaller corpus samples. Although frequency-based methods produced satisfactory results, the best precision in MWDM identification was achieved using the t-score association measure, while the overall poor performance of mutual information suggests its inadequacy for the extraction of MWDMs and other MWEs with similar lexical and distributional features.
Keywords: discourse markers, multi-word units, collocation extraction, association measures, spoken corpora
Published online: 01 December 2017
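The contrast the abstract draws between the t-score and mutual information can be illustrated with the textbook bigram definitions of both measures: with observed frequency O and expected frequency E = f(w1)·f(w2)/N, the t-score is (O − E)/√O and mutual information is log2(O/E). The following is a minimal sketch under stated assumptions, not the extraction pipeline used in the paper: it is limited to adjacent word pairs, the function name `bigram_scores` is hypothetical, and the token stream is invented English toy data rather than material from the corpus of spoken Slovene.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (frequency, t_score, mi)} for adjacent token pairs.

    Textbook bigram formulas (an assumption; the paper's exact
    formulations, including those for longer n-grams, may differ):
        expected = f(w1) * f(w2) / N
        t_score  = (observed - expected) / sqrt(observed)
        mi       = log2(observed / expected)
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        expected = unigrams[w1] * unigrams[w2] / n
        t_score = (observed - expected) / math.sqrt(observed)
        mi = math.log2(observed / expected)
        scores[(w1, w2)] = (observed, t_score, mi)
    return scores


# Invented toy data: one recurrent marker-like pair among one-off pairs.
tokens = ("you know it was fine you know and then you know we left "
          "quite abruptly").split()
scores = bigram_scores(tokens)

# Top-ranked bigram under each measure.
by_t = max(scores, key=lambda b: scores[b][1])
by_mi = max(scores, key=lambda b: scores[b][2])
print("t-score:", by_t, scores[by_t])
print("MI:     ", by_mi, scores[by_mi])
```

On this toy input the t-score ranks the recurrent pair "you know" first, whereas mutual information gives its highest score to pairs of words that each occur only once. This sensitivity of mutual information to low-frequency items is well documented, and it is one plausible reason why the measure would perform poorly on frequent expressions made up of common grammatical words.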