A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora

Robledo, Hernán; Nazar, Rogelio

doi:10.1075/ijcl.20017.rob

Article published in:

International Journal of Corpus Linguistics
Vol. 28:4 (2023), pp. 500–527

Hernán Robledo | Pontificia Universidad Católica de Valparaíso

Rogelio Nazar | Pontificia Universidad Católica de Valparaíso

We propose a method for the automatic induction of categories of Spanish discourse markers using parallel corpora, based on a quantitative and empirical approach that minimises reliance on explicit linguistic knowledge. We conducted the analysis using a large Spanish-English parallel corpus. First, we used this corpus to obtain a list of parenthetical discourse markers in each language. Then, we used it as a “semantic mirror”, inspecting the English equivalents of each Spanish discourse marker to assess which Spanish markers fulfil a similar function in discourse, and vice versa. The result of this procedure is an emergent categorisation of discourse markers. The main contribution is to offer empirical evidence for the adequacy of existing manually compiled taxonomies and to show the potential for discovering new, previously unaccounted-for categories. In this article we focus on Spanish but, since the method is purely quantitative, it can be applied to other languages as well.

Keywords: clustering, discourse markers, inductive methods, parallel corpus, Spanish
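
The “semantic mirror” idea summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' clustering procedure, which the abstract does not specify. It assumes a bilingual lexicon is already available and groups Spanish markers whose sets of English equivalents overlap. The toy lexicon, the Jaccard measure, the threshold value, and the names `LEXICON`, `jaccard` and `group_markers` are all assumptions made for illustration only.

```python
from itertools import combinations

# Toy bilingual lexicon, invented for illustration (not data from the article):
# each Spanish marker maps to the set of English equivalents observed for it.
LEXICON = {
    "sin embargo": {"however", "nevertheless"},
    "no obstante": {"however", "nevertheless", "nonetheless"},
    "por lo tanto": {"therefore", "thus"},
    "por consiguiente": {"therefore", "consequently"},
    "además": {"moreover", "furthermore"},
}


def jaccard(a, b):
    """Share of English equivalents that two markers have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_markers(lexicon, threshold=0.3):
    """Group markers linked, directly or transitively, by similar equivalents.

    Two markers are linked when their Jaccard similarity reaches `threshold`
    (a hypothetical cut-off); groups are the connected components.
    """
    parent = {marker: marker for marker in lexicon}

    def find(marker):
        # Follow parent links to the group representative, compressing the path.
        while parent[marker] != marker:
            parent[marker] = parent[parent[marker]]
            marker = parent[marker]
        return marker

    for a, b in combinations(lexicon, 2):
        if jaccard(lexicon[a], lexicon[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for marker in lexicon:
        groups.setdefault(find(marker), set()).add(marker)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for group in group_markers(LEXICON):
        print(group)
```

With this toy input, markers that share English equivalents end up in the same group, while a marker with no shared equivalents stays on its own. Raising the threshold requires more overlap before two markers are linked.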

Article outline

1. Introduction
2. Discourse markers: Characteristics, categories and empirical studies
- 2.1 General characteristics of DMs
- 2.2 Previous attempts at the automatic categorisation of DMs
- 2.3 Studies on DMs using parallel corpora
3. Methodology
- 3.1 Materials
- 3.2 Operational definition of DMs and extraction of first lists of candidates
- 3.3 Obtaining a bilingual lexicon of DMs
- 3.4 Clustering method
- 3.5 Merging similar clusters
4. Results
- 4.1 Raw lists of DM candidates in each language
- 4.2 Bilingual alignment of DMs
- 4.3 Clustering results
5. Conclusions and future work
References

Published online: 23 February 2023

https://doi.org/10.1075/ijcl.20017.rob
