Vol. 28:4 (2023) ► pp.500–527
A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora
We propose a method for the automatic induction of categories of Spanish discourse markers using parallel corpora, based on a quantitative and empirical approach that minimises explicit linguistic knowledge. We conducted the analysis using a large Spanish-English parallel corpus. First, we used this corpus to obtain a list of parenthetical discourse markers in each language. Then, we used it as a “semantic mirror”, inspecting the English equivalents and assessing which Spanish discourse markers fulfil a similar function in discourse, and vice versa. The result of this procedure is an emerging categorisation of discourse markers. The main contribution is to offer empirical evidence for the adequacy of existing manually compiled taxonomies and to show the potential for discovering new, previously unaccounted-for categories. In this article we focus on units pertaining to the Spanish language but, since the method is purely quantitative, it can be applied to other languages as well.
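The “semantic mirror” idea can be illustrated with a minimal sketch: Spanish markers that share English equivalents are grouped together. Everything in it is an assumption made for illustration only: the toy alignments, the function names (`equivalents`, `jaccard`, `group_markers`), the Jaccard overlap measure and the 0.3 threshold are hypothetical and are not taken from the article, whose actual clustering procedure is described in its Methodology section.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy alignments (Spanish marker, English marker).
# These are illustrative only and are NOT data from the article.
ALIGNED_PAIRS = [
    ("sin embargo", "however"),
    ("sin embargo", "nevertheless"),
    ("no obstante", "however"),
    ("no obstante", "nevertheless"),
    ("por lo tanto", "therefore"),
    ("por lo tanto", "thus"),
    ("por consiguiente", "therefore"),
    ("por consiguiente", "consequently"),
]


def equivalents(pairs):
    """Map each Spanish marker to the set of English markers aligned with it."""
    mirror = defaultdict(set)
    for es, en in pairs:
        mirror[es].add(en)
    return dict(mirror)


def jaccard(a, b):
    """Overlap between two sets of equivalents (assumed similarity measure)."""
    return len(a & b) / len(a | b)


def group_markers(pairs, threshold=0.3):
    """Group Spanish markers whose English equivalents overlap enough.

    Uses simple single-link grouping via union-find; the threshold is an
    arbitrary illustrative value, not one reported in the article.
    """
    mirror = equivalents(pairs)
    parent = {marker: marker for marker in mirror}

    def find(marker):
        # Follow parent links to the representative, compressing the path.
        while parent[marker] != marker:
            parent[marker] = parent[parent[marker]]
            marker = parent[marker]
        return marker

    for a, b in combinations(sorted(mirror), 2):
        if jaccard(mirror[a], mirror[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for marker in mirror:
        groups[find(marker)].add(marker)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for group in group_markers(ALIGNED_PAIRS):
        print(group)
```

With these toy pairs, the counter-argumentative markers and the consecutive markers end up in separate groups because they share no English equivalents. The same logic could in principle be run in the opposite direction, grouping English markers by their Spanish equivalents.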
Article outline
- 1. Introduction
- 2. Discourse markers: Characteristics, categories and empirical studies
- 2.1 General characteristics of DMs
- 2.2 Previous attempts at the automatic categorisation of DMs
- 2.3 Studies on DMs using parallel corpora
- 3. Methodology
- 3.1 Materials
- 3.2 Operational definition of DMs and extraction of first lists of candidates
- 3.3 Obtaining a bilingual lexicon of DMs
- 3.4 Clustering method
- 3.5 Merging similar clusters
- 4. Results
- 4.1 Raw lists of DM candidates in each language
- 4.2 Bilingual alignment of DMs
- 4.3 Clustering results
- 5. Conclusions and future work
- References