The corpus-based identification of cross-lectal synonyms in pluricentric languages

Peirsman, Yves; Geeraerts, Dirk; Speelman, Dirk

doi:10.1075/ijcl.20.1.03pei

Article published in:

International Journal of Corpus Linguistics
Vol. 20:1 (2015), pp. 54–80

Yves Peirsman | University of Leuven

Dirk Geeraerts

Dirk Speelman

This article discusses a corpus-based method for the automatic identification of synonyms across different varieties of the same language. This method, based on the paradigm of distributional semantics, quantifies semantic similarity on the basis of contextual similarity in two comparable corpora. In two case studies for Dutch and German, we show that it automatically identifies the correct synonym for 31% and 25% of the target words, respectively. A manual error analysis moreover indicates that many additional synonyms are very close in the distributional model, while most other distributional neighbours are semantically related to the target word along other dimensions than synonymy. On the basis of these results, we argue that distributional-semantic methods can play a crucial role in the further evolution of corpus-based lexical semantics to a more quantitative discipline.
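The general idea described above, ranking words of one variety by how similar their contexts are to those of a target word from the other variety, can be illustrated with a minimal sketch. This is not the authors' implementation: the two toy corpora are invented English examples, the function names are hypothetical, and raw co-occurrence counts with cosine similarity are assumed here because the abstract does not specify the weighting or similarity measure. The sketch also assumes that most context words are shared between the two varieties, so that the vectors can be compared directly.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def nearest_cross_lectal(target, source_vectors, other_vectors, top_n=3):
    """Rank words of the other variety by contextual similarity to `target`."""
    u = source_vectors[target]
    ranked = sorted(
        # Skip the target itself: we are looking for a different word form.
        ((cosine(u, v), w) for w, v in other_vectors.items() if w != target),
        reverse=True,
    )
    return [(w, round(s, 3)) for s, w in ranked[:top_n]]


# Invented toy corpora standing in for two comparable corpora (assumption).
corpus_a = [
    "the lorry carried heavy cargo on the motorway",
    "a lorry driver delivered the cargo",
    "the driver parked the lorry near the depot",
]
corpus_b = [
    "the truck carried heavy cargo on the highway",
    "a truck driver delivered the cargo",
    "the driver parked the truck near the depot",
    "the depot stored the cargo overnight",
]

vectors_a = context_vectors(corpus_a)
vectors_b = context_vectors(corpus_b)
ranking = nearest_cross_lectal("lorry", vectors_a, vectors_b)
print(ranking)
```

In this constructed example, "truck" ranks first because it occurs in exactly the same contexts as "lorry". With real corpora the counts would typically be weighted and the nearest neighbours would also include related words that are not synonyms, which is the pattern the error analysis in the abstract reports.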

Keywords: distributional semantics, synonymy, lexical variation, pluricentric languages

Published online: 30 March 2015

Cited by (4)

Wu, Shuqiong & Yue Ou

2024. A quantitative study of the polysemy of Mandarin Chinese perception verb kàn ‘look/see’. Australian Journal of Linguistics, pp. 1 ff.

Wu, Shuqiong

2021. A corpus-based study of the Chinese synonymous approximatives shangxia, qianhou and zuoyou. Corpus Linguistics and Linguistic Theory 17:2, pp. 411 ff.

Drouin, Patrick

2017. Chapter 6. Should we be looking for the needle in the haystack or in the straw poll? In Multiple Perspectives on Terminological Variation [Terminology and Lexicography Research and Practice, 18], pp. 131 ff.

Ruette, Tom, Katharina Ehret & Benedikt Szmrecsanyi

2016. A lectometric analysis of aggregated lexical variation in written Standard English with Semantic Vector Space models. International Journal of Corpus Linguistics 21:1, pp. 48 ff.

This list is based on CrossRef data as of 4 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.