Edited by Timothy Colleman, Frank Brisard, Astrid De Wit, Renata Enghels, Nikos Koutsoukos, Tanja Mortelmans and María Sol Sansiñena
[Belgian Journal of Linguistics 34] 2020
pp. 66–78
This squib briefly explores how contextualized embeddings – a type of compressed, token-based semantic vector – can be used as semantic retrieval and annotation tools for corpus-based research into constructions. Focusing on embeddings created by the Bidirectional Encoder Representations from Transformers model, also known as ‘BERT’, it demonstrates how contextualized embeddings can help counter two types of retrieval inefficiency that may arise with purely form-based corpus queries. In the first scenario, the formal query yields a large number of hits, which contain a reasonable number of relevant examples that can be labeled and used as input for a sense disambiguation classifier. In the second scenario, the contextualized embeddings of exemplary tokens are used to retrieve more relevant examples from a large, unlabeled dataset. As a case study, the squib focuses on the into-interest construction (e.g. I’m so into you).
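The second retrieval scenario can be sketched in a few lines: rank the tokens returned by a broad form-based query by the cosine similarity between their contextualized embeddings and the embedding of a hand-picked exemplar of the target sense. The sketch below is illustrative only and uses small mock vectors; in an actual study, each vector would be the (768-dimensional) hidden-state representation of the token into in its sentence context, extracted from a model such as BERT (e.g. via the Hugging Face transformers library). The example sentences and vector values are invented for demonstration.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock stand-ins for contextualized token embeddings. In practice these
# would come from a BERT-type model's hidden states for the token 'into'.
exemplar = np.array([0.9, 0.1, 0.2])  # labeled into-interest token, e.g. "I'm so into you"

candidates = {
    "She walked into the room": np.array([0.1, 0.9, 0.3]),  # spatial 'into'
    "He's really into jazz":    np.array([0.8, 0.2, 0.1]),  # interest 'into'
    "Pour it into the glass":   np.array([0.2, 0.8, 0.4]),  # spatial 'into'
}

# Rank unlabeled corpus hits by similarity to the exemplar token, so that
# likely into-interest examples surface at the top of the annotation queue.
ranked = sorted(candidates,
                key=lambda s: cosine_sim(exemplar, candidates[s]),
                reverse=True)
print(ranked[0])  # → "He's really into jazz"
```

The same embeddings could feed the first scenario: once a batch of hits has been labeled, the vectors serve as input features for a sense disambiguation classifier (e.g. logistic regression) that filters the remaining hits automatically.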