Article published in:
The Wealth and Breadth of Construction-Based Research
Edited by Timothy Colleman, Frank Brisard, Astrid De Wit, Renata Enghels, Nikos Koutsoukos, Tanja Mortelmans and María Sol Sansiñena
[Belgian Journal of Linguistics 34] 2020
pp. 66–78
Let’s get into it
Using contextualized embeddings as retrieval tools
Lauren Fonteyn | Leiden University
This squib briefly explores how contextualized embeddings – a type of compressed, token-based semantic vector – can be used as semantic retrieval and annotation tools for corpus-based research into constructions. Focusing on embeddings created by the Bidirectional Encoder Representations from Transformers model, also known as ‘BERT’, this squib demonstrates how contextualized embeddings can help counter two retrieval inefficiency scenarios that may arise with purely form-based corpus queries. In the first scenario, the form-based query yields a large number of hits, which contain a reasonable number of relevant examples that can be labeled and used as input for a sense disambiguation classifier. In the second scenario, the contextualized embeddings of exemplary tokens are used to retrieve more relevant examples in a large, unlabeled dataset. As a case study, this squib focuses on the into-interest construction (e.g. I’m so into you).
Keywords: distributional semantics, BERT, corpus linguistics, data retrieval, prepositions
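To make the two scenarios in the abstract concrete, below is a minimal, illustrative sketch – not the author's original pipeline – of how contextualized BERT embeddings of *into* tokens could feed a sense disambiguation classifier (scenario 1) and an exemplar-based similarity ranking over unlabeled hits (scenario 2). It assumes the Hugging Face transformers and scikit-learn libraries, the bert-base-uncased model, and invented example sentences and labels.

```python
# Sketch only: contextualized embeddings of 'into' as disambiguation/retrieval tools.
# All sentences, labels, and model choices below are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_into(sentence: str) -> torch.Tensor:
    """Return the last-layer BERT embedding of the first 'into' token in the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return out.last_hidden_state[0, tokens.index("into")]

# Scenario 1: label a small subset of corpus hits and train a sense disambiguation classifier.
labelled = [("I'm so into you.", 1),               # into-interest
            ("She walked into the kitchen.", 0)]   # spatial into
X = torch.stack([embed_into(s) for s, _ in labelled]).numpy()
y = [label for _, label in labelled]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Scenario 2: rank unlabeled hits by cosine similarity to an exemplary into-interest token.
exemplar = embed_into("I'm so into you.")
unlabelled = ["He is really into jazz these days.", "The ball rolled into the street."]
ranked = sorted(unlabelled,
                key=lambda s: torch.cosine_similarity(exemplar, embed_into(s), dim=0).item(),
                reverse=True)
print(clf.predict(X), ranked)
```

In this sketch the embedding of the interest reading of *into* should lie closer to other interest uses than to spatial uses, so the similarity ranking surfaces the relevant hits first.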
Article outline
- 1. Introduction
- 2. Vector-based distributional semantic models
- 3. The challenge: Finding into-interest
- 4. A solution: BERT as a disambiguation tool
- 5. BERT as an exemplar-based retrieval tool
- 6. Conclusion
- Notes
- References
Published online: 28 May 2021
https://doi.org/10.1075/bjl.00035.fon