Edited by Timothy Colleman, Frank Brisard, Astrid De Wit, Renata Enghels, Nikos Koutsoukos, Tanja Mortelmans and María Sol Sansiñena
[Belgian Journal of Linguistics 34] 2020
pp. 66–78
This squib briefly explores how contextualized embeddings – a type of compressed, token-based semantic vector – can be used as semantic retrieval and annotation tools for corpus-based research into constructions. Focusing on embeddings created by the Bidirectional Encoder Representations from Transformers model, also known as ‘BERT’, it demonstrates how contextualized embeddings can help counter two types of retrieval inefficiency that may arise with purely form-based corpus queries. In the first scenario, the formal query yields a large number of hits, which contain a reasonable number of relevant examples that can be labeled and used as input for a sense disambiguation classifier. In the second scenario, the contextualized embeddings of exemplary tokens are used to retrieve more relevant examples from a large, unlabeled dataset. As a case study, the squib focuses on the into-interest construction (e.g. I’m so into you).
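The second retrieval scenario can be sketched in a few lines: rank the tokens returned by a broad form-based query by the cosine similarity between their contextualized embeddings and the embedding of a hand-picked exemplar of the target sense. The sketch below is illustrative only and uses small mock vectors; in an actual study, each vector would be the (768-dimensional) hidden-state representation of the token into in its sentence context, extracted from a model such as BERT (e.g. via the Hugging Face transformers library). The example sentences and vector values are invented for demonstration.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock stand-ins for contextualized token embeddings. In practice these
# would come from a BERT-type model's hidden states for the token 'into'.
exemplar = np.array([0.9, 0.1, 0.2])  # labeled into-interest token, e.g. "I'm so into you"

candidates = {
    "She walked into the room": np.array([0.1, 0.9, 0.3]),  # spatial 'into'
    "He's really into jazz":    np.array([0.8, 0.2, 0.1]),  # interest 'into'
    "Pour it into the glass":   np.array([0.2, 0.8, 0.4]),  # spatial 'into'
}

# Rank unlabeled corpus hits by similarity to the exemplar token, so that
# likely into-interest examples surface at the top of the annotation queue.
ranked = sorted(candidates,
                key=lambda s: cosine_sim(exemplar, candidates[s]),
                reverse=True)
print(ranked[0])  # → "He's really into jazz"
```

The same embeddings could feed the first scenario: once a batch of hits has been labeled, the vectors serve as input features for a sense disambiguation classifier (e.g. logistic regression) that filters the remaining hits automatically.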