Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 208–224
Verbal collocations and pronominalisation
Precise identification of multiword expressions (MWEs) is important for the quality of several NLP applications, including machine translation. Since most MWEs cannot be translated literally, failure to identify them yields, at best, an inaccurate translation. While some expressions are completely frozen and can thus be listed as compound words, others display a considerable, sometimes very large, degree of syntactic flexibility.
In this chapter, we argue not only that structural information is necessary for an adequate treatment of collocations, but also that the detection of collocations can be useful for the parser. For instance, it helps resolve part-of-speech ambiguities as well as some attachment ambiguities. We therefore claim that collocation identification and parsing are interrelated processes.
Section 2 describes the two processes of parsing and collocation detection and their interaction: (i) when and how the collocation identification process is triggered during parsing, and (ii) how the identification of a collocation helps the parser. In Section 3 we describe how anaphora resolution has been implemented in our parsing system to handle cases where the antecedent and the pronoun are within the same sentence or in adjacent sentences. Section 4 focuses on more intricate cases of verbal collocations, in which the nominal element has been pronominalised in the form of a relative pronoun or a personal pronoun. Verb-object collocations with a relative pronoun are extremely frequent and relatively easy for a “deep” parser to handle: in most cases, the relative clause is directly attached to the noun which is part of the collocation. Collocations in which the nominal element takes the form of a personal pronoun are much harder to deal with, as they depend on anaphora resolution, itself a very challenging task. The last section describes an evaluation of the collocation detection procedure, enhanced with anaphora resolution, on a corpus of newspaper articles of about 10 million words.