Empirical variability of Italian multiword expressions as a useful
feature for their categorisation
In contemporary linguistics, the definition of multiword expressions
(MWEs) remains controversial.
It is intuitively clear that some words, when appearing together, have some
“special bond” in terms of meaning (e.g. black hole, mountain chain), or
lexical choice (e.g. strong tea, to fill a form), unlike free
combinations. Nevertheless, the great variety of features and anomalous
behaviours that these expressions exhibit makes it difficult to organise
them into categories and gives rise to a wide range of differing and
sometimes overlapping terminology.
So far, most approaches in corpus linguistics have focused on
automatically extracting MWEs from corpora by using statistical
association measures, while theoretical aspects related to their definition,
typology and behaviours arising from quantitative corpus-based studies have
not been widely explored, especially for languages with a rich morphology
and relatively free word order, such as Italian.
This contribution shows that a systematic analysis of the
empirical behaviour of Italian MWEs in large corpora, with respect to
several parameters, such as syntactic and lexical variations, is useful for
outlining a categorisation of the expressions into homogeneous sets which
approximately correspond to what is intuitively known as multiword units
(“polirematiche” in the Italian lexicographic tradition) and lexical
collocations. The value of this approach is that the resulting
categorisation of MWEs is grounded in empirical data rather than relying on
intuitive and not always coherent linguistic definitions.
The variational features taken into account are (1) the
possibility for the expressions to be syntactically transformed, and (2) the
possibility for one of the components to be replaced with a synonym. Given
an annotated corpus and a list of expressions, these features can be
investigated automatically and quantitatively using purpose-built tools,
whose methodology is fully explained. The kind and the magnitude of the
attested variations appear highly correlated with the grammatical structure
of a given phrase, indicating that the bond between the components of a
multiword unit or a lexical collocation can be formed by activating
different kinds of restrictions, depending on the grammatical pattern
considered.
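As a purely illustrative sketch of how the two variational features could be
quantified, the Python fragment below labels lemmatised occurrences of an
expression as canonical, syntactically varied, or lexically varied, and
reports the share of each. It is not the tool described in the article: the
function names (classify, variation_profile), the toy occurrences and the
synonym table are all assumptions introduced here, and the matching rules
are deliberately simplistic.

```python
from collections import Counter

LABELS = ("canonical", "syntactic", "lexical", "other")


def classify(occurrence, canonical, synonyms):
    """Label one lemmatised occurrence relative to the canonical form.

    Hypothetical, simplified rules (not the article's methodology):
    - "syntactic": every canonical lemma is present, but the sequence
      differs (reordering or intervening material);
    - "lexical": same length, exactly one component swapped for a
      synonym listed in `synonyms`.
    """
    if occurrence == canonical:
        return "canonical"
    if all(lemma in occurrence for lemma in canonical):
        return "syntactic"
    if len(occurrence) == len(canonical):
        swaps = [(c, o) for c, o in zip(canonical, occurrence) if c != o]
        if len(swaps) == 1 and swaps[0][1] in synonyms.get(swaps[0][0], set()):
            return "lexical"
    return "other"


def variation_profile(occurrences, canonical, synonyms):
    """Return the proportion of occurrences carrying each label."""
    counts = Counter(classify(o, canonical, synonyms) for o in occurrences)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Toy data (invented for illustration, using the abstract's English example).
canonical = ("strong", "tea")
synonyms = {"strong": {"powerful"}}
occurrences = [
    ("strong", "tea"),
    ("strong", "tea"),
    ("tea", "be", "strong"),   # same lemmas, different structure
    ("powerful", "tea"),       # one component replaced by a synonym
    ("strong", "coffee"),      # neither kind of variation
]

profile = variation_profile(occurrences, canonical, synonyms)
print(profile)
```

Under these assumptions, an expression whose occurrences are almost all
canonical would behave more like a fixed multiword unit, while higher
syntactic or lexical shares would point towards a looser collocation; the
actual thresholds and criteria are a matter for the article itself.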
Article outline
- 1. Introduction
- 2. Anomalous behaviours of Italian Multiword Expressions
- 3. A quantitative approach to MWEs
- 3.1 Reasons to go beyond statistics
- 3.2 Reasons for an empirical, quantitative approach to MWEs
- 4. Methodology
- 4.1 Syntactic variations
- 4.2 Lexical variations
- 4.3 Inflectional variations
- 5. Analysis and results
- 6. Conclusion
- Notes
- Bibliography