604026304 03 01 01 JB John Benjamins Publishing Company 01 JB code IVITRA 24 Eb 15 9789027261397 06 10.1075/ivitra.24 13 2019057309 DG 002 02 01 IVITRA 02 2211-5412 IVITRA Research in Linguistics and Literature 24 <TitleType>01</TitleType> <TitleText textformat="02">Computational Phraseology</TitleText> 01 ivitra.24 01 https://benjamins.com 02 https://benjamins.com/catalog/ivitra.24 1 B01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor University of Malaga 2 B01 Jean-Pierre Colson Colson, Jean-Pierre Jean-Pierre Colson University of Louvain 01 eng 339 xi 327 LAN009060 v.2006 CFK 2 24 JB Subject Scheme LIN.COMPUT Computational & corpus linguistics 24 JB Subject Scheme LIN.SYNTAX Syntax 24 JB Subject Scheme LIN.THEOR Theoretical linguistics 06 01 Whether you wish to <i>deliver on a promise, take a walk down memory lane</i> or even <i>on the wild side</i>, phraseological units (also often referred to as phrasemes or multiword expressions) are present in most communicative situations and in all the world’s languages. <i>Phraseology</i>, the study of phraseological units, has therefore become a rare unifying theme across linguistic theories.<br />In recent years, an increasing number of studies have been concerned with the computational treatment of multiword expressions: these pertain, among others, to their automatic identification, extraction or translation, and to the role they play in various Natural Language Processing applications. Computational Phraseology is a comparatively new field where better understanding and more advances are urgently needed. This book aims to address this pressing need by bringing together contributions focusing on different perspectives on this promising interdisciplinary field.
04 09 01 https://benjamins.com/covers/475/ivitra.24.png 04 03 01 https://benjamins.com/covers/475_jpg/9789027205353.jpg 04 03 01 https://benjamins.com/covers/475_tif/9789027205353.tif 06 09 01 https://benjamins.com/covers/1200_front/ivitra.24.hb.png 07 09 01 https://benjamins.com/covers/125/ivitra.24.png 25 09 01 https://benjamins.com/covers/1200_back/ivitra.24.hb.png 27 09 01 https://benjamins.com/covers/3d_web/ivitra.24.hb.png 10 01 JB code ivitra.24.forvil vii xii 6 Chapter 1 <TitleType>01</TitleType> <TitleText textformat="02">Foreword</TitleText> 1 A01 Aline Villavicencio Villavicencio, Aline Aline Villavicencio 10 01 JB code ivitra.24.00pas 1 8 8 Chapter 2 <TitleType>01</TitleType> <TitleText textformat="02">Introduction</TitleText> 1 A01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor Universidad de Málaga 2 A01 Jean-Pierre Colson Colson, Jean-Pierre Jean-Pierre Colson Université Catholique de Louvain 10 01 JB code ivitra.24.01cer 9 22 14 Chapter 3 <TitleType>01</TitleType> <TitleText textformat="02">Monocollocable words</TitleText> <Subtitle textformat="02">A type of language combinatory periphery</Subtitle> 1 A01 František Čermák Čermák, František František Čermák Charles University 20 collocation 20 combination 20 corpus 20 distribution 20 monocollocable 20 periphery 01 How often do people, even native speakers, wonder, on hearing a familiar proverb, such as Much Ado about Nothing, what ado in this proverb really means? Most will know the proverb, but their knowledge of ado is often restricted to a particular lexical neighbourhood, and they do not realise that the word is in fact strongly and prohibitively limited to it. It is not common to give much thought to words in combination and to the modes of their combination, or to realise that the use of some, such as auspices, aback, standstill and ado, may depend not on how speakers would like to use them or what they choose to say but on what the language dictates to users, that is, on the way they must be used.
This does not mean that there is much liberty in the use of other words either, but there the limitations are not as immediately obvious as in this case: here, words are severely restricted in their usage to one or a few combinations only. These monocollocable words (as they are termed here), probably to be found in all languages, are an obstacle to understanding a foreign language, while textbooks and dictionaries never really warn the user of the difficulty of using them correctly. 10 01 JB code ivitra.24.02mon 23 42 20 Chapter 4 <TitleType>01</TitleType> <TitleText textformat="02">Translation asymmetries of multiword expressions in machine translation</TitleText> <Subtitle textformat="02">An analysis of the TED-MWE corpus</Subtitle> 1 A01 Johanna Monti Monti, Johanna Johanna Monti Università degli Studi di Napoli "L'Orientale" 2 A01 Mihael Arcan Arcan, Mihael Mihael Arcan Insight Centre for Data Analytics 3 A01 Federico Sangati Sangati, Federico Federico Sangati Università degli Studi di Napoli "L'Orientale" 20 machine translation 20 multiword expressions 20 TED-MWE corpus 20 translation asymmetries 01 Machine Translation (MT) is now extensively used both as a tool to overcome language barriers on the internet and as a professional tool to translate technical documentation. The technology has rapidly evolved in recent years thanks to the availability of large amounts of data in digital format and in particular parallel corpora, which are used to train Statistical Machine Translation (SMT) tools. The quality of MT has considerably improved but the translation of multiword expressions (MWEs) still represents a big and open challenge, both from a theoretical and a practical point of view (Monti, 2013).
We define MWEs as any group of two or more words or terms in a language lexicon that generally conveys a single meaning, such as the Italian expressions <i>anima gemella</i> (soul mate), <i>carta di credito</i> (credit card), <i>acqua e sapone</i> (water and soap), <i>piovere a catinelle</i> (rain cats and dogs). The persistence of mistranslation of MWEs in MT outputs originates from their lexical, syntactic, semantic, pragmatic but also translational idiomaticity. Therefore, there is a need to invest in further research in order to achieve significant improvements in MT and translation technologies. In particular, it is important to develop resources, mainly MWE-annotated corpora, which can be used for both MT training and evaluation purposes (Monti and Todirascu, 2016). <br />This work focuses on the translation asymmetries between English and Italian MWEs, and how they affect SMT performance. By translation asymmetries we mean the differences which may occur between an MWE in a source language and its equivalent in the target language, as in many-to-many word translations (En. <i>to be in a position to</i> → It. <i>essere in grado di</i>), many-to-one (En. <i>to set free</i> → It. <i>liberare</i>) and finally one-to-many correspondences (En. <i>overcooked</i> → It. <i>cotto troppo</i>). This chapter describes the evaluation of mistranslations caused by translation asymmetries concerning multiword expressions detected in the TED-MWE corpus (<uri href="http://tiny.cc/TED_MWE">http://tiny.cc/TED_MWE</uri>), which contains 1,500 sentences and 31,000 EN tokens. This corpus is a subset of the TED spoken corpus (Monti et al., 2015) annotated with all the MWEs detected during the evaluation process. The corpus contains the following information: (i) the English source text, (ii) the Italian human translations (from the parallel corpus), and (iii) the Italian SMT output.
All the annotators were Italian native speakers with a good knowledge of the English language and with a background in linguistics and computational linguistics. They were asked to identify all MWEs in the source text together with their translations in approximately 300 random sentences each and to evaluate the automatic translation correctness. The identified MWEs and the evaluation of both the human and the machine translation are also recorded in the corpus. This chapter will discuss (i) the related work concerning the impact of anisomorphism (the absence of an exact correspondence between words in two different languages) and the consequent translation asymmetries on MWEs translation quality in MT, (ii) the corpus, (iii) the annotation guidelines, (iv) the methodology adopted during the annotation process (Monti et al., 2015), (v) the results of the annotation and finally (vi) the evaluation of translation asymmetries in the corpus and ideas for future work. 10 01 JB code ivitra.24.03dob 43 64 22 Chapter 5 <TitleType>01</TitleType> <TitleText textformat="02">German constructional phrasemes and their Russian counterparts</TitleText> <Subtitle textformat="02">A corpus-based study</Subtitle> 1 A01 Dmitrij Dobrovol’skij Dobrovol’skij, Dmitrij Dmitrij Dobrovol’skij Russian Language Institute and Institute of Linguistics, Russian Academy of Sciences/Stockholm University 20 construction grammar 20 constructional phraseme 20 corpora 20 deictic elements 20 German 20 lexicography 20 phraseology 20 Russian 01 In this article I examine a group of semi-fixed German expressions that are irregular with regard to the relationship between form and meaning, namely constructional phrasemes with the deictic elements <i>her</i> ‘hither’ and <i>hin</i> ‘thither’ [<i>vor sich her</i> + V] and [<i>vor sich hin</i> + V]. These constructions pose considerable difficulties not only for the description of their semantics, but also for translation into other languages. 
Languages such as Russian, English and French do not have exact equivalents of the German deictic elements <i>hin</i> and <i>her</i>. In cases where the German deictic elements <i>her</i> and <i>hin</i> are constituents of relatively fixed and irregular constructions, their meaning fits even less well their standard definition. Using corpus examples, I propose a means of describing these constructional phrasemes in a German-Russian dictionary. 10 01 JB code ivitra.24.04col 65 82 18 Chapter 6 <TitleType>01</TitleType> <TitleText textformat="02">Computational phraseology and translation studies</TitleText> <Subtitle textformat="02">From theoretical hypotheses to practical tools</Subtitle> 1 A01 Jean-Pierre Colson Colson, Jean-Pierre Jean-Pierre Colson Université catholique de Louvain 20 computational linguistics 20 interpreting 20 phraseology 20 text mining 20 translation 01 The notion of phraseology is now used across a wide range of linguistic disciplines but it is conspicuously absent from most studies in the area of Translation Studies (e.g. Delisle, 2003; Baker and Saldanha, 2011). The paradox is that many practical difficulties encountered by translators and interpreters are directly related to phraseology in the broad sense (Colson, 2008, 2013), and this can also clearly be seen in the failure of machine translation systems to deal efficiently with the translation of phraseological units (PUs). <br />We argue that phraseology and translation studies have much to gain from cross fertilisation, because both disciplines are regularly criticised for their lack of coherent terminological description and for the insufficient number of reproducible experiments they involve. <br />Decoding phraseology in the source text is far from easy for translators and interpreters, all the more so as they are usually not native speakers of the source language. 
Finding a natural formulation in the target language and avoiding <i>translationese</i> requires an excellent mastery of the phraseology of the target language. Even experienced professionals sometimes fail to detect the fixed or semi-fixed character of a source text construction. We argue that algorithms derived from text mining and information retrieval techniques can be efficient and (computationally) cost-effective in order to build up unfiltered collections of recurrent fixed or semi-fixed phrases, from which translators could gain information about the number of PUs in the source text. Such an algorithm has been proposed in Colson (2016) and has been implemented in a web application enabling translators and language professionals to automatically retrieve most PUs from a source text. Other tools should be developed in order to bridge the gap between the findings of computational phraseology and the practice of translation and interpreting. 10 01 JB code ivitra.24.05wah 83 110 28 Chapter 7 <TitleType>01</TitleType> <TitleText textformat="02">Computational extraction of formulaic sequences from corpora</TitleText> <Subtitle textformat="02">Two case studies of a new extraction algorithm</Subtitle> 1 A01 Alexander Wahl Wahl, Alexander Alexander Wahl Donders Institute for Brain, Cognition and Behaviour, Radboud University 2 A01 Stefan Th. Gries Gries, Stefan Th. Stefan Th. Gries University of California Santa Barbara/Justus Liebig University 20 adjusted frequency list 20 child language 20 collocation extraction 20 formulaic sequences 20 lexical association 20 MERGE 01 We describe a new algorithm for the extraction of formulaic language from corpora. Entitled MERGE (Multi-word Expressions from the Recursive Grouping of Elements), it iteratively combines adjacent bigrams into progressively longer sequences based on lexical association strengths. We then provide empirical evidence for this approach via two case studies. 
First, we compare the performance of MERGE to that of another algorithm by examining the outputs of the approaches compared with manually annotated formulaic sequences from the spoken component of the British National Corpus. Second, we employ two child language corpora to examine whether MERGE can predict the formulas that the children learn based on caregiver input. Ultimately, we show that MERGE indeed performs well, offering a powerful approach for the extraction of formulas. 10 01 JB code ivitra.24.06ram 111 134 24 Chapter 8 <TitleType>01</TitleType> <TitleText textformat="02">Computational phraseology discovery in corpora with the mwetoolkit</TitleText> 1 A01 Carlos Ramisch Ramisch, Carlos Carlos Ramisch 20 association scores 20 automatic phraseology discovery 20 morphosyntactic patterns 20 mwetoolkit 20 phraseological units 01 Computer tools can help discover new phraseological units in corpora, thanks to their ability to quickly draw statistics from large amounts of textual data. While the research community has focused on developing and evaluating original algorithms for the automatic discovery of phraseological units, little has been done to transform these sophisticated methods into usable software. In this chapter, we present a brief survey of the main approaches to computational phraseology available. Furthermore, we provide worked-out examples of how to apply these methods using the mwetoolkit, free software for the discovery and identification of multiword expressions. The usefulness of the automatically extracted units depends on various factors such as language, corpus size, target units, and available taggers and parsers. Nonetheless, the mwetoolkit allows fine-grained tuning so that this variability is taken into account, adapting the tool to the specificities of each lexicographic environment.
10 01 JB code ivitra.24.07dur 135 150 16 Chapter 9 <TitleType>01</TitleType> <TitleText textformat="02">Multiword expressions in comparable corpora</TitleText> 1 A01 Peter Ďurčo Ďurčo, Peter Peter Ďurčo University of SS. Cyril and Methodius in Trnava 20 comparable corpora 20 compatible Sketch Grammars 20 multiword expressions 20 universal tagset 01 On the basis of the Aranea Gigaword Web corpora, a family of comparable corpora intended for use in contrastive linguistic research, multilingual lexicography, language teaching and translation studies, we discuss the pros and cons of comparable corpora in contrast to monolingual and parallel corpora for the analysis of multiword entities (MWEs). We demonstrate that by using large corpora for two or more languages, consisting of unrelated texts, yet created in a comparable manner, parallel language structures and phenomena like MWEs can be identified if the appropriate tools are employed. With the Aranea corpora, the “bilingual sketch” functionality of the Sketch Engine is one such tool, which provides a new approach for analyses of similarities of (or differences between) collocation profiles (word sketches) for words and their translation equivalents.
10 01 JB code ivitra.24.08lho 151 176 26 Chapter 10 <TitleType>01</TitleType> <TitleText textformat="02">Collecting collocations from general and specialised corpora</TitleText> <Subtitle textformat="02">A comparative analysis</Subtitle> 1 A01 Marie-Claude L'Homme L'Homme, Marie-Claude Marie-Claude L'Homme Observatoire de linguistique Sens-Texte, Université de Montréal 2 A01 Daphnée Azoulay Azoulay, Daphnée Daphnée Azoulay Observatoire de linguistique Sens-Texte, Université de Montréal 20 classe sémantique 20 Collocation 20 Collocations 20 corpus général 20 corpus spécialisé 20 general corpus 20 lexicographie 20 lexicography 20 semantic class 20 specialised corpus 20 terminologie 20 terminology 01 Collocations are increasingly taken into account in general and specialised repositories, and methodologies to collect them are heavily based on corpora. However, lexicographers and terminologists use different kinds of corpora in which combinations are likely to behave according to specific rules and/or patterns. This contribution presents a comparative analysis of the collocational behaviour of 15 lexical items found in a general language corpus and a specialised corpus on the theme of the environment. We automatically extracted large sets of collocates (three lists of 50 collocates) for each lexical item and from each corpus and analysed different facets of collocational behaviour: polysemy of lexical items, characteristics of collocates (overlap, rank and semantic classes of collocates, etc.). Our aim is to draw the attention of terminologists and lexicographers to some specific factors affecting the behaviour of collocations in specialised and general corpora.
10 01 JB code ivitra.24.09mit 177 188 12 Chapter 11 <TitleType>01</TitleType> <TitleText textformat="02">What matters more: The size of the corpora or their quality?</TitleText> <Subtitle textformat="02">The case of automatic translation of multiword expressions using comparable corpora</Subtitle> 1 A01 Ruslan Mitkov Mitkov, Ruslan Ruslan Mitkov University of Wolverhampton 2 A01 Shiva Taslimipoor Taslimipoor, Shiva Shiva Taslimipoor University of Wolverhampton 20 automatic translation 20 comparable corpora 20 multiword expressions 20 size of corpora 20 vector representations 01 This study investigates (and compares) the impact of the size and the similarity/quality of comparable corpora on the specific task of extracting translation equivalents of verb-noun collocations from such corpora. The comprehensive evaluation of different configurations of English and Spanish corpora sheds some light on the more general and perennial question: what matters more – the quantity or quality of corpora? 10 01 JB code ivitra.24.10oak 189 206 18 Chapter 12 <TitleType>01</TitleType> <TitleText textformat="02">Statistical significance for measures of collocation strength</TitleText> 1 A01 Michael P. Oakes Oakes, Michael P. Michael P. Oakes University of Wolverhampton 20 collocation strength 20 Monte Carlo Methods 20 Poisson Distribution 20 statistical significance 01 Of the commonly-used measures of lexical association or collocation strength, only some directly relate to statistical significance: the t-score, chi-squared, log-likelihood, the z-score and Fisher’s exact test. We describe each of these tests, and also describe a computer simulation by which we can derive confidence limits, and hence the statistical significance, of any measure of lexical association which is derived from the contingency table. We illustrate this approach using pointwise mutual information (PMI). 
We also describe how the Poisson distribution enables us to find the statistical significance of the raw frequency with which a collocation is found. We compare all these methods using collocates of “take”, namely “take up”, “take place”, “take advantage” and “take stock”. 10 01 JB code ivitra.24.11weh 207 224 18 Chapter 13 <TitleType>01</TitleType> <TitleText textformat="02">Verbal collocations and pronominalisation</TitleText> 1 A01 Eric Wehrli Wehrli, Eric Eric Wehrli University of Geneva 2 A01 Violeta Seretan Seretan, Violeta Violeta Seretan University of Geneva 3 A01 Luka Nerima Nerima, Luka Luka Nerima University of Geneva 20 anaphora resolution 20 collocation 20 deep parsing 20 multiword expressions 20 pronominalisation 01 Precise identification of multiword expressions (MWEs) is an important qualitative step for several NLP applications, including machine translation. Since most MWEs cannot be translated literally, failure to identify them yields, at best, inaccurate translation. While some expressions are completely frozen and thus can be listed as compound words, others display a sometimes very large degree of syntactic flexibility. <br />In this chapter, we argue not only that structural information is necessary for an adequate treatment of collocations, but also that the detection of collocations can be useful for the parser. For instance, it is very useful for solving part-of-speech ambiguities and also some attachment ambiguities. We therefore claim that collocation identification and parsing are interrelated processes. <br />Section 2 describes the two processes of parsing and collocation detection and their interaction, (i) when and how the collocation identification process is triggered during parsing, and (ii) how the identification of a collocation helps the parser. 
In Section 3 we describe how anaphora resolution has been implemented in our parsing system, to handle cases where the antecedent and the pronoun are within the same sentence or in adjacent sentences. Section 4 focuses on more intricate cases of verbal collocations where their nominal element has been pronominalised, in the form of a relative pronoun or a personal pronoun. Verb-object collocations with a relative pronoun are extremely frequent and relatively easy to handle for a “deep” parser. In most cases, the relative clause is directly attached to the noun which is part of the collocation. Collocations in which the nominal element takes the form of a personal pronoun are much harder to deal with, as they depend on the process of anaphora resolution, a very challenging task. The last section describes an evaluation of the collocation detection procedure, enhanced with anaphora resolution using a corpus of newspaper articles of about 10 million words. 10 01 JB code ivitra.24.12squ 225 246 22 Chapter 14 <TitleType>01</TitleType> <TitleText textformat="02">Empirical variability of Italian multiword expressions as a useful feature for their categorisation</TitleText> 1 A01 Luigi Squillante Squillante, Luigi Luigi Squillante Sapienza - Università di Roma 20 categorisation 20 collocation 20 multiword expressions 20 PAISÀ corpus 20 semantic variation 01 In contemporary linguistics the definition of those entities which are referred to as multiword expressions (MWEs) remains controversial. It is intuitively clear that some words, when appearing together, have some “special bond” in terms of meaning (e.g. black hole, mountain chain), or lexical choice (e.g. strong tea, to fill a form), contrary to free combinations. Nevertheless, the great variety of features and anomalous behaviours that these expressions exhibit makes it difficult to organise them into categories and gives rise to a great amount of different and sometimes overlapping terminology. 
<br />So far, most approaches in corpus linguistics have focused on trying to automatically extract MWEs from corpora by using statistical association measures, while theoretical aspects related to their definition, typology and behaviours arising from quantitative corpus-based studies have not been widely explored, especially for languages with a rich morphology and relatively free word order, such as Italian. <br />This contribution attests that a systematic analysis of the empirical behaviour of Italian MWEs in large corpora, with respect to several parameters, such as syntactic and lexical variations, is useful for outlining a categorisation of the expressions in homogeneous sets which approximately correspond to what is intuitively known as multiword units (“polirematiche” in the Italian lexicographic tradition) and lexical collocations. The importance of this kind of approach is that the resulting categorisation of MWEs is grounded on empirical data rather than relying on intuitive and not-always-coherent linguistic definitions. <br />The variational features taken into account are (1) the possibility for the expressions to be syntactically transformed, and (2) the possibility for one of the components to be replaced with a synonym. These features can be automatically and quantitatively investigated using <i>ad hoc</i> designed tools, whose methodology is fully explained, if an annotated corpus and a list of expressions are provided. It is possible to show that the kind of attested variations and the magnitude of variation appear highly correlated to the grammatical structure of a given phrase, indicating that the bond between the components for a multiword unit or a lexical collocation can be formed by activating different kinds of restrictions, depending on the considered grammatical pattern.
10 01 JB code ivitra.24.13ste 247 272 26 Chapter 15 <TitleType>01</TitleType> <TitleText textformat="02"><i>Too big to fail</i> but <i>big enough to pay for their mistakes</i></TitleText> <Subtitle textformat="02">A collostructional analysis of the patterns [ <i>too</i> ADJ <i>to</i> V] and [ADJ <i>enough to</i> V]</Subtitle> 1 A01 Anatol Stefanowitsch Stefanowitsch, Anatol Anatol Stefanowitsch Freie Universität Berlin 2 A01 Susanne Flach Flach, Susanne Susanne Flach Université de Neuchâtel 20 association 20 collocations 20 Collostructional Analysis 20 collostructions 20 Co-Varying Collexeme Analysis 20 Distinctive Collexeme Analysis 20 Distinctive Co-varying Collexeme Analysis 20 Simple Collexeme Analysis 01 In this paper, we illustrate the usefulness of the family of methods collectively known as Collostructional Analysis for phraseological research. Investigating two patterns, [<i>too</i> ADJ <i>to</i> V] and [ADJ <i>enough to</i> V], we show how a technique originally developed for the investigation of words and constructions can be fruitfully applied to issues pertinent to phraseology, such as the co-existence of compositional and idiomatic semantics and the analysis of semantically complementary patterns more generally. To this end, we use the three conventional methods (Simple, Distinctive and Co-varying Collexeme Analyses) and propose a novel extension (Distinctive Co-varying Collexeme Analysis) particularly suitable for the investigation of complementary patterns. We show that collostructional analysis is suitable for confirming hypotheses derived from qualitative analyses, as well as uncovering subtle differences that are otherwise inaccessible for non-empirical research. 
10 01 JB code ivitra.24.14ste 273 296 24 Chapter 16 <TitleType>01</TitleType> <TitleText textformat="02">Multi-word patterns and networks</TitleText> <Subtitle textformat="02">How corpus-driven approaches have changed our description of language use</Subtitle> 1 A01 Kathrin Steyer Steyer, Kathrin Kathrin Steyer Institut für Deutsche Sprache 20 German reference corpus 20 language fixedness 20 multiword expressions 20 pattern-based lexicography 20 phraseology 01 This paper discusses a theoretical and empirical approach to language fixedness that we have developed at the Institut für Deutsche Sprache (IDS) (‘Institute for German Language’) in Mannheim in the project Usuelle Worterbindungen (UWV) over the last decade. The analysis described is based on the Deutsches Referenzkorpus (‘German Reference Corpus’; DeReKo) which is located at the IDS. The corpus analysis tool used for accessing the corpus data is COSMAS II (CII) and – for statistical analysis – the IDS collocation analysis tool (Belica, 1995; CA). For detecting lexical patterns and describing their semantic and pragmatic nature we use the tool lexpan (or ‘Lexical Pattern Analyzer’) that was developed in our project. We discuss a new corpus-driven pattern dictionary that is relevant not only to the field of phraseology, but also to usage-based linguistics and lexicography as a whole. 
10 01 JB code ivitra.24.15han 297 310 14 Chapter 17 <TitleType>01</TitleType> <TitleText textformat="02">How context determines meaning</TitleText> 1 A01 Patrick Hanks Hanks, Patrick Patrick Hanks RIILP, University of Wolverhampton (WLV) and BCL, University of the West of England (UWE) 20 collocation 20 corpus pattern analysis (CPA) 20 lexical sets 20 meaning potential 20 valency 01 It is an extraordinary fact that, although most speakers and writers of the English language (or, we may presume, any other language) believe that they are capable of expressing any meaning that they want to with considerable precision, the behaviour of the words they use is highly variable, with much variation in phraseology as well as subtle semantic distinctions. Even more extraordinary is the fact that only some of the logically predictable variants of any given phrase are accepted by native speakers as idiomatic. <br />This chapter shows how meanings are associated with phraseological norms rather than with words in isolation. It also illustrates the phenomenon of alternation among phraseological norms and shows how phraseological norms are not merely conformed to, but also exploited creatively in ordinary language use. Underlying this paper is the proposition that words in isolation do not have a determinable meaning per se. Instead they have <b>meaning potential</b>, different facets of which are activated in different contexts. <br />By detailed corpus pattern analysis of the verb <i>blow</i>, which typically expresses the causation of movement, we explore the relationship between core meaning and a rich set of patterns of idiomatic phraseology – phrasal verbs, idioms, and proverbs. 
10 01 JB code ivitra.24.16tas 311 324 14 Chapter 18 <TitleType>01</TitleType> <TitleText textformat="02">Detecting semantic difference</TitleText> <Subtitle textformat="02">A new model based on knowledge and collocational association</Subtitle> 1 A01 Shiva Taslimipoor Taslimipoor, Shiva Shiva Taslimipoor Research Group in Computational Linguistics, University of Wolverhampton 2 A01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor Research Group in Computational Linguistics, University of Wolverhampton/University of Malaga 3 A01 Omid Rohanian Rohanian, Omid Omid Rohanian Research Group in Computational Linguistics, University of Wolverhampton 20 association measures 20 collocation 20 Concept-Net relations 20 n-gram counts 20 semantic difference 20 semantic modelling 20 word2vec 01 Semantic discrimination among concepts is a daily exercise for humans when using natural languages. For example, given the words <i>airplane</i> and <i>car</i>, the word <i>flying</i> can easily be thought of and used as an attribute to differentiate them. In this study, we propose a novel automatic approach to detect whether an attribute word represents the difference between two given words. We exploit a combination of knowledge-based and co-occurrence features (collocations) to capture the semantic difference between two words in relation to an attribute. The features are scores that are defined for each pair of words and an attribute, based on association measures, n-gram counts, word similarity, and Concept-Net relations. Based on these features we designed a system that runs several experiments on a SemEval-2018 dataset. The experimental results indicate that the proposed model performs better than, or at least comparably to, other systems evaluated on the same data for this task.
10 01 JB code ivitra.24.index 325 327 3 Miscellaneous 19 <TitleType>01</TitleType> <TitleText textformat="02">Index</TitleText> 02 JBENJAMINS John Benjamins Publishing Company 01 John Benjamins Publishing Company Amsterdam/Philadelphia NL 04 20200508 2020 John Benjamins B.V. 02 WORLD 13 15 9789027205353 01 JB 3 John Benjamins e-Platform 03 jbe-platform.com 09 WORLD 21 01 00 99.00 EUR R 01 00 83.00 GBP Z 01 gen 00 149.00 USD S 923026303 03 01 01 JB John Benjamins Publishing Company 01 JB code IVITRA 24 Hb 15 9789027205353 13 2019057308 BB 01 IVITRA 02 2211-5412 IVITRA Research in Linguistics and Literature 24 <TitleType>01</TitleType> <TitleText textformat="02">Computational Phraseology</TitleText> 01 ivitra.24 01 https://benjamins.com 02 https://benjamins.com/catalog/ivitra.24 1 B01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor University of Malaga 2 B01 Jean-Pierre Colson Colson, Jean-Pierre Jean-Pierre Colson University of Louvain 01 eng 339 xi 327 LAN009060 v.2006 CFK 2 24 JB Subject Scheme LIN.COMPUT Computational & corpus linguistics 24 JB Subject Scheme LIN.SYNTAX Syntax 24 JB Subject Scheme LIN.THEOR Theoretical linguistics 06 01 Whether you wish to <i>deliver on a promise, take a walk down memory lane</i> or even <i>on the wild side</i>, phraseological units (also often referred to as phrasemes or multiword expressions) are present in most communicative situations and in all world’s languages. <i>Phraseology</i>, the study of phraseological units, has therefore become a rare unifying theme across linguistic theories.<br />In recent years, an increasing number of studies have been concerned with the computational treatment of multiword expressions: these pertain among others to their automatic identification, extraction or translation, and to the role they play in various Natural Language Processing applications. Computational Phraseology is a comparatively new field where better understanding and more advances are urgently needed. 
This book aims to address this pressing need, by bringing together contributions focusing on different perspectives of this promising interdisciplinary field. 04 09 01 https://benjamins.com/covers/475/ivitra.24.png 04 03 01 https://benjamins.com/covers/475_jpg/9789027205353.jpg 04 03 01 https://benjamins.com/covers/475_tif/9789027205353.tif 06 09 01 https://benjamins.com/covers/1200_front/ivitra.24.hb.png 07 09 01 https://benjamins.com/covers/125/ivitra.24.png 25 09 01 https://benjamins.com/covers/1200_back/ivitra.24.hb.png 27 09 01 https://benjamins.com/covers/3d_web/ivitra.24.hb.png 10 01 JB code ivitra.24.forvil vii xii 6 Chapter 1 <TitleType>01</TitleType> <TitleText textformat="02">Foreword</TitleText> 1 A01 Aline Villavicencio Villavicencio, Aline Aline Villavicencio 10 01 JB code ivitra.24.00pas 1 8 8 Chapter 2 <TitleType>01</TitleType> <TitleText textformat="02">Introduction</TitleText> 1 A01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor Universidad de Málaga 2 A01 Jean-Pierre Colson Colson, Jean-Pierre Jean-Pierre Colson Université Catholique de Louvain 10 01 JB code ivitra.24.01cer 9 22 14 Chapter 3 <TitleType>01</TitleType> <TitleText textformat="02">Monocollocable words</TitleText> <Subtitle textformat="02">A type of language combinatory periphery</Subtitle> 1 A01 František Čermák Čermák, František František Čermák Charles University 20 collocation 20 combination 20 corpus 20 distribution 20 monocollocable 20 periphery 01 How often do people, even native speakers, wonder, on hearing a familiar proverb, such as Much Ado about Nothing, what ado in this proverb really means? Most will know the proverb but their knowledge of ado is often restricted to a particular lexical neighbourhood without realising that it is in fact strongly and prohibitively limited to it in this way. 
It is not common to give much thought to words in combinations and modes of their combination and realise that some, such as auspices, aback, standstill, ado, may not depend on how the speaker would like to use them and what they choose to say but on what the language dictates to users, that is, the way in which they must be used. This does not mean that there is much liberty in the use of other words either but these limitations are not immediately obvious as in this case: here, words are in their usage severely restricted to one or a few combinations only. These monocollocable words (as they are termed here), to be found, probably, in all languages, are an obstacle to understanding a foreign language, while, on the other hand, textbooks and dictionaries never really give the user much warning that there is a difficulty related to them if these should be used correctly. 10 01 JB code ivitra.24.02mon 23 42 20 Chapter 4 <TitleType>01</TitleType> <TitleText textformat="02">Translation asymmetries of multiword expressions in machine translation</TitleText> <Subtitle textformat="02">An analysis of the TED-MWE corpus</Subtitle> 1 A01 Johanna Monti Monti, Johanna Johanna Monti Università degli Studi di Napoli "L'Orientale" 2 A01 Mihael Arcan Arcan, Mihael Mihael Arcan Insight Centre for Data Analytics 3 A01 Federico Sangati Sangati, Federico Federico Sangati Università degli Studi di Napoli "L'Orientale" 20 machine translation 20 multiword expressions 20 TED-MWE corpus 20 translation asymmetries 01 Machine Translation (MT) is now extensively used both as a tool to overcome language barriers on the internet and as a professional tool to translate technical documentation. The technology has rapidly evolved in recent years thanks to the availability of large amounts of data in digital format and in particular parallel corpora, which are used to train Statistical Machine Translation (SMT) tools. 
The quality of MT has considerably improved but the translation of multiword expressions (MWEs) still represents a big and open challenge, both from a theoretical and a practical point of view (Monti, 2013). We define MWEs as any group of two or more words or terms in a language lexicon that generally conveys a single meaning, such as the Italian expressions <i>anima gemella</i> (soul mate), <i>carta di credito</i> (credit card), <i>acqua e sapone</i> (water and soap), <i>piovere a catinelle</i> (rain cats and dogs). The persistence of mistranslation of MWEs in MT outputs originates from their lexical, syntactic, semantic, pragmatic but also translational idiomaticity. Therefore, there is a need to invest in further research in order to achieve significant improvements in MT and translation technologies. In particular, it is important to develop resources, mainly MWE-annotated corpora, which can be used for both MT training and evaluation purposes (Monti and Todirascu, 2016). <br />This work focuses on the translation asymmetries between English and Italian MWEs, and how they affect the SMT performance. By translation asymmetries we mean the differences which may occur between an MWE in a source language and its equivalent in the target language, like in many-to-many word translations (En. <i>to be in a position to</i> → It. <i>essere in grado di</i>), many-to-one (En. <i>to set free</i> → It. <i>liberare</i>) and finally one-to-many correspondences (En. <i>overcooked</i> → It. <i>cotto troppo</i>). This chapter describes the evaluation of mistranslations caused by translation asymmetries concerning multiword expressions detected in the TED-MWE corpus (<uri href="http://tiny.cc/TED_MWE">http://tiny.cc/TED_MWE</uri>), which contains 1,500 sentences and 31,000 EN tokens. This corpus is a subset of the TED spoken corpus (Monti et al., 2015) annotated with all the MWEs detected during the evaluation process. 
The corpus contains the following information: (i) the English source text, (ii) the Italian human translations (from the parallel corpus), and (iii) the Italian SMT output. All the annotators were Italian native speakers with a good knowledge of the English language and with a background in linguistics and computational linguistics. They were asked to identify all MWEs in the source text together with their translations in approximately 300 random sentences each and to evaluate the automatic translation correctness. The identified MWEs and the evaluation of both the human and the machine translation are also recorded in the corpus. This chapter will discuss (i) the related work concerning the impact of anisomorphism (the absence of an exact correspondence between words in two different languages) and the consequent translation asymmetries on MWEs translation quality in MT, (ii) the corpus, (iii) the annotation guidelines, (iv) the methodology adopted during the annotation process (Monti et al., 2015), (v) the results of the annotation and finally (vi) the evaluation of translation asymmetries in the corpus and ideas for future work. 
10 01 JB code ivitra.24.03dob 43 64 22 Chapter 5 <TitleType>01</TitleType> <TitleText textformat="02">German constructional phrasemes and their Russian counterparts</TitleText> <Subtitle textformat="02">A corpus-based study</Subtitle> 1 A01 Dmitrij Dobrovol’skij Dobrovol’skij, Dmitrij Dmitrij Dobrovol’skij Russian Language Institute and Institute of Linguistics, Russian Academy of Sciences/Stockholm University 20 construction grammar 20 constructional phraseme 20 corpora 20 deictic elements 20 German 20 lexicography 20 phraseology 20 Russian 01 In this article I examine a group of semi-fixed German expressions that are irregular with regard to the relationship between form and meaning, namely constructional phrasemes with the deictic elements <i>her</i> ‘hither’ and <i>hin</i> ‘thither’ [<i>vor sich her</i> + V] and [<i>vor sich hin</i> + V]. These constructions pose considerable difficulties not only for the description of their semantics, but also for translation into other languages. Languages such as Russian, English and French do not have exact equivalents of the German deictic elements <i>hin</i> and <i>her</i>. In cases where the German deictic elements <i>her</i> and <i>hin</i> are constituents of relatively fixed and irregular constructions, their meaning fits even less well their standard definition. Using corpus examples, I propose a means of describing these constructional phrasemes in a German-Russian dictionary. 
10 01 JB code ivitra.24.04col 65 82 18 Chapter 6 <TitleType>01</TitleType> <TitleText textformat="02">Computational phraseology and translation studies</TitleText> <Subtitle textformat="02">From theoretical hypotheses to practical tools</Subtitle> 1 A01 Jean-Pierre Colson Colson, Jean-Pierre Jean-Pierre Colson Université catholique de Louvain 20 computational linguistics 20 interpreting 20 phraseology 20 text mining 20 translation 01 The notion of phraseology is now used across a wide range of linguistic disciplines but it is conspicuously absent from most studies in the area of Translation Studies (e.g. Delisle, 2003; Baker and Saldanha, 2011). The paradox is that many practical difficulties encountered by translators and interpreters are directly related to phraseology in the broad sense (Colson, 2008, 2013), and this can also clearly be seen in the failure of machine translation systems to deal efficiently with the translation of phraseological units (PUs). <br />We argue that phraseology and translation studies have much to gain from cross fertilisation, because both disciplines are regularly criticised for their lack of coherent terminological description and for the insufficient number of reproducible experiments they involve. <br />Decoding phraseology in the source text is far from easy for translators and interpreters, all the more so as they are usually not native speakers of the source language. Finding a natural formulation in the target language and avoiding <i>translationese</i> requires an excellent mastery of the phraseology of the target language. Even experienced professionals sometimes fail to detect the fixed or semi-fixed character of a source text construction. 
We argue that algorithms derived from text mining and information retrieval techniques can be efficient and (computationally) cost-effective in order to build up unfiltered collections of recurrent fixed or semi-fixed phrases, from which translators could gain information about the number of PUs in the source text. Such an algorithm has been proposed in Colson (2016) and has been implemented in a web application enabling translators and language professionals to automatically retrieve most PUs from a source text. Other tools should be developed in order to bridge the gap between the findings of computational phraseology and the practice of translation and interpreting. 10 01 JB code ivitra.24.05wah 83 110 28 Chapter 7 <TitleType>01</TitleType> <TitleText textformat="02">Computational extraction of formulaic sequences from corpora</TitleText> <Subtitle textformat="02">Two case studies of a new extraction algorithm</Subtitle> 1 A01 Alexander Wahl Wahl, Alexander Alexander Wahl Donders Institute for Brain, Cognition and Behaviour, Radboud University 2 A01 Stefan Th. Gries Gries, Stefan Th. Stefan Th. Gries University of California Santa Barbara/Justus Liebig University 20 adjusted frequency list 20 child language 20 collocation extraction 20 formulaic sequences 20 lexical association 20 MERGE 01 We describe a new algorithm for the extraction of formulaic language from corpora. Entitled MERGE (Multi-word Expressions from the Recursive Grouping of Elements), it iteratively combines adjacent bigrams into progressively longer sequences based on lexical association strengths. We then provide empirical evidence for this approach via two case studies. First, we compare the performance of MERGE to that of another algorithm by examining the outputs of the approaches compared with manually annotated formulaic sequences from the spoken component of the British National Corpus. 
Second, we employ two child language corpora to examine whether MERGE can predict the formulas that the children learn based on caregiver input. Ultimately, we show that MERGE indeed performs well, offering a powerful approach for the extraction of formulas. 10 01 JB code ivitra.24.06ram 111 134 24 Chapter 8 <TitleType>01</TitleType> <TitleText textformat="02">Computational phraseology discovery in corpora with the mwetoolkit</TitleText> 1 A01 Carlos Ramisch Ramisch, Carlos Carlos Ramisch 20 association scores 20 automatic phraseology discovery 20 morphosyntactic patterns 20 mwetoolkit 20 phraseological units 01 Computer tools can help discover new phraseological units in corpora, thanks to their ability to quickly draw statistics from large amounts of textual data. While the research community has focused on developing and evaluating original algorithms for the automatic discovery of phraseological units, little has been done to transform these sophisticated methods into usable software. In this chapter, we present a brief survey of the main approaches to computational phraseology available. Furthermore, we provide worked-out examples of how to apply these methods using the mwetoolkit, free software for the discovery and identification of multiword expressions. The usefulness of the automatically extracted units depends on various factors such as language, corpus size, target units, and available taggers and parsers. Nonetheless, the mwetoolkit allows fine-grained tuning so that this variability is taken into account, adapting the tool to the specificities of each lexicographic environment. 10 01 JB code ivitra.24.07dur 135 150 16 Chapter 9 <TitleType>01</TitleType> <TitleText textformat="02">Multiword expressions in comparable corpora</TitleText> 1 A01 Peter Ďurčo Ďurčo, Peter Peter Ďurčo University of SS. 
Cyril and Methodius in Trnava 20 comparable corpora 20 compatible Sketch Grammars 20 multiword expressions 20 universal tagset 01 On the basis of Aranea Gigaword Web corpora, a family of comparable corpora intended for use in contrastive linguistic research, multilingual lexicography, language teaching and translation studies we discuss the pros and cons of comparable corpora in contrast to monolingual and parallel corpora for the analysis of multiword entities (MWEs). We demonstrate that by using large corpora for two or more languages, consisting of unrelated texts, yet created in a comparable manner, parallel language structures and phenomena like MWEs can be identified if the appropriate tools are employed. With the Aranea corpora, the “bilingual sketch” functionality of the Sketch Engine is one such tool which provides a new approach for analyses of similarities of (or differences between) collocation profiles (word sketches) for words and their translation equivalents. 10 01 JB code ivitra.24.08lho 151 176 26 Chapter 10 <TitleType>01</TitleType> <TitleText textformat="02">Collecting collocations from general and specialised corpora</TitleText> <Subtitle textformat="02">A comparative analysis</Subtitle> 1 A01 Marie-Claude L'Homme L'Homme, Marie-Claude Marie-Claude L'Homme Observatoire de linguistique Sens-Texte, Université de Montréal 2 A01 Daphnée Azoulay Azoulay, Daphnée Daphnée Azoulay Observatoire de linguistique Sens-Texte, Université de Montréal 20 classe sémantique 20 Collocation 20 Collocations 20 corpus général 20 corpus spécialisé 20 general corpus 20 lexicographie 20 lexicography 20 semantic class 20 specialised corpus 20 terminologie 20 terminology 01 Collocations are increasingly taken into account in general and specialised repositories and methodologies to collect them are heavily based on corpora. 
However, lexicographers and terminologists use different kinds of corpora in which combinations are likely to behave according to specific rules and/or patterns. This contribution presents a comparative analysis of the collocational behaviour of 15 lexical items found in a general language corpus and a specialised corpus on the theme of the environment. We automatically extracted large sets of collocates (three lists of 50 collocates) for each lexical item and from each corpus and analysed different facets of collocational behaviour: polysemy of lexical items, characteristics of collocates (overlap, rank and semantic classes of collocates, etc.). Our aim is to draw the attention of terminologists and lexicographers to some specific factors affecting the behaviour of collocations in specialised and general corpora. 10 01 JB code ivitra.24.09mit 177 188 12 Chapter 11 <TitleType>01</TitleType> <TitleText textformat="02">What matters more: The size of the corpora or their quality?</TitleText> <Subtitle textformat="02">The case of automatic translation of multiword expressions using comparable corpora</Subtitle> 1 A01 Ruslan Mitkov Mitkov, Ruslan Ruslan Mitkov University of Wolverhampton 2 A01 Shiva Taslimipoor Taslimipoor, Shiva Shiva Taslimipoor University of Wolverhampton 20 automatic translation 20 comparable corpora 20 multiword expressions 20 size of corpora 20 vector representations 01 This study investigates (and compares) the impact of the size and the similarity/quality of comparable corpora on the specific task of extracting translation equivalents of verb-noun collocations from such corpora. The comprehensive evaluation of different configurations of English and Spanish corpora sheds some light on the more general and perennial question: what matters more – the quantity or quality of corpora? 
10 01 JB code ivitra.24.10oak 189 206 18 Chapter 12 <TitleType>01</TitleType> <TitleText textformat="02">Statistical significance for measures of collocation strength</TitleText> 1 A01 Michael P. Oakes Oakes, Michael P. Michael P. Oakes University of Wolverhampton 20 collocation strength 20 Monte Carlo Methods 20 Poisson Distribution 20 statistical significance 01 Of the commonly-used measures of lexical association or collocation strength, only some directly relate to statistical significance: the t-score, chi-squared, log-likelihood, the z-score and Fisher’s exact test. We describe each of these tests, and also describe a computer simulation by which we can derive confidence limits, and hence the statistical significance, of any measure of lexical association which is derived from the contingency table. We illustrate this approach using pointwise mutual information (PMI). We also describe how the Poisson distribution enables us to find the statistical significance of the raw frequency with which a collocation is found. We compare all these methods using collocates of “take”, namely “take up”, “take place”, “take advantage” and “take stock”. 10 01 JB code ivitra.24.11weh 207 224 18 Chapter 13 <TitleType>01</TitleType> <TitleText textformat="02">Verbal collocations and pronominalisation</TitleText> 1 A01 Eric Wehrli Wehrli, Eric Eric Wehrli University of Geneva 2 A01 Violeta Seretan Seretan, Violeta Violeta Seretan University of Geneva 3 A01 Luka Nerima Nerima, Luka Luka Nerima University of Geneva 20 anaphora resolution 20 collocation 20 deep parsing 20 multiword expressions 20 pronominalisation 01 Precise identification of multiword expressions (MWEs) is an important qualitative step for several NLP applications, including machine translation. Since most MWEs cannot be translated literally, failure to identify them yields, at best, inaccurate translation. 
While some expressions are completely frozen and thus can be listed as compound words, others display a sometimes very large degree of syntactic flexibility. <br />In this chapter, we argue not only that structural information is necessary for an adequate treatment of collocations, but also that the detection of collocations can be useful for the parser. For instance, it is very useful for solving part-of-speech ambiguities and also some attachment ambiguities. We therefore claim that collocation identification and parsing are interrelated processes. <br />Section 2 describes the two processes of parsing and collocation detection and their interaction, (i) when and how the collocation identification process is triggered during parsing, and (ii) how the identification of a collocation helps the parser. In Section 3 we describe how anaphora resolution has been implemented in our parsing system, to handle cases where the antecedent and the pronoun are within the same sentence or in adjacent sentences. Section 4 focuses on more intricate cases of verbal collocations where their nominal element has been pronominalised, in the form of a relative pronoun or a personal pronoun. Verb-object collocations with a relative pronoun are extremely frequent and relatively easy to handle for a “deep” parser. In most cases, the relative clause is directly attached to the noun which is part of the collocation. Collocations in which the nominal element takes the form of a personal pronoun are much harder to deal with, as they depend on the process of anaphora resolution, a very challenging task. The last section describes an evaluation of the collocation detection procedure, enhanced with anaphora resolution using a corpus of newspaper articles of about 10 million words. 
10 01 JB code ivitra.24.12squ 225 246 22 Chapter 14 <TitleType>01</TitleType> <TitleText textformat="02">Empirical variability of Italian multiword expressions as a useful feature for their categorisation</TitleText> 1 A01 Luigi Squillante Squillante, Luigi Luigi Squillante Sapienza - Università di Roma 20 categorisation 20 collocation 20 multiword expressions 20 PAISÀ corpus 20 semantic variation 01 In contemporary linguistics the definition of those entities which are referred to as multiword expressions (MWEs) remains controversial. It is intuitively clear that some words, when appearing together, have some “special bond” in terms of meaning (e.g. black hole, mountain chain), or lexical choice (e.g. strong tea, to fill a form), contrary to free combinations. Nevertheless, the great variety of features and anomalous behaviours that these expressions exhibit makes it difficult to organise them into categories and gives rise to a great amount of different and sometimes overlapping terminology. <br />So far, most approaches in corpus linguistics have focused on trying to automatically extract MWEs from corpora by using statistical association measures, while theoretical aspects related to their definition, typology and behaviours arising from quantitative corpus-based studies have not been widely explored, especially for languages with a rich morphology and relatively free word order, such as Italian. <br />This contribution attests that a systematic analysis of the empirical behaviour of Italian MWEs in large corpora, with respect to several parameters, such as syntactic and lexical variations, is useful for outlining a categorisation of the expressions in homogeneous sets which approximately correspond to what is intuitively known as multiword units (“polirematiche” in the Italian lexicographic tradition) and lexical collocations. 
The importance of this kind of approach is that the resulting categorisation of MWEs is grounded on empirical data rather than relying on intuitive and not-always-coherent linguistic definitions. <br />The variational features taken into account are (1) the possibility for the expressions to be syntactically transformed, and (2) the possibility for one of the components to be replaced with a synonym. These features can be automatically and quantitatively investigated using <i>ad hoc</i> designed tools, whose methodology is fully explained, if an annotated corpus and a list of expressions are provided. It is possible to show that the kind of attested variations and the magnitude of variation appear highly correlated to the grammatical structure of a given phrase, indicating that the bond between the components for a multiword unit or a lexical collocation can be formed by activating different kinds of restrictions, depending on the considered grammatical pattern. 10 01 JB code ivitra.24.13ste 247 272 26 Chapter 15 <TitleType>01</TitleType> <TitleText textformat="02"><i>Too big to fail</i> but <i>big enough to pay for their mistakes</i></TitleText> <Subtitle textformat="02">A collostructional analysis of the patterns [ <i>too</i> ADJ <i>to</i> V] and [ADJ <i>enough to</i> V]</Subtitle> 1 A01 Anatol Stefanowitsch Stefanowitsch, Anatol Anatol Stefanowitsch Freie Universität Berlin 2 A01 Susanne Flach Flach, Susanne Susanne Flach Université de Neuchâtel 20 association 20 collocations 20 Collostructional Analysis 20 collostructions 20 Co-Varying Collexeme Analysis 20 Distinctive Collexeme Analysis 20 Distinctive Co-varying Collexeme Analysis 20 Simple Collexeme Analysis 01 In this paper, we illustrate the usefulness of the family of methods collectively known as Collostructional Analysis for phraseological research. 
Investigating two patterns, [<i>too</i> ADJ <i>to</i> V] and [ADJ <i>enough to</i> V], we show how a technique originally developed for the investigation of words and constructions can be fruitfully applied to issues pertinent to phraseology, such as the co-existence of compositional and idiomatic semantics and the analysis of semantically complementary patterns more generally. To this end, we use the three conventional methods (Simple, Distinctive and Co-varying Collexeme Analyses) and propose a novel extension (Distinctive Co-varying Collexeme Analysis) particularly suitable for the investigation of complementary patterns. We show that collostructional analysis is suitable for confirming hypotheses derived from qualitative analyses, as well as uncovering subtle differences that are otherwise inaccessible for non-empirical research. 10 01 JB code ivitra.24.14ste 273 296 24 Chapter 16 <TitleType>01</TitleType> <TitleText textformat="02">Multi-word patterns and networks</TitleText> <Subtitle textformat="02">How corpus-driven approaches have changed our description of language use</Subtitle> 1 A01 Kathrin Steyer Steyer, Kathrin Kathrin Steyer Institut für Deutsche Sprache 20 German reference corpus 20 language fixedness 20 multiword expressions 20 pattern-based lexicography 20 phraseology 01 This paper discusses a theoretical and empirical approach to language fixedness that we have developed at the Institut für Deutsche Sprache (IDS) (‘Institute for German Language’) in Mannheim in the project Usuelle Worterbindungen (UWV) over the last decade. The analysis described is based on the Deutsches Referenzkorpus (‘German Reference Corpus’; DeReKo) which is located at the IDS. The corpus analysis tool used for accessing the corpus data is COSMAS II (CII) and – for statistical analysis – the IDS collocation analysis tool (Belica, 1995; CA). 
For detecting lexical patterns and describing their semantic and pragmatic nature we use the tool lexpan (or ‘Lexical Pattern Analyzer’) that was developed in our project. We discuss a new corpus-driven pattern dictionary that is relevant not only to the field of phraseology, but also to usage-based linguistics and lexicography as a whole. 10 01 JB code ivitra.24.15han 297 310 14 Chapter 17 <TitleType>01</TitleType> <TitleText textformat="02">How context determines meaning</TitleText> 1 A01 Patrick Hanks Hanks, Patrick Patrick Hanks RIILP, University of Wolverhampton (WLV) and BCL, University of the West of England (UWE) 20 collocation 20 corpus pattern analysis (CPA) 20 lexical sets 20 meaning potential 20 valency 01 It is an extraordinary fact that, although most speakers and writers of the English language (or, we may presume, any other language) believe that they are capable of expressing any meaning that they want to with considerable precision, the behaviour of the words they use is highly variable, with much variation in phraseology as well as subtle semantic distinctions. Even more extraordinary is the fact that only some of the logically predictable variants of any given phrase are accepted by native speakers as idiomatic. <br />This chapter shows how meanings are associated with phraseological norms rather than with words in isolation. It also illustrates the phenomenon of alternation among phraseological norms and shows how phraseological norms are not merely conformed to, but also exploited creatively in ordinary language use. Underlying this paper is the proposition that words in isolation do not have a determinable meaning per se. Instead they have <b>meaning potential</b>, different facets of which are activated in different contexts. 
<br />By detailed corpus pattern analysis of the verb <i>blow</i>, which typically expresses the causation of movement, we explore the relationship between core meaning and a rich set of patterns of idiomatic phraseology – phrasal verbs, idioms, and proverbs. 10 01 JB code ivitra.24.16tas 311 324 14 Chapter 18 <TitleType>01</TitleType> <TitleText textformat="02">Detecting semantic difference</TitleText> <Subtitle textformat="02">A new model based on knowledge and collocational association</Subtitle> 1 A01 Shiva Taslimipoor Taslimipoor, Shiva Shiva Taslimipoor Research Group in Computational Linguistics, University of Wolverhampton 2 A01 Gloria Corpas Pastor Corpas Pastor, Gloria Gloria Corpas Pastor Research Group in Computational Linguistics, University of Wolverhampton/University of Malaga 3 A01 Omid Rohanian Rohanian, Omid Omid Rohanian Research Group in Computational Linguistics, University of Wolverhampton 20 association measures 20 collocation 20 Concept-Net relations 20 n-gram counts 20 semantic difference 20 semantic modelling 20 word2vec 01 Semantic discrimination among concepts is a daily exercise for humans when using natural languages. For example, given the words <i>airplane</i> and <i>car</i>, the word <i>flying</i> can easily be thought of and used as an attribute to differentiate them. In this study, we propose a novel automatic approach to detect whether an attribute word represents the difference between two given words. We exploit a combination of knowledge-based and co-occurrence features (collocations) to capture the semantic difference between two words in relation to an attribute. The features are scores that are defined for each pair of words and an attribute, based on association measures, n-gram counts, word similarity, and Concept-Net relations. Based on these features we designed a system and ran several experiments on a SemEval-2018 dataset. 
The experimental results indicate that the proposed model performs better than, or at least comparably to, other systems evaluated on the same data for this task. 10 01 JB code ivitra.24.index 325 327 3 Miscellaneous 19 <TitleType>01</TitleType> <TitleText textformat="02">Index</TitleText> 02 JBENJAMINS John Benjamins Publishing Company 01 John Benjamins Publishing Company Amsterdam/Philadelphia NL 04 20200508 2020 John Benjamins B.V. 02 WORLD 08 735 gr 01 JB 1 John Benjamins Publishing Company +31 20 6304747 +31 20 6739773 bookorder@benjamins.nl 01 https://benjamins.com 01 WORLD US CA MX 21 83 20 01 02 JB 1 00 99.00 EUR R 02 02 JB 1 00 104.94 EUR R 01 JB 10 bebc +44 1202 712 934 +44 1202 712 913 sales@bebc.co.uk 03 GB 21 20 02 02 JB 1 00 83.00 GBP Z 01 JB 2 John Benjamins North America +1 800 562-5666 +1 703 661-1501 benjamins@presswarehouse.com 01 https://benjamins.com 01 US CA MX 21 1 20 01 gen 02 JB 1 00 149.00 USD