Computational phraseology discovery in corpora with the mwetoolkit

Ramisch, Carlos

doi:10.1075/ivitra.24.06ram

Part of

Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 111–134

Carlos Ramisch | Aix-Marseille Université

Computer tools can help discover new phraseological units in corpora, thanks to their ability to quickly compute statistics from large amounts of textual data. While the research community has focused on developing and evaluating original algorithms for the automatic discovery of phraseological units, little has been done to turn these sophisticated methods into usable software. In this chapter, we present a brief survey of the main available approaches to computational phraseology. Furthermore, we provide worked-out examples of how to apply these methods using the mwetoolkit, free software for the discovery and identification of multiword expressions. The usefulness of the automatically extracted units depends on various factors such as language, corpus size, target units, and available taggers and parsers. Nonetheless, the mwetoolkit allows fine-grained tuning so that this variability is taken into account, adapting the tool to the specificities of each lexicographic environment.

Computer tools can assist the discovery of new phraseological units in corpora thanks to their ability to quickly compute statistics from large volumes of textual data. While the research community has concentrated on developing and evaluating original algorithms for the automatic discovery of phraseological units, turning these sophisticated methods into usable software is often neglected. This chapter presents a brief summary of the main computational approaches available for discovering phraseological units. We will present detailed examples of applying these approaches with the mwetoolkit, free software for the discovery and identification of multiword units. The usefulness of automatically extracted units depends on several factors such as the language, corpus size, target units, and the available taggers and parsers. Nevertheless, the mwetoolkit allows fine-grained parameter tuning, so that this variability is taken into account when adapting the tool to each lexicographic environment.

Keywords: phraseological units, automatic phraseology discovery, morphosyntactic patterns, association scores, mwetoolkit
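
To make the keywords "morphosyntactic patterns" and "association scores" concrete, the following is a minimal sketch in plain Python, not the mwetoolkit's own interface. It assumes a tiny hand-tagged toy corpus (hypothetical data), keeps adjacent word pairs whose tags match an adjective-noun pattern, and ranks them by pointwise mutual information (PMI), one common association score. The function name `pmi_candidates` and the tag names are illustrative assumptions, and the token count is used as an approximation of the number of bigram positions.

```python
import math
from collections import Counter

# Toy POS-tagged corpus (hypothetical data): each token is a (word, tag) pair.
corpus = [
    [("the", "DET"), ("prime", "ADJ"), ("minister", "NOUN"), ("spoke", "VERB")],
    [("a", "DET"), ("prime", "ADJ"), ("minister", "NOUN"), ("resigned", "VERB")],
    [("the", "DET"), ("new", "ADJ"), ("minister", "NOUN"), ("spoke", "VERB")],
    [("the", "DET"), ("new", "ADJ"), ("law", "NOUN"), ("passed", "VERB")],
]


def pmi_candidates(sentences, pattern=("ADJ", "NOUN")):
    """Return {(w1, w2): PMI} for adjacent pairs whose tags match `pattern`."""
    word_counts = Counter()
    pair_counts = Counter()
    total = 0
    for sent in sentences:
        # Count every word occurrence in the corpus.
        for word, _tag in sent:
            word_counts[word] += 1
            total += 1
        # Count only the adjacent pairs that match the POS pattern.
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if (t1, t2) == pattern:
                pair_counts[(w1, w2)] += 1

    scores = {}
    for (w1, w2), observed in pair_counts.items():
        # Expected pair count if the two words were independent.
        expected = word_counts[w1] * word_counts[w2] / total
        scores[(w1, w2)] = math.log2(observed / expected)
    return scores


scores = pmi_candidates(corpus)
# Print candidates from the highest to the lowest PMI.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(pair), round(score, 3))
```

On this toy data the pair seen only once with a rare noun receives the highest PMI, which illustrates why such scores are usually combined with frequency thresholds and tuned to the corpus at hand.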

Article outline

1. Introduction
2. Computational phraseology discovery
- General architecture
- Freely available tools
3. The mwetoolkit
4. Phraseology discovery with the mwetoolkit
- 4.1 Candidate search patterns
- 4.2 Association scores
- 4.3 Other scores
5. Conclusions and open issues
Notes
References

Published online: 8 May 2020

https://doi.org/10.1075/ivitra.24.06ram

