Recognition of irrelevant phrases in automatically extracted lists of domain terms
In this paper, we address the problem of recognizing irrelevant phrases in terminology lists produced by an automatic term extraction tool. We focus on identifying multi-word phrases that are general terms or discourse expressions. We defined several methods based on the comparison of domain corpora and one method based on the contexts of phrases identified in a large general-language corpus. The methods were tested on Polish data, using six domain corpora and one general corpus. Two test sets were prepared to evaluate the methods. The first consisted of many presumably irrelevant phrases, as we selected phrases that occurred in at least three domain corpora. The second consisted mainly of domain terms, as it was composed of the top-ranked phrases automatically extracted from the analyzed domain corpora.
The results show that the task is quite hard, as the inter-annotator agreement is low. Several of the tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 on half of the tested list, was the context-based method using a modified contextual diversity coefficient.
Although the methods were tested on Polish, they seem to be language independent.
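The selection criterion for the first test set (phrases occurring in at least three domain corpora) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `phrases_in_many_corpora`, the `min_corpora` parameter, and the English toy phrases are assumptions made purely for illustration, and real input would be the phrase lists produced by a term extraction tool for each domain corpus.

```python
from collections import Counter
from typing import Dict, Iterable, List


def phrases_in_many_corpora(
    corpus_phrases: Dict[str, Iterable[str]], min_corpora: int = 3
) -> List[str]:
    """Return phrases that occur in at least `min_corpora` domain corpora."""
    counts: Counter = Counter()
    for phrases in corpus_phrases.values():
        # Count each phrase once per corpus, regardless of its frequency there.
        counts.update(set(phrases))
    return sorted(p for p, n in counts.items() if n >= min_corpora)


# Hypothetical toy data: English stand-ins for extracted phrase lists.
toy = {
    "medicine": ["blood pressure", "large number", "next step"],
    "law": ["civil code", "large number", "next step"],
    "economics": ["interest rate", "large number", "next step"],
    "music": ["string quartet", "large number"],
}
candidates = phrases_in_many_corpora(toy)
print(candidates)
```

Phrases that pass such a filter are only candidates for irrelevance. As the abstract notes, they are "presumably" irrelevant, so manual annotation is still needed to confirm them.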
Article outline
- 1. Introduction
- 2. Terminology extraction
- 3. Domain corpora
- 4. Term selection based on domain corpora
- Method I. Co-occurrence in multiple corpora
- Method II, IIa. C-value standard deviation based weighting
- Method III. Penalization for not occurring in other corpora
- II+III, IIa+III. Second order methods
- 5. Term selection based on term contexts in a general corpus
- 5.1 Context diversity coefficient
- 5.2 Boosting lists of irrelevant phrases by adding similar ones
- 6. Evaluation
- 6.1 Evaluation data
- 6.2 Results
- 7. Conclusions
- Acknowledgements
- Notes
- References