Nested term recognition driven by word connection strength
Domain corpora are often small, and even important terms may occur in them only inside more complex constructions rather than as isolated maximal phrases. Appropriate recognition of nested terms can therefore affect both the content of the extracted candidate term list and its ordering. We propose a new method for identifying nested terms that combines two criteria: grammatical correctness and normalised pointwise mutual information (NPMI) computed for all bigrams in a given corpus. NPMI is typically used to recognise strong word connections; in our solution we instead use it to find the weakest connection in a phrase, which indicates the best place to divide the phrase into two parts. Because each step creates at most two nested phrases, the method yields a binary term structure. We test the impact of the proposed method, applied together with the C-value ranking method, on the automatic term recognition task performed on three corpora, two in Polish and one in English.
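The splitting idea can be illustrated with a minimal sketch. It assumes the standard NPMI definition, npmi(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)), with all probabilities estimated as counts divided by the total number of tokens. The function names (`npmi_table`, `weakest_split`, `binary_structure`), the toy corpus, and the score given to unseen bigrams are hypothetical. The grammatical-correctness check that the method also relies on is omitted here, so this is not the authors' implementation.

```python
import math
from collections import Counter


def npmi_table(sentences):
    """Return a dict mapping each adjacent word pair to its NPMI score.

    Probabilities are relative frequencies over the total token count
    (an assumption of this sketch), which keeps scores within [-1, 1].
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / total
        p_x = unigrams[x] / total
        p_y = unigrams[y] / total
        table[(x, y)] = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
    return table


def weakest_split(phrase, table, unseen=-1.0):
    """Return index i so that phrase[:i] / phrase[i:] cuts the weakest bigram.

    Bigrams missing from the table get the score `unseen` (hypothetical choice).
    Returns None for phrases shorter than two words.
    """
    if len(phrase) < 2:
        return None
    scores = [table.get(pair, unseen) for pair in zip(phrase, phrase[1:])]
    return scores.index(min(scores)) + 1


def binary_structure(phrase, table):
    """Recursively split at the weakest link, producing nested 2-tuples."""
    if len(phrase) == 1:
        return phrase[0]
    i = weakest_split(phrase, table)
    return (binary_structure(phrase[:i], table),
            binary_structure(phrase[i:], table))


if __name__ == "__main__":
    corpus = [
        ["acute", "renal", "failure"],
        ["renal", "failure"],
        ["renal", "failure"],
        ["acute", "pain"],
    ]
    scores = npmi_table(corpus)
    print(binary_structure(["acute", "renal", "failure"], scores))
```

In this toy corpus "renal failure" always occurs as a unit, so its NPMI is maximal and the phrase "acute renal failure" is cut between "acute" and "renal". A fuller version would accept a split only if both resulting parts are grammatically well-formed phrases, as the abstract requires.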