Nested term recognition driven by word connection strength
Domain corpora are often small, and even important terms may occur in them not as isolated maximal phrases but only nested within more complex constructions. Appropriate recognition of nested terms can therefore affect both the content and the ordering of the extracted list of candidate terms. We propose a new method for identifying nested terms that combines two criteria: grammatical correctness and normalised pointwise mutual information (NPMI) computed for all bigrams in a given corpus. NPMI is typically used to recognise strong word connections; in our solution we instead use it to find the weakest connection in a phrase, which suggests the best place to divide the phrase into two parts. Because each step creates at most two nested phrases, the method yields a binary term structure. We test the impact of the proposed method, applied together with the C-value ranking method, on the automatic term recognition task performed on three corpora, two in Polish and one in English.
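The splitting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`npmi_table`, `weakest_split`, `binary_structure`), the toy corpus, and the score assigned to unseen bigrams are all assumptions made for illustration, and the grammatical-correctness check that the method combines with NPMI is omitted entirely. NPMI is computed here in its standard form, pmi(x, y) / -log p(x, y), with all probabilities estimated against the total token count.

```python
import math
from collections import Counter


def npmi_table(sentences):
    """Map each adjacent word pair in the corpus to its NPMI score.

    Unigram and bigram probabilities share one denominator (the token
    count), which keeps every score within [-1, 1].
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n = sum(unigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n
        pmi = math.log(p_xy / ((unigrams[x] / n) * (unigrams[y] / n)))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores


def weakest_split(phrase, scores, unseen=-1.0):
    """Split a phrase (list of words) at its lowest-NPMI adjacent pair.

    `unseen` is an assumed default for pairs absent from the corpus.
    Returns None when the phrase is too short to split.
    """
    if len(phrase) < 2:
        return None
    links = [scores.get(pair, unseen) for pair in zip(phrase, phrase[1:])]
    i = min(range(len(links)), key=links.__getitem__)
    return phrase[:i + 1], phrase[i + 1:]


def binary_structure(phrase, scores):
    """Recursively split a phrase into a nested binary tuple structure."""
    if len(phrase) == 1:
        return phrase[0]
    left, right = weakest_split(phrase, scores)
    return (binary_structure(left, scores), binary_structure(right, scores))


# Toy corpus (hypothetical): each sentence is a list of tokens.
corpus = [
    ["blood", "pressure"],
    ["blood", "pressure"],
    ["high", "blood", "pressure"],
    ["high", "fever"],
    ["high", "risk"],
]
scores = npmi_table(corpus)
print(binary_structure(["high", "blood", "pressure"], scores))
# prints ('high', ('blood', 'pressure'))
```

In this toy corpus "blood" and "pressure" always occur together, so their link is strong, while "high" also combines with other words; the phrase is therefore divided after "high", leaving "blood pressure" as the nested candidate. In the method described above, such a split would additionally have to yield grammatically correct parts.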