Nested term recognition driven by word connection strength
Domain corpora are often small, and even important terms may occur in them only inside more complex constructions rather than as isolated maximal phrases. Correct recognition of nested terms can therefore affect both the content and the order of the extracted candidate term list. We propose a new method for identifying nested terms that combines two criteria: grammatical correctness and normalised pointwise mutual information (NPMI) computed for all bigrams in a given corpus. NPMI is typically used to recognise strong word connections; in our solution we instead use it to find the weakest connection in a phrase, which suggests the best place to divide the phrase into two parts. Because each step creates at most two nested phrases, the method yields a binary term structure. We test the impact of the proposed method, applied together with the C-value ranking method, on the automatic term recognition task for three corpora, two in Polish and one in English.
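The splitting idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`npmi_table`, `weakest_split`) are hypothetical, probabilities are estimated from raw counts with the total token count as a common denominator, unseen bigrams get a default score of -1.0, and the grammatical-correctness check described above is omitted entirely. NPMI is taken in its usual form, ln(p(x,y) / (p(x) p(y))) / (-ln p(x,y)).

```python
import math
from collections import Counter


def npmi_table(sentences):
    """Compute NPMI for every adjacent word pair in a tokenised corpus.

    Assumption: unigram and bigram probabilities share one denominator
    (the total number of tokens), which keeps the score within [-1, 1].
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / total
        p_x = unigrams[x] / total
        p_y = unigrams[y] / total
        # p_xy < 1 always holds here, so the denominator is never zero.
        table[(x, y)] = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
    return table


def weakest_split(phrase, table, unseen=-1.0):
    """Divide a phrase (list of words) at its lowest-NPMI bigram.

    Returns the two parts; ties are resolved in favour of the leftmost
    bigram. The `unseen` default for missing bigrams is an assumption.
    """
    scores = [table.get(pair, unseen) for pair in zip(phrase, phrase[1:])]
    weakest = min(range(len(scores)), key=scores.__getitem__)
    return phrase[:weakest + 1], phrase[weakest + 1:]
```

Applying `weakest_split` again to each part that still has two or more words would give the binary structure mentioned in the abstract; in the described method a candidate split would additionally have to yield grammatically correct phrases.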