The underpinnings of a composite measure for automatic term extraction: The case of SRC

Periñán-Pascual, Carlos

doi:10.1075/term.21.2.02per

Article published in:

Terminology across Languages and Domains
Edited by Patrick Drouin, Natalia Grabar, Thierry Hamon and Kyo Kageura
[Terminology 21:2] 2015
pp. 151–179

The corpus-based identification of the lexical units that describe a given specialized domain is usually a complex task, in which an analysis based on word frequency and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded in the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric.

Keywords: SRC, cohesion, automatic term extraction, relevance, salience

Published online: 31 December 2015

Cited by (2)

Felices Lago, Ángel M. & Pedro Ureña Gómez-Moreno. 2020. Conceptualización de entidades terminológicas en una subontología de derecho penal: análisis del concepto superordinado +DRUG_00 en FunGramKB. Revista de Lingüística y Lenguas Aplicadas 15:1, pp. 15 ff.

Periñán-Pascual, Carlos. 2018. DEXTER: A workbench for automatic term extraction with specialized corpora. Natural Language Engineering 24:2, pp. 163 ff.
