Automatic extraction of specialized verbal units
A comparative study on Arabic, English and French
This paper presents a methodology for the automatic extraction of specialized
Arabic, English and French verbs in the field of computing. Since nominal terms
are predominant in terminology, our interest is to explore to what extent verbs
can also be part of a terminological analysis. Hence, our objective is to assess
how an existing extraction tool performs on specialized verbs in a given
specialized domain. Furthermore, we want to investigate any particularities
that a language may present regarding verbal terms from the perspective of
automatic extraction. Our choice to work on three different languages reflects
our desire to see whether the chosen tool performs better on one language than
on the others. Moreover, given that Arabic is a morphologically rich and complex
language, we pay particular attention to the results the extraction tool yields
for it. The extractor used for our experiment is TermoStat (Drouin 2003). So
far, our results show that the extraction of verbs of computing differs across
the languages under study, both in the quality of the results and in the
particularities of these units in this specialized domain.
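To make the idea of extracting domain-specific verbs by corpus comparison more concrete, the following is a minimal sketch only. It does not reproduce TermoStat's own statistic, which is not described in this abstract; as a stand-in it uses the log-likelihood frequency-profiling measure of Rayson and Garside (2000). The function names, the `"VERB"` tag, the `(lemma, pos)` input format and the 3.84 threshold (the chi-square critical value at p < 0.05 with one degree of freedom) are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood score for one lemma (stand-in measure, not TermoStat's).

    a: frequency in the specialized corpus
    b: frequency in the reference corpus
    c: size (in tokens) of the specialized corpus
    d: size (in tokens) of the reference corpus
    """
    # Expected frequencies if the lemma were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def specific_verbs(specialized, reference, threshold=3.84):
    """Return verbs over-represented in the specialized corpus.

    Both corpora are lists of (lemma, pos) pairs, i.e. they are assumed
    to be already tokenized, lemmatized and POS-tagged.
    """
    spec = Counter(lemma for lemma, pos in specialized if pos == "VERB")
    ref = Counter(lemma for lemma, pos in reference if pos == "VERB")
    c, d = len(specialized), len(reference)
    candidates = []
    for lemma, a in spec.items():
        b = ref[lemma]  # Counter returns 0 for unseen lemmas
        # Keep only lemmas whose relative frequency is higher in the
        # specialized corpus, then apply the score threshold.
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                candidates.append((lemma, score))
    return sorted(candidates, key=lambda item: -item[1])


# Toy, hand-made data (hypothetical, not taken from the study).
specialized = [("compile", "VERB")] * 20 + [("be", "VERB")] * 10 + [("file", "NOUN")] * 70
reference = [("compile", "VERB")] * 1 + [("be", "VERB")] * 100 + [("house", "NOUN")] * 899
result = specific_verbs(specialized, reference)
print(result)
```

On this toy input, "compile" is retained as a candidate because it is far more frequent in the specialized corpus, while "be" is discarded because its relative frequency is the same in both corpora. Candidates produced this way would still need the manual filtering and validation steps listed in the outline.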
Article outline
- 1. Introduction
- 2. Previous work
- 3. Methodology
- 3.1 TermoStat: A general overview
- 3.2 Integrating Arabic language to TermoStat
- 3.3 Compiling specialized corpora
- 3.4 Pre-processing
- 3.5 Managing specificity
- 4. Results
- 4.1 Filtering results
- 4.1.1 Tagging errors
- 4.1.2 General language units
- 4.1.3 Concordance list
- 4.2 Results of filtering
- For Arabic
- For English
- For French
- 4.2.1 Specificity
- 4.2.2 Certain particularities
- 4.3 VTU validation criterion
- 4.3.1 Some particularities regarding validation: English and French
- 4.3.2 Some particularities regarding validation: Arabic
- 5. Evaluation and discussion
- 5.1 Precision
- 5.2 Comparison between VTUs and NTUs
- 6. Conclusion
- Notes
- References
References (45)
Abed, A. M., S. Tiun, and M. Abared. 2013. “Arabic Term Extraction Using Combined Approach on Islamic
Document.” Journal of Theoretical & Applied Information Technology 58 (3): 601–608.
Almaany. 2017. [URL]. Accessed 30 March 2017.
Attia, M., P. Pecina, A. Toral, L. Tounsi, and J. van Genabith. 2011. “A Lexical Database for Modern Standard Arabic Interoperable with
a Finite State Morphological Transducer.” In Proceedings Systems and Frameworks for Computational Morphology: Second
International Workshop, SFCM 2011, Zurich, Switzerland, August 26,
2011, ed. by M. Cerstin and M. Piotrowski, 98–118. Zurich, Switzerland: Springer Berlin Heidelberg.
Attia, M., P. Pecina, A. Toral, L. Tounsi, and J. van Genabith. 2011a. “An Open-Source Finite State Morphological Transducer for Modern
Standard Arabic.” In Proceedings of the 9th International Workshop on Finite State Methods
and Natural Language Processing, 125–133. Blois, France: Association for Computational Linguistics.
Attia, M., P. Pecina, A. Toral, and J. van Genabith. 2013. “A Corpus-Based Finite-State Morphological Toolkit for
Contemporary Arabic.” Journal of Logic and Computation 24 (2): 455–472.
Church, K., and P. Hanks. 2002. “Word Association Norms, Mutual Information, and
Lexicography.” Computational Linguistics 16 (1): 22–29.
Déjean, H., and E. Gaussier. 2002. “Une nouvelle approche à l’extraction de lexiques bilingues à
partir de corpus comparables.” In Corpus Linguistics: Critical Concepts in Linguistics, ed. by W. Teubert and R. Krishnamurthy, 1–22. New York: Routledge.
DiCoInfo. 2017. [URL]. Accessed 30 March 2017.
Drouin, P. 2002. Acquisition automatique des termes: l’utilisation des pivots lexicaux
spécialisés. Doctoral thesis. Université de Montréal.
Drouin, P. 2004. “Detection of Domain Specific Terminology Using Corpora
Comparison.” In Proceedings of the Fourth International Conference on Language Resources
and Evaluation (LREC), 79–82. Lisbon, Portugal: ELRA – European Language Resources Association.
Fung, P. 1998. “A Statistical View on Bilingual Lexicon Extraction: from Parallel
Corpora to Non-Parallel Corpora.” In The 3rd Conference of the Association for Machine Translation in the
Americas (AMTA’98), 1–17. Langhorne, PA, USA: Springer Berlin Heidelberg.
Galisson, R. 1978. Recherches de lexicologie descriptive: la banalisation lexicale. Paris: University of Montréal.
Ghazzawi, N. 2016. Du terme prédicatif au cadre sémantique: méthodologie de compilation
d’une ressource terminologique pour les termes arabes de
l’informatique. Doctoral thesis. University of Montréal.
Guilbert, L. 1973. “La spécificité du terme scientifique et technique.” Langue française (171): 5–17.
Habash, N., and F. Sadat. 2006. “Arabic Preprocessing Schemes for Statistical Machine
Translation.” In Proceedings of the Human Language Technology Conference of the NAACL,
Companion Volume: Short Papers, 49–52. New York, USA: Association for Computational Linguistics.
Habash, N., O. Rambow, and R. Roth. 2009. “MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization,
Morphological Disambiguation, POS Tagging, Stemming and
Lemmatization.” In Proceedings of the 2nd International Conference on Arabic Language
Resources and Tools (MEDAR), 102–109. Cairo, Egypt.
Habash, N. 2010. “Introduction to Arabic Natural Language
Processing.” Synthesis Lectures on Human Language Technologies 3 (1): 1–187.
Kilgarriff, A. 2001. “Comparing Corpora.” International Journal of Corpus Linguistics 6 (1): 97–133.
Lafon, P. 1980. “Sur la variabilité de la fréquence des formes dans un
corpus.” Mots 1 (1): 127–165.
Lebart, L., and A. Salem. 1994. Statistique textuelle. Paris: Dunod.
L’Homme, M.-C. 2004. La terminologie: Principes et Techniques. Montréal, Canada: Les presses de l’université de Montréal.
L’Homme, M.-C. 2015. “Predicative Lexical Units in Terminology.” In Recent Advances in Language Production, Cognition and the
Lexicon, ed. by N. Gala, R. Rapp, and G. Bel-Enguix, 75–93. Switzerland: Springer.
Lorente, M. 2007. “Les unitats lèxiques verbals dels textos especialitzats.
Redefinició d’una proposta de classificació.” In Estudis de lingüístics i de lingüística aplicada en honor de M. Teresa
Cabré Castellví. Volum II: De deixebles, ed. by M. Lorente, R. Estopà, J. Freixa, J. Martí, and C. Tebé, 365–380. Barcelona: Institut Universitari de Lingüística Aplicada de la Universitat Pompeu Fabra.
Mel’čuk, I., A. Clas, and A. Polguère. 1995. Introduction à la lexicologie explicative et combinatoire. Louvain-la-Neuve: Duculot.
Meyer, I. 2000. “Computer Words in Our Everyday Lives: How are They Interesting
for Terminography and Lexicography?” In Proceedings of the Ninth EURALEX International Congress, EURALEX
2000, ed. by H. Ulrich, S. Evert, E. Lehmann, and C. Rohrer, 39–58. Stuttgart, Germany: Institut für Maschinelle Sprachverarbeitung.
Monsonego, S. 1969. “Ch. Muller: Étude de statistique lexicale. Le vocabulaire du
théâtre de P. Corneille.” Langue française 3 (1): 107–110.
Muller, C. 1967. Étude de statistique lexicale, le vocabulaire du théâtre de Pierre
Corneille. Paris: Larousse.
Muller, C. 1977. Principes et méthodes de statistique lexicale. Paris: Hachette.
Nelson, M. B. 2000. Corpus-based Study of the Lexis of Business English and Business
English Teaching Materials. Unpublished Ph.D Thesis, University of Manchester, Manchester.
Rapp, R. 1999. “Automatic Identification of Word Translations from Unrelated
English and German Corpora.” In Proceedings of the 37th Annual Meeting of the Association for
Computational Linguistics on Computational Linguistics, ed. by R. Dale and K. Church, 519–526. Stroudsburg, PA, USA: Association for Computational Linguistics.
Rayson, P., and R. Garside. 2000. “Comparing Corpora Using Frequency Profiling.” In Proceedings of the workshop on Comparing Corpora, 1–6. Stroudsburg, PA, USA: Association for Computational Linguistics.
Reppen, R. 2001. “Review of MONOCONC PRO and WORDSMITH TOOLS.” Language Learning & Technology 5 (3): 32–36.
Rey, A. 1979. La terminologie: noms et notions. Coll. “Que sais-je ?”. Paris: Presses universitaires de France.
Rondeau, G. 1984. Introduction à la terminologie. Chicoutimi, Québec: G. Morin.
Scott, M. 1997. “PC Analysis of Key Words – and Key Key Words.” System 25 (1): 233–345.
Teubert, W. 2009. “La linguistique de corpus: une alternative.” Semen. Revue de sémio-linguistique des textes et discours 271: 185–211.
Toutanova, K., and C. Manning. 2000. “Enriching the Knowledge Sources Used in a Maximum Entropy
Part-of-Speech Tagger.” In Proceedings of the Joint SIGDAT Conference on Empirical Methods in
Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), 63–70. Hong Kong: Association for Computational Linguistics.
Toutanova, K., D. Klein, C. D. Manning, and Y. Singer. 2003. “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency
Network.” In Proceedings of HLT-NAACL, 173–180. Edmonton, Canada: Association for Computational Linguistics.
Xu, F., D. Kurz, J. Piskorski, and S. Schmeier. 2002. “A Domain Adaptive Approach to Automatic Acquisition of Domain
Relevant Terms and their Relations with Bootstrapping.” In Proceedings of the Third International Conference on Language Resources
and Evaluation (LREC’02), ed. by M. González Rodríguez and C. Paz Suarez Araujo, 134–145. Las Palmas, Canary Islands, Spain: European Language Resources Association (ELRA).
Cited by (1)
Rigouts Terryn, Ayla, Véronique Hoste & Els Lefever. 2020. In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora. Language Resources and Evaluation 54 (2): 385 ff.