Automatic extraction of specialized verbal units: A comparative study on Arabic, English and French

Ghazzawi, Nizar; Robichaud, Benoît; Drouin, Patrick; Sadat, Fatiha

doi:10.1075/term.00002.gha

Article published in: Terminology, Vol. 23:2 (2017), pp. 207–237

Nizar Ghazzawi | Université de Montréal

Benoît Robichaud | Université de Montréal

Patrick Drouin | Université de Montréal

Fatiha Sadat | Université du Québec à Montréal

This paper presents a methodology for the automatic extraction of specialized Arabic, English and French verbs in the field of computing. Since nominal terms predominate in terminology, we explore to what extent verbs can also be part of a terminological analysis. Our objective is therefore to assess how an existing extraction tool performs on specialized verbs in a given specialized domain. We also investigate any particularities that a language may present regarding verbal terms from the perspective of automatic extraction. We chose to work on three different languages in order to see whether the chosen tool performs better on one language than on the others. Moreover, given that Arabic is a morphologically rich and complex language, we pay particular attention to the results the extraction tool yields for it. The extractor used for our experiment is TermoStat (Drouin 2003). So far, our results show that the extraction of verbs of computing reveals certain differences between the languages in question, both in the quality of the extracted units and in their particularities in this specialized domain.

Keywords: specialized verbs, verbal terminological units, Arabic, French, English, terms extraction, terminology, corpus linguistics

Article outline

1. Introduction
2. Previous work
3. Methodology
- 3.1 TermoStat: A general overview
- 3.2 Integrating Arabic language to TermoStat
- 3.3 Compiling specialized corpora
- 3.4 Pre-processing
- 3.5 Managing specificity
4. Results
- 4.1 Filtering results
  - 4.1.1 Tagging errors
  - 4.1.2 General language units
  - 4.1.3 Concordance list
- 4.2 Results of filtering
  - For Arabic
  - For English
  - For French
  - 4.2.1 Specificity
  - 4.2.2 Certain particularities
- 4.3 VTU validation criterion
  - 4.3.1 Some particularities regarding validation: English and French
  - 4.3.2 Some particularities regarding validation: Arabic
5. Evaluation and discussion
- 5.1 Precision
- 5.2 Comparison between VTUs and NTUs
6. Conclusion
Notes
References

Published online: 19 January 2018

References

Abed, A. M., S. Tiun, and M. Abared. 2013. “Arabic Term Extraction Using Combined Approach on Islamic Document.” Journal of Theoretical & Applied Information Technology 58 (3): 601–608.

Ahmad, K., A. Davies, H. Fulford, and M. Rogers. 1994. “What is a term? The Semi-automatic Extraction of Terms from Text.” In Translation Studies: An Interdiscipline, ed. by M. Snell-Hornby, F. Pöchhacker, and K. Kaindl, 267–278. Amsterdam: John Benjamins.

Almaany. 2017. [URL]. Accessed 30 March 2017.

Attia, M., P. Pecina, A. Toral, L. Tounsi, and J. van Genabith. 2011. “A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer.” In Proceedings Systems and Frameworks for Computational Morphology: Second International Workshop, SFCM 2011, Zurich, Switzerland, August 26, 2011, ed. by C. Mahlow and M. Piotrowski, 98–118. Zurich, Switzerland: Springer Berlin Heidelberg.

Attia, M., P. Pecina, A. Toral, L. Tounsi, and J. van Genabith. 2011a. “An Open-Source Finite State Morphological Transducer for Modern Standard Arabic.” In Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, 125–133. Blois, France: Association for Computational Linguistics.

Attia, M., P. Pecina, A. Toral, and J. van Genabith. 2013. “A Corpus-Based Finite-State Morphological Toolkit for Contemporary Arabic.” Journal of Logic and Computation 24 (2): 455–472.

Chung, T. M. 2003. “A Corpus Comparison Approach for Terminology Extraction.” Terminology 9 (2): 221–246.

Church, K., and P. Hanks. 1990. “Word Association Norms, Mutual Information, and Lexicography.” Computational Linguistics 16 (1): 22–29.

Déjean, H., and E. Gaussier. 2002. “Une nouvelle approche à l’extraction de lexiques bilingues à partir de corpus comparables.” In Corpus Linguistics: Critical Concepts in Linguistics, ed. by W. Teubert and R. Krishnamurthy, 1–22. New York: Routledge.

DiCoInfo. 2017. [URL]. Accessed 30 March 2017.

Drouin, P. 2002. Acquisition automatique des termes: l’utilisation des pivots lexicaux spécialisés. Doctoral thesis. Université de Montréal.

Drouin, P. 2003. “Term Extraction Using Non-technical Corpora as a Point of Leverage.” Terminology 9 (1): 99–115.

Drouin, P. 2004. “Detection of Domain Specific Terminology Using Corpora Comparison.” In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 79–82. Lisbon, Portugal: ELRA – European Language Resources Association.

Fung, P. 1998. “A Statistical View on Bilingual Lexicon Extraction: from Parallel Corpora to Non-Parallel Corpora.” In The 3rd Conference of the Association for Machine Translation in the Americas (AMTA’98), 1–17. Langhorne, PA, USA: Springer Berlin Heidelberg.

Galisson, R. 1978. Recherches de lexicologie descriptive: la banalisation lexicale. Paris: University of Montréal.

Ghazzawi, N. 2016. Du terme prédicatif au cadre sémantique: méthodologie de compilation d’une ressource terminologique pour les termes arabes de l’informatique. Doctoral thesis. Université de Montréal.

Guilbert, L. 1973. “La spécificité du terme scientifique et technique.” Langue française 17 (1): 5–17.

Habash, N., and F. Sadat. 2006. “Arabic Preprocessing Schemes for Statistical Machine Translation.” In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, 49–52. New York, USA: Association for Computational Linguistics.

Habash, N., O. Rambow, and R. Roth. 2009. “MADA+ TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization.” In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), 102–109. Cairo, Egypt.

Habash, N. 2010. “Introduction to Arabic Natural Language Processing.” Synthesis Lectures on Human Language Technologies 3 (1): 1–187.

Kilgarriff, A. 2001. “Comparing Corpora.” International Journal of Corpus Linguistics 6 (1): 97–133.

Lafon, P. 1980. “Sur la variabilité de la fréquence des formes dans un corpus.” Mots 1 (1): 127–165.

Lebart, L., and A. Salem. 1994. Statistique textuelle. Paris: Dunod.

Lemay, C., M.-C. L’Homme, and P. Drouin. 2005. “Two Methods for Extracting Specific Single-Word Terms from Specialized Corpora: Experimentation and Evaluation.” International Journal of Corpus Linguistics 10 (2): 227–255.

L’Homme, M.-C. 2004. La terminologie: principes et techniques. Montréal, Canada: Les Presses de l’Université de Montréal.

L’Homme, M.-C. 2015. “Predicative Lexical Units in Terminology.” In Recent Advances in Language Production, Cognition and the Lexicon, ed. by N. Gala, R. Rapp, and G. Bel-Enguix, 75–93. Switzerland: Springer.

Lorente, M. 2007. “Les unitats lèxiques verbals dels textos especialitzats. Redefinició d’una proposta de classificació.” In Estudis de lingüístics i de lingüística aplicada en honor de M. Teresa Cabré Castellví. Volum II: De deixebles, ed. by M. Lorente, R. Estopà, J. Freixa, J. Martí, and C. Tebé, 365–380. Barcelona: Institut Universitari de Lingüística Aplicada de la Universitat Pompeu Fabra.

Mel’čuk, I., A. Clas, and A. Polguère. 1995. Introduction à la lexicologie explicative et combinatoire. Louvain-la-Neuve: Duculot.

Meyer, I. 2000. “Computer Words in Our Everyday Lives: How are They Interesting for Terminography and Lexicography?” In Proceedings of the Ninth EURALEX International Congress, EURALEX 2000, ed. by U. Heid, S. Evert, E. Lehmann, and C. Rohrer, 39–58. Stuttgart, Germany: Institut für Maschinelle Sprachverarbeitung.

Meyer, I., and K. Mackintosh. 2000. “When Terms Move into Our Everyday Lives: An Overview of De-terminologization.” Terminology 6 (1): 111–138.

Monsonego, S. 1969. “Ch. Muller: Étude de statistique lexicale. Le vocabulaire du théâtre de P. Corneille.” Langue française 3 (1): 107–110.

Muller, C. 1967. Étude de statistique lexicale, le vocabulaire du théâtre de Pierre Corneille. Paris: Larousse.

Muller, C. 1977. Principes et méthodes de statistique lexicale. Paris: Hachette.

Nelson, M. B. 2000. Corpus-based Study of the Lexis of Business English and Business English Teaching Materials. Unpublished Ph.D Thesis, University of Manchester, Manchester.

Rapp, R. 1999. “Automatic Identification of Word Translations from Unrelated English and German Corpora.” In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ed. by R. Dale and K. Church, 519–526. Stroudsburg, PA, USA: Association for Computational Linguistics.

Rayson, P., and R. Garside. 2000. “Comparing Corpora Using Frequency Profiling.” In Proceedings of the workshop on Comparing Corpora, 1–6. Stroudsburg, PA, USA: Association for Computational Linguistics.

Reppen, R. 2001. “Review of MONOCONC PRO and WORDSMITH TOOLS.” Language Learning & Technology 5 (3): 32–36.

Rey, A. 1979. La terminologie: noms et notions. Coll. “Que sais-je ?”. Paris: Presses universitaires de France.

Rondeau, G. 1984. Introduction à la terminologie. Chicoutimi, Québec: G. Morin.

Sager, J. C. 1990. A Practical Course in Terminology Processing. Amsterdam: John Benjamins.

Scott, M. 1997. “PC Analysis of Key Words – and Key Key Words.” System 25 (1): 233–345.

Teubert, W. 2009. “La linguistique de corpus: une alternative.” Semen. Revue de sémio-linguistique des textes et discours 27 (1): 185–211.

Toutanova, K., and C. Manning. 2000. “Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger.” In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), 63–70. Hong Kong: Association for Computational Linguistics.

Toutanova, K., D. Klein, C. D. Manning, and Y. Singer. 2003. “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network.” In Proceedings of HLT-NAACL, 173–180. Edmonton, Canada: Association for Computational Linguistics.

Xu, F., D. Kurz, J. Piskorski, and S. Schmeier. 2002. “A Domain Adaptive Approach to Automatic Acquisition of Domain Relevant Terms and their Relations with Bootstrapping.” In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), ed. by M. González Rodríguez and C. Paz Suarez Araujo, 134–145. Las Palmas, Canary Islands, Spain: European Language Resources Association (ELRA).
