Automatic term recognition and legal language: A shorter path to the lexical profiling of legal texts?

Marín, María José

doi:10.1075/hot.3.aut2

Part of

Handbook of Terminology: Volume 3. Legal Terminology
Edited by Łucja Biel and Hendrik J. Kockaert
[Handbook of Terminology 3] 2023
pp. 511–541

María José Marín | Universidad de Murcia

Natural Language Processing (NLP) tools offer language scholars a wide array of possibilities for examining, amongst other things, the lexicon of any text collection. This research set out to measure the precision of three automatic term recognition (ATR) methods (Chung 2003; Drouin 2003; Scott 2008a) by implementing them on two corpora of Spanish and British judicial decisions on the topic of immigration. In addition, the last section of this chapter explores the lexical inventories extracted by each method (the top 500 candidate terms (CTs) in each case) by grouping them into ad hoc thematic categories. The most numerous category, as was to be expected, is legal terms, followed by territory, evaluative items, crime and family.

Keywords: ATR methods, legal corpora, terminology, immigration, judicial decisions
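The precision measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes that precision is the share of the top-ranked candidate terms confirmed by a reference list of valid terms; the chapter's actual validation procedure is not described on this page, and the function name `precision_at_k`, the gold standard and the toy data are all hypothetical.

```python
def precision_at_k(ranked_candidates, gold_terms, k=500):
    """Share of the top-k candidate terms that appear in the gold standard."""
    top_k = ranked_candidates[:k]
    if not top_k:
        return 0.0
    # Case-insensitive matching is an assumption, not the chapter's rule.
    gold = {term.lower() for term in gold_terms}
    hits = sum(1 for term in top_k if term.lower() in gold)
    return hits / len(top_k)


# Toy data invented for illustration; not taken from the chapter's corpora.
ranked = ["asylum", "appellant", "said", "tribunal", "week"]
gold = {"asylum", "appellant", "tribunal", "deportation"}

print(precision_at_k(ranked, gold, k=5))  # 0.6
```

With the default `k=500`, the same function would score the top 500 CTs returned by each method, which is the cut-off the abstract mentions.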

Article outline

1. Introduction
2. ATR and legal language
3. Methodology
- 3.1 Method description
  - 3.1.1 Keywords
  - 3.1.2 TermoStat
  - 3.1.3 Chung
- 3.2 Corpus description
- 3.3 Method implementation
4. Results and discussion
- 4.1 Method validation
- 4.2 Thematic term categories
  - 4.2.1 Corpus-driven semantic classification
  - 4.2.2 Semantic categorization using UMUTextStats
5. Conclusion
Notes
References

Available under the Creative Commons Attribution-NoDerivatives (CC BY-ND) 4.0 license.

For any use beyond this license, please contact the publisher.

Published online: 1 December 2023

References

Ahmad, Khurshid, Andrea Davies, Heather Fulford, and Margaret Rogers. 1994. “What is a Term? The Semi-automatic Extraction of Terms from Text.” In Translation Studies: An Interdiscipline, edited by Mary Snell-Hornby, Franz Pöchhacker and Klaus Kaindl, 267–278. Amsterdam: John Benjamins.

Ananiadou, Sophia. 1988. A Methodology for Automatic Term Recognition. PhD Thesis, University of Manchester Institute of Science and Technology: United Kingdom.

Ananiadou, Sophia. 1994. “A Methodology for Automatic Term Recognition.” In COLING. Proceedings of the 15th International Conference on Computational Linguistics, 1034–1038.

Anthony, Laurence. 2020. AntConc (Version 3.5.9) [Computer Software]. Tokyo: Waseda University. [URL]

Aronson, Alan R. and François-Michel Lang. 2010. “An Overview of MetaMap: Historical Perspective and Recent Advances.” Journal of the American Medical Informatics Association 17(3):229–236.

Arora, Chetan, Mehrdad Sabetzadeh, Lionel Briand, and Frank Zimmer. 2016. “Automated Extraction and Clustering of Requirements Glossary Terms.” IEEE Transactions on Software Engineering 43(10):918–945.

Astrakhantsev, Nikita. 2018. “ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala.” Language Resources and Evaluation 52:853–872.

Barrón-Cedeño, Alberto, Gerardo E. Sierra, Patrick Drouin, and Sophia Ananiadou. 2009. “An Improved Automatic Term Recognition Method for Spanish.” In International Conference on Intelligent Text Processing and Computational Linguistics, edited by Alexander Gelbukh, 125–136. Berlin: Springer.

Bernier-Colborne, Gabriel. 2012. “Defining a Gold Standard for the Evaluation of Term Extractors.” In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk and Stelios Piperidis, 15–18. European Language Resources Association (ELRA). [URL]

Bisceglia, Bruno, Rita Calabrese, and Ljubica Leone. 2014. “Combining Critical Discourse Analysis and NLP Tools in Investigations of Religious Prose.” In LRE-REL2. Proceedings of the 2nd Workshop on Language Resources and Evaluation for Religious Texts. 31 May 2014, Reykjavik, Iceland, edited by Claire Brierley, Majdi Sawalha and Eric Atwell, 24–29. [URL]

Bourigault, Didier. 1992. “Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases.” In COLING 1992 – Volume 3: The 14th International Conference on Computational Linguistics, 977–981. [URL].

Cabré Castellví, Maria Teresa. 1999. Terminology: Theory, Methods and Applications. Amsterdam: John Benjamins.

Cabré Castellví, Maria Teresa, Rosa Estopà Bagot, and Jordi Vivaldi Palatresi. 2001. “Automatic Term Detection: A Review of Current Systems.” In Recent Advances in Computational Terminology 2, edited by Didier Bourigault, Christian Jacquemin and Marie-Claude L’Homme, 53–87. Amsterdam: John Benjamins.

Chung, Teresa Mihwa. 2003. “A Corpus Comparison Approach for Terminology Extraction.” Terminology 9(2):221–246.

Church, Kenneth and Patrick Hanks. 1990. “Word Association Norms, Mutual Information, and Lexicography.” Computational Linguistics 16(1):22–29.

Coxhead, Averil. 2000. “A New Academic Word List.” TESOL Quarterly 34(2):213–238.

Dagan, Ido and Kenneth Church. 1994. “Termight: Identifying and Translating Technical Terminology.” In Fourth Conference on Applied Natural Language Processing, 34–40. Association for Computational Linguistics.

Daille, Béatrice. 1996. “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology.” In The Balancing Act: Combining Symbolic And Statistical Approaches To Language, edited by Judith L. Klavans and Philip Resnik, 49–66. Cambridge, MA: MIT Press.

David, Sophie and Pierre Plante. 1990. Termino 1.0. Research Report of Centre d’Analyse de Textes par Ordinateur. Montréal: Université du Québec.

Drouin, Patrick. 2003. “Term Extraction Using Non-technical Corpora as a Point of Leverage.” Terminology 9(1):99–115.

Dunning, Ted E. 1993. “Accurate Methods for the Statistics of Surprise and Coincidence.” Computational Linguistics 19(1):61–74.

Fahmi, Ismail, Gosse Bouma, and Lonneke Van Der Plas. 2007. “Improving Statistical Method Using Known Terms for Automatic Term Extraction” (conference talk). Conference: Computational Linguistics in the Netherlands (CLIN 17), November 2007 (unpublished).

Frantzi, Katerina T. and Sophia Ananiadou. 1996. “Extracting Nested Collocations.” In COLING 1996 – Volume 1: The 16th International Conference on Computational Linguistics, 41–46. USA: Association for Computational Linguistics. [URL].

Frantzi, Katerina T. and Sophia Ananiadou. 2000. “Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method.” International Journal on Digital Libraries 3:115–130.

Gabrielatos, Costas and Anna Marchi. 2011. “Keyness: Matching Metrics to Definitions” (conference talk). Conference: Corpus Linguistics in the South: Theoretical-methodological challenges in corpus approaches to discourse studies – and some ways of addressing them, 5th November, Portsmouth (unpublished). [URL]

García-Díaz, José Antonio, Mar Cánovas-García, and Rafael Valencia-García. 2020. “Ontology-driven Aspect-based Sentiment Analysis Classification: An Infodemiological Case Study Regarding Infectious Diseases in Latin America.” Future Generation Computer Systems 112:641–657.

García-Díaz, José Antonio, María Pilar Salas-Zárate, María Luisa Hernández-Alcaraz, Rafael Valencia-García, and Juan Miguel Gómez-Berbís. 2018. “Machine Learning Based Sentiment Analysis on Spanish Financial Tweets.” In Trends and Advances in Information Systems and Technologies (WorldCIST’18 2018). Advances in Intelligent Systems and Computing, Vol. 745, edited by Álvaro Rocha, Hojjat Adeli, Luís Paulo Reis and Sandra Costanzo, 305–311. Cham: Springer.

Heylen, Kris and Dirk De Hertog. 2015. “Automatic Term Extraction.” In Handbook of Terminology, edited by Hendrik Kockaert and Frieda Steurs, 203–221. Amsterdam: John Benjamins.

Jacquemin, Christian. 2001. Spotting and Discovering Terms through NLP. Cambridge, MA: MIT Press.

Jumaquio-Ardales, Alona, Nathaniel Oco, and Rowell Madula. 2017. “Click-analysis of a Lesbian Online Community in Facebook Using the Critical Discourse Analysis and Natural Language Processing.” Humanities Diliman: A Philippine Journal of Humanities 14(1):46–68.

Jurafsky, Daniel and James H. Martin. 2019. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River: Els autors.

Justeson, John S. and Slava M. Katz. 1995. “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text.” Natural Language Engineering 1(1):9–27.

Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. “The Sketch Engine: Ten Years On.” Lexicography 1:7–36.

Kit, Chunyu and Xiaoyue Liu. 2008. “Measuring Mono-word Termhood by Rank Difference via Corpus Comparison.” Terminology 14(2):204–229.

Lemay, Chantal, Marie-Claude L’Homme, and Patrick Drouin. 2005. “Two Methods for Extracting ‘Specific’ Single-word Terms from Specialised Corpora: Experimentation and Evaluation.” International Journal of Corpus Linguistics 10(2):227–255.

Loginova, Elizaveta, Anita Gojun, Helena Blancafort, Marie Guégan, Tatiana Gornostay, and Ulrich Heid. 2012. “Reference Lists for the Evaluation of Term Extraction Tools.” Paper at the 10th Terminology and Knowledge Engineering Conference: New Frontiers in the Constructive Symbiosis of Terminology and Knowledge Engineering (TKE 2012), Madrid, Spain. [URL]

Marín, María José. 2014. “Evaluation of Five Single-Word Term Recognition Methods on a Legal Corpus.” Corpora 9(1):83–107.

Marín, María José. 2016. “Measuring the Degree of Specialisation of Sub-Technical Legal Terms through Corpus Comparison: a Domain-Independent Method.” Terminology 22(1):80–102.

Maynard, Diana and Sophia Ananiadou. 2000. “TRUCKS: A Model for Automatic Multi-word Term Recognition.” Journal of Natural Language Processing 8(1):101–125.

Meijer, Kevin, Flavius Frasincar, and Frederik Hogenboom. 2014. “A Semantic Approach for Extracting Domain Taxonomies from Text.” Decision Support Systems 62:78–93.

Mondary, Thibault, Adeline Nazarenko, Haïfa Zargayouna, and Sabine Barreaux. 2012. “The Quaero Evaluation Initiative on Term Extraction.” In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk and Stelios Piperidis, 663–669. European Language Resources Association (ELRA). [URL]

Pazienza, Maria Teresa, Marco Pennacchiotti, and Fabio Massimo Zanzotto. 2005. “Terminology Extraction: An Analysis of Linguistic and Statistical Approaches.” In Knowledge Mining. Studies in Fuzziness and Soft Computing, Vol. 185, edited by Spiros Sirmakessis, 255–279. Berlin: Springer.

Pennebaker, J. W. and M. Francis. 1999. Linguistic Inquiry and Word Count: LIWC. Erlbaum Publishers.

Rayson, Paul and Roger Garside. 2000. “Comparing Corpora Using Frequency Profiling.” In WCC ’00: Proceedings of the Workshop on Comparing Corpora, Vol. 9, 1–6.

Reese, Samuel, Gemma Boleda, Montse Cuadros, Lluís Padró, and German Rigau. 2010. “Wikicorpus: A Word-Sense Disambiguated Multilingual Wikipedia Corpus.” In Proceedings of 7th Language Resources and Evaluation Conference (LREC’10), edited by Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner and Daniel Tapias, 1418–1421. European Language Resources Association (ELRA).

Schmid, Helmut. 1999. “Improvements in Part-of-Speech Tagging with an Application to German.” In Natural Language Processing Using Very Large Corpora, edited by Susan Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne Tzoukermann, and David Yarowsky, 13–25. Springer.

Scott, Mike. 2008a. WordSmith Tools, Version 5. Liverpool: Lexical Analysis Software.

Scott, Mike. 2008b. WordSmith Tools Help. Stroud: Lexical Analysis Software.

Shang, Jingbo, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. “Automated Phrase Mining from Massive Text Corpora.” IEEE Transactions on Knowledge and Data Engineering 30(10):1825–1837.

Spasic, Irena, Sophia Ananiadou, John McNaught, and Anand Kumar. 2005. “Text Mining and Ontologies in Biomedicine: Making Sense of Raw Text.” Briefings in Bioinformatics 6(3):239–251.

Vivaldi, Jorge, Luis Adrián Cabrera-Diego, Gerardo Sierra, and María Pozzi. 2012. “Using Wikipedia to Validate the Terminology Found in a Corpus of Basic Textbooks.” In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk and Stelios Piperidis, 3820–3827. European Language Resources Association (ELRA). [URL]