The role of Semantic Web technologies in legal terminology
1. Introduction
The last decade has witnessed the growth of technological solutions for law firms and legal services. The term legaltech has become a buzzword as more and more technology start-ups have emerged to transform several aspects of the legal services industry (Dale 2019). Legaltech, the short form of legal technology, is generally defined as “technologies from Computer Science that are applied to a range of areas related to legal practice and materials” (Nazarenko and Wyner 2017). In light of this technological revolution in legal practice, language resources have also become necessary to support services that rely on Natural Language Processing (NLP) and Artificial Intelligence (AI) technologies. As stated by Nazarenko and Wyner (2017), legal NLP plays a major role in tasks such as document drafting and revision, legal research and document automation.
For machines to provide support in these tasks, NLP technologies usually need to be adapted to, or trained on, the domain-specific language used in the documents. Therefore, tools must be trained on legal jargon or rely on manually created terminological resources (Zhong et al. 2020). The well-known complexities of legal language, nicely summarized in Alcaraz Varó, Hughes, and Gómez (2002) and Haigh (2004), pose an added challenge here: even in a simple keyword-based search, the choice of legal expression may affect the results. In addition, legal sub-areas codify different terminologies that need to be specifically collected for the processing of legal documents in those sub-areas. These terminologies are not always available for reuse, since they are often generated for internal use (not publicly available), published in unstructured formats, or subject to a fee.
In addition to the resource availability problem, other terminologies, glossaries and thesauri have been created for direct consultation by humans. This means that data is shown through a graphical user interface and is not available for integration in NLP tools. This hinders automatic querying as well as maintenance, preventing constant and automatic updates as new terms are created. Finally, users also face difficulties when searching for language resources in the legal area, since these are not easily findable due to the lack of rich metadata descriptors associated with them. To mitigate this, several initiatives in Europe have pursued the creation of specific catalogues of legal language resources (http://data.lynx-project.eu/) or of terminology resources in general (https://termcoord.eu/).
For all these reasons, the first step in the reuse and integration of language resources in NLP tools involves their conversion into standard machine-readable formats. The main objective of these formats is to represent every data item contained in a resource in such a way that it is uniquely and unambiguously identifiable, accessible, and easy to integrate. One of the most relevant examples of the modernization of a terminological database for its integration in a computer-assisted translation environment (CATE) is the recently launched version of IATE, the term base of the language services of the European Union. IATE has been redeveloped to adapt “the technologies, architecture and data structure of the system in order to prepare it for future challenges, including interoperability, modularity, scalability and data exchange” (Zorrilla-Agut and Fontenelle 2019, 146). One of the first consequences of this transformation is that IATE data can be directly accessed from CATEs by means of complex queries that combine several fields supported by its new data structure.
Data exchange between services, in machine-to-machine communication, is precisely what the formats and technologies of the Semantic Web enable. This supports our claim that terminology resources in the legal domain should adopt these standards to guarantee efficient integration in legal NLP tools. In Section 2, we therefore describe the Semantic Web in more detail. Section 3 is devoted to the Linguistic Linked Open Data cloud initiative, an effort to publish and link language resources with open licenses on the Web, for immediate reuse and integration in third-party applications. Examples of legal language resources (thesauri and terminologies) are given in Section 4, and the models used to represent those resources according to the Semantic Web standards are described in Section 5. The benefits of publishing terminological resources in Semantic Web formats and interlinking them in the Linguistic Linked Open Data cloud are spelled out in Section 6 and exemplified in Section 7. We conclude the chapter in Section 8, making a plea for the adoption of Semantic Web standards in the publication of legal terminological resources.
2. The Semantic Web at a glance
Much of the content on the web is intended for human consumption, published in unstructured or semi-structured formats such as .pdf, .txt, .doc, .csv or .html. Their heterogeneity and the fact that some of them are not machine-readable pose many problems when Artificial Intelligence techniques such as Information Retrieval, Document Classification or Machine Translation are applied with the aim of giving users easier access to information. Content in these documents is meant to be interpreted by human users; to the machine, it is merely a set of unrelated words. This limits the search for information to keyword search, as we know it from the well-known commercial search engines (Benjamins et al. 2002).
For this reason, the World Wide Web Consortium (W3C) works to steer the growth of the Web in an organized way, promoting the publication of data in structured, machine-readable formats in which the meaning of words is encoded and can be interpreted by machines. This evolution is known as the Semantic Web or the Web of Data, whose main idea is that not only documents are connected, but also the information contained in them (Berners-Lee, Hendler, and Lassila 2001).
The most common model for publishing data on the Semantic Web is the Resource Description Framework (RDF). This format supports the description of concepts, the representation of information and the interchange of data on the web. The information unit in RDF is the triple (https://www.w3.org/TR/rdf11-primer/#section-triple/), a subject-predicate-object structure that represents the information as entities (subject and object) connected by relations (predicates), as shown in Figure 1. These entities are identified by a Uniform Resource Identifier (URI), that is, a unique identifier or ID for an entity within a certain resource. For instance, if we navigate through the content of Wikidata (https://www.wikidata.org/), a free and open knowledge base that stores the structured data of Wikimedia projects (https://meta.wikimedia.org/wiki/Wikimedia_movement/), we can find relations (also called properties; https://www.wikidata.org/wiki/Wikidata:List_of_properties/) such as is_capital_of between the entities Berlin and Germany. These three elements are precisely identified and given meaning, since they belong to a broader hierarchical structure that places them in a wider concept scheme.
RDF is at the core of the Linked Open Data paradigm for publishing information, which is based on the following four Linked Data Principles (Bizer, Heath, and Berners-Lee 2011):
- Entities should be identified via unique URIs.
- These URIs should be HTTP URIs and follow standard web protocols.
- These URIs should return useful information about the resource.
- They should contain links to other URIs pointing at related resources.
Returning to the Berlin example, we could link a new entity, Willi Stoph, to the existing entities Berlin and Germany, with their corresponding relations. Figure 2 represents the following triples:
- Berlin is the capital of Germany.
- Willi Stoph was born in Berlin.
- The country of citizenship of Willi Stoph is Germany.
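To make the triple structure more tangible, the three statements above can be sketched as plain subject-predicate-object tuples in Python. This is only an illustration: the base URI `http://example.org/`, the predicate names and the helper `match` are assumptions made for this sketch and are not the identifiers that Wikidata actually uses.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical base URI and predicate names, used only for illustration.
EX = "http://example.org/"

TRIPLES = [
    Triple(EX + "Berlin", EX + "is_capital_of", EX + "Germany"),
    Triple(EX + "Willi_Stoph", EX + "place_of_birth", EX + "Berlin"),
    Triple(EX + "Willi_Stoph", EX + "country_of_citizenship", EX + "Germany"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that agree with every argument that is not None."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything the graph states about Willi Stoph.
about_stoph = match(TRIPLES, subject=EX + "Willi_Stoph")

# Every statement whose object is Germany.
pointing_at_germany = match(TRIPLES, obj=EX + "Germany")
```

Because every element is a URI rather than a bare string, the same entity can be referred to from any other dataset, which is what makes linking across resources possible.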
If we transform all the information related to Berlin into Linked Data, including different entities possibly connected to external resources, we will have woven a rich graph data structure, nowadays much appreciated under the term Knowledge Graph (https://www.ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-graph/). Machines can navigate through the data in a graph and infer knowledge that would otherwise remain hidden.
This knowledge inference is possible thanks to the use of ontologies to organize the information. Ontology is a concept that originally belongs to the philosophical domain, where it is defined as “the science of what is, of the kinds and structures of objects, properties, events, processes, and relations in every area of reality” (Smith 2008, 155). In Information Science, an ontology is understood as a model or vocabulary to represent the concepts of a certain domain (Chandrasekaran, Josephson, and Benjamins 1999), and it is composed of classes, relations, rules and restrictions. Thus, continuing with the previous examples, in the Wikidata ontology Berlin is an instance of the capital class, represented by the ID wd:Q5119; Germany is an instance of the country class, represented by the ID wd:Q6256; and Willi Stoph is an instance of human, represented by the ID wd:Q5 (see Figure 3).
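As a rough sketch of how class membership supports this kind of reasoning, the following Python fragment stores "instance of" statements next to ordinary relations and then combines them to answer a question that no single triple states: which humans were born in a capital. The class IDs are those mentioned above, written with the `wd:` prefix that Wikidata uses for items; the `ex:` names for the individual entities and for the birth relation are placeholders, since their actual IDs are not given here.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

INSTANCE_OF = "wdt:P31"   # "instance of" property
BORN_IN = "ex:bornIn"     # placeholder name for the birth relation

TRIPLES: Set[Triple] = {
    ("ex:Berlin", INSTANCE_OF, "wd:Q5119"),      # Berlin is a capital
    ("ex:Germany", INSTANCE_OF, "wd:Q6256"),     # Germany is a country
    ("ex:Willi_Stoph", INSTANCE_OF, "wd:Q5"),    # Willi Stoph is a human
    ("ex:Willi_Stoph", BORN_IN, "ex:Berlin"),
}


def instances_of(triples: Set[Triple], cls: str) -> Set[str]:
    """Collect every subject declared as an instance of the given class."""
    return {s for s, p, o in triples if p == INSTANCE_OF and o == cls}


def humans_born_in_a_capital(triples: Set[Triple]) -> List[str]:
    """Join two class memberships with one relation."""
    humans = instances_of(triples, "wd:Q5")
    capitals = instances_of(triples, "wd:Q5119")
    return sorted(
        s for s, p, o in triples
        if p == BORN_IN and s in humans and o in capitals
    )


result = humans_born_in_a_capital(TRIPLES)
```

The same join, expressed declaratively, is what a SPARQL query performs over a full knowledge graph.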
This method of representing knowledge makes it possible to retrieve complex pieces of information with a single query. One way to query knowledge bases such as Wikidata is the SPARQL language (https://www.w3.org/TR/rdf-sparql-query/), the standard query language to retrieve RDF data from a SPARQL Endpoint. The main advantage of SPARQL compared to other query languages, such as SQL, is that it can efficiently extract information from non-uniform data possibly stored on different servers. Machines serving a SPARQL Endpoint enable a new sort of computer application that takes advantage of distributed knowledge. In this context, users can access the Wikidata SPARQL Endpoint (https://query.wikidata.org/) and, with a single query, retrieve, for instance, a list of all countries in Europe with their corresponding capitals, as shown in Listing 1.
The query shown in Listing 1 asks for two variables, ?capital and ?country, that must satisfy three conditions, where the IDs represent the information described in Table 1:
- ?country needs to be an instance of the class country (?country wdt:P31 wd:Q6256)
- ?country needs to be located on the continent of Europe (?country wdt:P30 wd:Q46)
- ?capital needs to be a capital of ?country (?capital wdt:P1376 ?country)
ID | Type of element | Description |
---|---|---|
wdt:P31 | property | instance of |
wd:Q6256 | class | country |
wdt:P30 | property | continent |
wd:Q46 | entity | Europe |
wdt:P1376 | property | is capital of |
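Since the three conditions map directly onto triple patterns, a query of this kind might be written roughly as follows. This is only a sketch reconstructed from the conditions and IDs above, not the original listing. Python is used to hold the query text and to build a request URL without sending it; the endpoint address `https://query.wikidata.org/sparql`, the `format` parameter and the helper name `request_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed address of the public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# One triple pattern per condition; prefixes are declared explicitly so
# that the query does not depend on endpoint defaults.
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?capital ?country WHERE {
  ?country wdt:P31 wd:Q6256 .
  ?country wdt:P30 wd:Q46 .
  ?capital wdt:P1376 ?country .
}
""".strip()


def request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Build a GET URL carrying the query; nothing is sent over the network."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = request_url(QUERY)
```

A variable that appears in several patterns, such as ?country, acts as the join point: only bindings that satisfy all three patterns at once are returned.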
Note that, as mentioned before, resources in RDF are identified by URIs. Consequently, the results of this query are the URIs used by Wikidata to identify those countries and capitals, which are composed of a base URI (http://www.wikidata.org/entity/) plus the ID of each element. If we want to know the names that correspond to those URIs, we need to ask for them explicitly; in this context they are called labels, as shown in Listing 2.
In the query shown in Listing 2, we add two lines asking for the labels of the items ?capital and ?country. In this resource, labels are represented by the property rdfs:label. Consequently, we also add the variables ?capitalLabel and ?countryLabel, together with language filters to retrieve only the names in one language, in this case English. Otherwise, we would get the labels in every language available in the knowledge base. Finally, we add a rule to order the results alphabetically by country label. Table 2 shows the first five results of this query.
capital | country | capitalLabel | countryLabel |
---|---|---|---|
wd:Q19689 | wd:Q222 | Tirana | Albania |
wd:Q1863 | wd:Q228 | Andorra la Vella | Andorra |
wd:Q1741 | wd:Q40 | Vienna | Austria |
wd:Q47 | wd:Q219 | Sofia | Bulgaria |
wd:Q1435 | wd:Q224 | Zagreb | Croatia |
… | … | … | … |
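The additions described for the second query (label patterns, language filters and ordering) could be sketched as below. Again, this is a reconstruction from the description and not the original listing; the function name `capitals_query` and the idea of passing the language as a parameter are assumptions introduced for illustration.

```python
from string import Template

# $lang is the only placeholder; SPARQL variables use "?" and are untouched.
QUERY_TEMPLATE = Template("""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?capital ?country ?capitalLabel ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .
  ?country wdt:P30 wd:Q46 .
  ?capital wdt:P1376 ?country .
  ?capital rdfs:label ?capitalLabel .
  ?country rdfs:label ?countryLabel .
  FILTER(LANG(?capitalLabel) = "$lang")
  FILTER(LANG(?countryLabel) = "$lang")
}
ORDER BY ?countryLabel
""".strip())


def capitals_query(lang: str = "en") -> str:
    """Return the query text with labels restricted to one language tag."""
    # Accept only simple tags such as "en" or "pt-br", so that the value
    # cannot break out of the quoted string in the FILTER clauses.
    if not (lang.isascii() and lang.replace("-", "").isalnum()):
        raise ValueError(f"unexpected language tag: {lang!r}")
    return QUERY_TEMPLATE.substitute(lang=lang)


english_query = capitals_query("en")
```

Without the two FILTER clauses, each capital and country pair would be repeated once for every combination of label languages stored in the knowledge base.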
In summary, this section has given an overview of the distinctive features and advantages of the Semantic Web: structured representation of data, open access, knowledge inference and access to complex information with a single query. Building SPARQL queries is not as easy as searching for information through a search interface; SPARQL is therefore intended less for human end users than for machines. It is, however, far more expressive than keyword search, and search interfaces can be built on top of knowledge structured in RDF, thus offering its power and speed to human users.
3. The Linguistic Linked Open Data cloud
The advantages of combining RDF and ontologies were rapidly demonstrated, and several initiatives to publish data according to the Linked Data principles arose. The most important, as mentioned in the previous section, is the Linked Open Data project, which pursues the publication of Linked Data under open licenses. This project gave birth to the Linked Open Data cloud (http://lod-cloud.net; Bizer, Heath, and Berners-Lee 2011), the main source of Linked Data. It can be divided into sub-clouds per area of expertise, such as the Geography cloud, the Governmental cloud, the Media cloud, etc., each of them composed of interlinked datasets belonging to that specific field. In this context, the most relevant sub-cloud is the Linguistic Linked Open Data cloud (LLOD cloud; http://linguistic-lod.org/llod-cloud) (see Figure 4).
Resources in the Linguistic Linked Open Data cloud are classified (by colours, as shown in Figure 4) depending on their typology: (1) Corpora, (2) Lexicons and Dictionaries, (3) Terminologies, Thesauri and Knowledge Bases, (4) Linguistic Resource Metadata, (5) Linguistic Data Categories, (6) Typological Databases and (7) Other. The interactive LLOD diagram (https://lod-cloud.net/versions/latest/linguistic-lod.svg) shows the links between resources and allows navigation among them. Its main drawback, however, is that the datasets are not classified by domain, so it is difficult to identify their subject.
Some data catalogues, which are in effect repositories of metadata, have been created, such as LingHub (http://linghub.org/; McCrae and Cimiano 2015). LingHub contains the metadata, that is, the data describing the resources in the LLOD cloud (author, date of creation, domain, language, etc.), so that resources can be grouped by domain. Nonetheless, the results are not refined enough, and a constrained search may be difficult to perform. For the purposes of this work, we have performed a lookup through the resources in the LLOD cloud and selected the most relevant ones for the legal domain; that is, we have identified legal language resources that are represented in Linked Data formats and are openly available.
4. Legal language resources in the Semantic Web
Although the presence of legal language resources in machine-readable formats in general, and in the Web of Data specifically, is relatively low, several efforts have been made that are of interest to this work:
- EuroVoc: the multilingual and multidisciplinary thesaurus that covers the activities of the European Union, containing terms in 22 languages, was originally published in XML-Eurovoc. The presence of legal content in this thesaurus is notable. After much discussion and various proposals, EuroVoc was published as Linked Data and linked to other relevant thesauri (Alvite Díez et al. 2010). It can be accessed through the EU Vocabularies SPARQL Endpoint (https://op.europa.eu/en/advanced-sparql-query-editor).
- ECLAS: a thesaurus created by the Central Library of the Commission of the European Communities for indexing its publications and documents (http://publications.europa.eu/resource/dataset/eclas). ECLAS is also an interdisciplinary thesaurus but, as in the case of EuroVoc, it contains a considerable number of terms related to the legal domain in English and French.
- International Labour Organization Thesaurus: this asset is published as a resource interlinked with the ECLAS thesaurus. The ILO thesaurus (https://metadata.ilo.org/thesaurus.html) contains terms from the labour law domain in English, French and Spanish.
- UNESCO Thesaurus: a controlled list of terms intended for the subject analysis of texts and document retrieval, developed by UNESCO, containing terms on several domains such as education, politics, culture and social sciences. It is published in English, French, Spanish and Russian.
- TheSoz: the Thesaurus for Social Sciences is a German thesaurus for the domain of the social sciences, and a very important instrument for information retrieval, document indexing and search term recommendation. It contains terms in English, French and German (Zapilko et al. 2013).
- STW Thesaurus for Economics: a thesaurus that provides a vocabulary on any economic subject. It also contains terms used in law, sociology and politics. In this case, the thesaurus is bilingual, with terms in English and German (Neubert 2009).
- IATE RDF: the RDF version of the Inter-Active Terminology for Europe (IATE) is one of the most representative terminological resources in the LLOD cloud. It contains more than 8 million multilingual and cross-domain terms. A dump of its TBX version was converted into RDF by Cimiano, McCrae et al. (2015). It was also linked with the European Migration Network glossary and is one of the most relevant works of terminology conversion into Semantic Web formats.
- Copyright Termbank: similar work was done by Rodríguez-Doncel et al. (2015), who published a multilingual term bank of copyright-related terms with links to WIPO definitions, IATE terms, definitions from Creative Commons licenses, DBpedia (https://www.dbpedia.org/) and Lexvo (http://www.lexvo.org/).
- Terminoteca RDF: this project gathers two sets of resources: Terminesp, a multilingual terminological database developed by the Spanish Association for Terminology (http://www.aeter.org/), and terminological glossaries from the Terminología Oberta service of the Catalan Terminological Centre (TERMCAT; http://www.termcat.cat/en). The result is a multilingual repository (http://linguistic.linkeddata.es/terminoteca/) of linked terminologies from different areas of expertise, including the legal domain (Bosque-Gil, Montiel-Ponsoda et al. 2016).
In general, these resources have been employed in various research projects at both national and European level. In fact, EuroVoc is in constant use by EU organizations. Beyond the non-exhaustive list above, many efforts have been made to document other relevant resources in the legal domain that are available on the web. The Lynx project data portal (http://data.lynx-project.eu/) is a good example of this.
Outside the legal domain, we can find many other linguistic resources structured in RDF. The most important are WordNet (https://en-word.net/), BabelNet (https://babelnet.org/) and ConceptNet (https://conceptnet.io/), among others. In the following section, we describe some of the modelling approaches to represent different types of information.
5. Models to represent linguistic information
The language resources mentioned above along with those that are part of the Linguistic Linked Open Data cloud are published following different RDF vocabularies, depending on the nature of each resource (structure, content, objectives, etc.). Some of the commonest vocabularies to represent linguistic information are briefly listed as follows:
- lemon, the Lexicon Model for Ontologies, is intended to represent lexical information about a given term, such as its senses, forms and abbreviations, to mention but a few (McCrae, Aguado-de-Cea et al. 2012).
- Ontolex is the evolution of lemon, and it is supported by the W3C Ontology-Lexica Community Group (https://www.w3.org/community/ontolex/). Neither lemon nor Ontolex was originally conceived to represent lexica as Linked Data, but to lexicalize formal ontologies. However, Ontolex became the de facto standard to represent and interchange lexical data in the Semantic Web, since the model is able to represent different senses (ontolex:LexicalSense), pointing at different concepts (ontolex:LexicalConcept), of the same lexical entry (ontolex:LexicalEntry). Therefore, Ontolex represents terms, synonyms and translations as classes, which allows modelling additional information about these elements (Cimiano, McCrae et al. 2015).
- LIR, the Linguistic Information Repository, was intended for the localization of ontologies, catering for the representation of translations and term types (Montiel-Ponsoda et al. 2011).
- LexInfo associates additional linguistic information with elements in an ontology (Cimiano, Buitelaar et al. 2011).
- SKOS, the Simple Knowledge Organization System, structures thesauri and taxonomies, easing the creation of hierarchical relations between terms. It is widely used within the Semantic Web context since it can be combined with formal representation languages, such as the Web Ontology Language (OWL) (Miles and Bechhofer 2009).
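The entry, sense and concept structure mentioned for Ontolex in the list above can be illustrated with a small Python sketch. It mirrors only the three classes named in the text; the example word, the concept URIs under `http://example.org/` and the helper `is_ambiguous` are assumptions made for this illustration, and the property names in the comments are given as we understand the model rather than as a complete account of it.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class LexicalConcept:
    """Counterpart of ontolex:LexicalConcept, identified by a URI."""
    uri: str


@dataclass(frozen=True)
class LexicalSense:
    """Counterpart of ontolex:LexicalSense; one sense points at one concept
    (in the model itself, via a property such as ontolex:isLexicalizedSenseOf)."""
    concept: LexicalConcept


@dataclass
class LexicalEntry:
    """Counterpart of ontolex:LexicalEntry; an entry may carry several senses."""
    written_form: str
    language: str
    senses: List[LexicalSense] = field(default_factory=list)

    def concepts(self) -> Set[LexicalConcept]:
        return {sense.concept for sense in self.senses}

    def is_ambiguous(self) -> bool:
        """An entry is ambiguous when its senses point at more than one concept."""
        return len(self.concepts()) > 1


# Hypothetical example: the English word "sentence" in a grammatical
# and in a judicial reading.
sentence = LexicalEntry(
    written_form="sentence",
    language="en",
    senses=[
        LexicalSense(LexicalConcept("http://example.org/concept/grammatical-sentence")),
        LexicalSense(LexicalConcept("http://example.org/concept/judicial-sentence")),
    ],
)
```

Because senses are objects in their own right rather than plain strings, further information (a usage note, a source, a jurisdiction) can be attached to each sense separately.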
Choosing the most appropriate vocabulary is an important step in any reliable methodology for publishing resources according to the Linked Data paradigm (Vila Suero et al. 2014). Such a methodology stresses the importance of pre-processing the data, choosing a sound URI naming strategy, selecting the right technology for RDF generation and reliably linking with other datasets in the cloud.
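To give an idea of what a URI naming strategy may involve in practice, the following sketch derives a stable, readable URI from a term label. It is one possible convention among many and is not prescribed by the methodology cited above; the base URI `http://mysampleuri/` is a placeholder, and the function name `mint_concept_uri` is hypothetical.

```python
import re
import unicodedata

# Placeholder base URI for the resource being published.
BASE_URI = "http://mysampleuri/"


def mint_concept_uri(label: str, base_uri: str = BASE_URI) -> str:
    """Turn a term label into a lowercase, hyphen-separated ASCII URI."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", label)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of characters other than a-z and 0-9 into a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a URI from label {label!r}")
    return base_uri + slug


uri_en = mint_concept_uri("Collective agreement")
uri_es = mint_concept_uri("Negociación colectiva")
```

A label-based scheme of this kind is easy for humans to read, but it ties the URI to one language and one spelling; opaque numeric IDs, as used by Wikidata, avoid that dependency at the cost of readability.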
More information about models to represent linguistic Linked Data can be found in Bosque-Gil, Gracia et al. (2018). Still, most of the resources mentioned above are published according to the SKOS vocabulary (https://www.w3.org/TR/swbp-skos-core-spec/), since it is aimed at representing the structure of knowledge organization systems such as thesauri and taxonomies and has allowed the conversion of available resources. However, to represent general-domain resources such as dictionaries, whose entries contain words with more than one meaning, the most widely applied model is Ontolex, which makes it possible to represent this kind of ambiguity.
In this context, previous work by the authors has addressed the semantic representation of enriched legal terminologies (Martín-Chozas, Vázquez-Flores et al. 2022). The term terminology enrichment refers to the generation of complex terminologies from a corpus. With this objective in mind, it has been observed that most current terminology extraction tools return plain lists of terms, as is the case of TermSuite (http://termsuite.github.io/), TermoStat Web (http://termostat.ling.umontreal.ca/) and FiveFilters (https://www.fivefilters.org/term-extraction/), or at most add translations or contextual information, as do Tilde’s Terminology platform (https://term.tilde.com/; Gornostay 2010) and SketchEngine (https://www.sketchengine.eu/).
Terminology enrichment is intended to alleviate such flatness by enriching automatically extracted terms with unambiguous information from existing resources. This information can be translations, synonyms, usage examples, relationships between terms (both hierarchical and of other types) and contextual information. To maintain the traceability of the information, it is therefore important to choose a model that keeps track of the sources of the information collected. For this purpose, it is possible to use SKOS-XL, a further development of the SKOS vocabulary. This model treats labels as classes rather than as pure literals, understood in this context as raw strings of text. This improvement allows extra metadata to be added, such as the source. Figure 5 shows an example of an enriched entry modelled in SKOS-XL. This representation proposal makes use of other vocabularies, such as DublinCore (http://purl.org/dc/terms/), to model the source of the term and its frequency, and Creative Commons (http://creativecommons.org/ns), to model information about the jurisdiction (details on this in Section 3.4.1). Table 3 describes every class and property used in the diagram.
Class/Property | Description |
---|---|
skos:Concept | skos:Concept is the central element of the model. It is represented by a URI, composed of the base URI of the resource plus the concept ID. In this case: http://mysampleuri/collective-agreement |
skos:inScheme | In SKOS, concepts are grouped in schemes, which can be considered subdomains. That is, if we are modelling a legal terminology, we could create different schemes, such as a labour law scheme, a contract law scheme, an industrial law scheme, etc. In this case, it points to http://mysampleuri/labourlawscheme, which is a class with two attributes: label and source. |
skos:broader | This property is used to represent the broader concept of a term, therefore pointing to another concept, in this case: http://mysampleuri/agreement. The relation is hierarchical. |
skos:narrower | This property is used to represent the narrower concept of a term, therefore pointing to another concept, in this case: http://mysampleuri/intracompany-collective-agreement. The relation is hierarchical. |
skos:related | This property is used to represent the related concept of a term, therefore pointing to another concept, in this case: http://mysampleuri/temporary-agreement. The relation can be of any kind. |
skos:closeMatch | This property indicates that the concept has a close equivalent in another resource. It is normally used in semi-automatic processes, since it is more flexible than its sister property skos:exactMatch. In this case, we find a similar concept in EuroVoc: http://eurovoc.europa.eu/194 |
skos:definition | This property represents the definition of a concept, which is modelled as a literal. |
skos:example | This property is used to represent the context of the term. In this case, the example is an excerpt of the source corpus of the term. It points to a literal. |
skos-xl:prefLabel | This is an evolution of skos:prefLabel, which is used to express the main label of a term in different languages. While skos:prefLabel points to a literal, skos-xl:prefLabel points to an instance of a label class, which allows extra information to be represented. |
skos-xl:altLabel | This is an evolution of skos:altLabel, which is used to express alternative labels of a term (synonyms, acronyms, etc.) in different languages. While skos:altLabel points to a literal, skos-xl:altLabel points to an instance of a label class, which allows extra information to be represented. |
cc:jurisdiction | This property belongs to the Creative Commons vocabulary and, in this case, it is used to describe the jurisdiction to which the term applies. |
dcterms:Frequency | This class belongs to the Dublin Core vocabulary and, in this case, it is used to describe the frequency of the concept in the whole corpus. |
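To make the contrast between a literal label and a SKOS-XL label resource more concrete, the following Python sketch emits the triples for one enriched label. It is only an illustration under stated assumptions: the concept URI is the sample one from Table 3, while the label URI, the source URI and the choice of dcterms:source to record provenance are hypothetical and may differ from what Figure 5 shows.

```python
from dataclasses import dataclass

SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class XLLabel:
    """A label modelled as a resource, so that metadata can be attached to it."""
    uri: str      # hypothetical URI minted for the label itself
    literal: str  # the raw string that plain SKOS would store directly
    lang: str
    source: str   # hypothetical provenance URI (assumed dcterms:source)

    def triples(self, concept_uri):
        """Return N-Triples lines linking the concept to this label.

        The literal is not escaped, so it must not contain quotes.
        """
        return [
            f"<{concept_uri}> <{SKOSXL}prefLabel> <{self.uri}> .",
            f"<{self.uri}> <{RDF_TYPE}> <{SKOSXL}Label> .",
            f'<{self.uri}> <{SKOSXL}literalForm> "{self.literal}"@{self.lang} .',
            f"<{self.uri}> <{DCT}source> <{self.source}> .",
        ]


label = XLLabel(
    uri="http://mysampleuri/collective-agreement/label-en",
    literal="collective agreement",
    lang="en",
    source="http://example.org/source-corpus",
)
for line in label.triples("http://mysampleuri/collective-agreement"):
    print(line)
```

With plain SKOS only the third triple's string would exist, attached directly to the concept; the extra label resource is what gives the source statement something to point from.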
6.Benefits of Linked Data for terminology resources
Previous work in this field has already set out the advantages of Linked Data for Language Resources (Bosque-Gil, Gracia et al. 2018; Chiarcos, Hellmann, and Nordhoff 2012; Chiarcos, McCrae et al. 2013; Cimiano, McCrae et al. 2015). In this section, we list the most important benefits pointed out in these works, complementing them with specific advantages for legal terminological assets:
- Interoperability: It is defined as the interaction between different elements, where there is an exchange of information or knowledge to obtain a common benefit (Wegner 1996). Interoperability is the main advantage of Linked Data, and its absence is one of the main issues of current legal terminological resources. Previous work by the authors (Martín-Chozas 2018) includes a survey amongst professional legal translators, which found that the legal language resources they use most in their daily activities are published in physical formats (such as Black’s Law Dictionary), closed formats (such as the International Monetary Fund glossary, in PDF) or non-queryable formats (such as the United Nations Terminology Database, in HTML). This creates interoperability issues amongst resources, and querying them can become a cumbersome task. RDF structures allow access to different resources from a single entry point, easing the search for information.
- Unambiguity: The first of the Linked Data principles states that every resource, such as a term in a terminology, has a unique identifier (URI), which makes the resource uniquely and globally identifiable. These URIs provide unambiguous results that are readable both by machines and by humans through a web browser. In legal terminology, URIs representing terms are especially useful, since homonymy is one of the main problems in legal translation (Alcaraz Varó, Hughes, and Gómez 2002).
- Linking and integration: Thanks to the identification of elements in a resource with URIs, as mentioned above, it is possible to link and integrate different resources in pursuit of interoperability. Even if those resources are structured with different RDF vocabularies, we can make connections amongst them and establish matches between their URIs. In this manner, from one entry in one resource, we are able to obtain knowledge from several entries in several resources by navigating through the links.
- Unique access point: Such integration allows several language resources to be published in a single container and accessed from a single access point. Query access over distributed resources makes the data easier to exploit, store and maintain, reducing the existence of data silos. An example of such integration is the EU Vocabularies SPARQL Endpoint (https://op.europa.eu/en/advanced-sparql-query-editor), which connects all the RDF resources published by the European Union; on a smaller scale, the Lynx project pursues the same objective through the Lynx Terminology platform (http://lkg.lynx-project.eu/kos).
- Metadata: According to the Harvard Law School (https://hls.harvard.edu/dept/its/what-is-metadata/), metadata is information stored within a document that is not evident by just looking at the file; it is also described as a fingerprint. RDF allows unlimited metadata records to be added, enabling the fine-grained description of every single resource. Some relevant metadata fields include provenance, jurisdiction, authorship, creation dates or information on validity.
There is no better way to show the benefits of linked data for legal terminology than through examples. The following section illustrates the advantages mentioned above through a series of queries.
7.Hands on
Practical examples
The following are queries performed on the EU Vocabularies SPARQL Endpoint. We start with simple queries, adding more elements on each iteration. For instance, let us check whether the English term sick leave is contained in EuroVoc. To do that, we would use a query such as the one in Listing 3, where we ask to which skos:Concept the label sick leave belongs, taking into account that it must be of the type http://eurovoc.europa.eu/schema#ThesaurusConcept.
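As a rough illustration of how such a lookup could be issued programmatically, the Python sketch below holds a query of this kind and a small helper that sends it over the standard SPARQL protocol. It is not the chapter's Listing 3: the endpoint URL, the ev: prefix name and the assumption that the term is stored as an English skos:prefLabel are ours.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of the EU Publications Office SPARQL service.
ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql"

# Which EuroVoc thesaurus concept carries the English label "sick leave"?
FIND_CONCEPT = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ev: <http://eurovoc.europa.eu/schema#>

SELECT DISTINCT ?concept WHERE {
  ?concept a ev:ThesaurusConcept ;
           skos:prefLabel "sick leave"@en .
}
"""


def run_select(query, endpoint=ENDPOINT):
    """Send a SELECT query via HTTP GET and return the list of bindings."""
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_select(FIND_CONCEPT)` needs network access; each returned binding is a dictionary keyed by variable name, so the URI would be read from `binding["concept"]["value"]`.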
Executing the query in Listing 3 binds the variable ?concept to the following URI: http://eurovoc.europa.eu/102. At this point, we have checked that this term is contained in EuroVoc. Extending the same query, we can retrieve its translations, for instance in German, French and Spanish, as shown in Listing 4. In this query, we add the elements OPTIONAL and FILTER. The former is used to ask for information that may or may not be available in the queried resource, while the latter is used to filter data, in this case by language.
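The interplay of OPTIONAL and FILTER can be sketched by generating one optional block per language. The Python function below is an illustrative reconstruction rather than the chapter's Listing 4; the variable names follow the ones used in the text (?prefDE, ?prefES, ?prefFR), and the use of skos:prefLabel for the English term is an assumption.

```python
LANGS = ("de", "es", "fr")


def translations_query(term, langs=LANGS):
    """Build a SELECT query returning the concept and one label per language."""
    variables = " ".join(f"?pref{lang.upper()}" for lang in langs)
    # Each OPTIONAL keeps the row even when no label exists in that language;
    # the FILTER inside it restricts the label to the requested language tag.
    optionals = "\n".join(
        f"  OPTIONAL {{ ?concept skos:prefLabel ?pref{lang.upper()} . "
        f'FILTER (lang(?pref{lang.upper()}) = "{lang}") }}'
        for lang in langs
    )
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "PREFIX ev: <http://eurovoc.europa.eu/schema#>\n"
        f"SELECT ?concept {variables} WHERE {{\n"
        "  ?concept a ev:ThesaurusConcept ;\n"
        f'           skos:prefLabel "{term}"@en .\n'
        f"{optionals}\n"
        "}"
    )


print(translations_query("sick leave"))
```

Placing the FILTER inside each OPTIONAL group matters: a filter outside the group would discard the whole row whenever a translation is missing.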
Therefore, in addition to the concept URI, as shown in Listing 3, we obtain the following translations:
- ?prefDE = Erwerbsunfähigkeit
- ?prefES = baja por enfermedad
- ?prefFR = congé de maladie
As shown in Table 3, SKOS uses altLabel to represent synonyms. We can therefore add to our query optional conditions to retrieve synonyms, as shown in Listing 5.
Consequently, in addition to the already mentioned information, we now obtain synonyms for German and Spanish:
- ?altDE = Krankheitsurlaub
- ?altES = licencia por enfermedad
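When such results are consumed by a program, optional variables that found no match are simply absent from the result row. The sketch below assumes the W3C SPARQL JSON results format and arranges the values reported above by hand into one row; the helper name and the row itself are illustrative, not actual endpoint output.

```python
# One result row in SPARQL JSON results shape, built by hand from the values
# reported in the text. There is no "altFR" key because no French synonym
# was returned.
row = {
    "concept": {"type": "uri", "value": "http://eurovoc.europa.eu/102"},
    "prefDE": {"type": "literal", "xml:lang": "de", "value": "Erwerbsunfähigkeit"},
    "prefES": {"type": "literal", "xml:lang": "es", "value": "baja por enfermedad"},
    "prefFR": {"type": "literal", "xml:lang": "fr", "value": "congé de maladie"},
    "altDE": {"type": "literal", "xml:lang": "de", "value": "Krankheitsurlaub"},
    "altES": {"type": "literal", "xml:lang": "es", "value": "licencia por enfermedad"},
}


def labels_by_language(binding, langs=("DE", "ES", "FR")):
    """Collect preferred and alternative labels per language, None if unbound."""
    summary = {}
    for lang in langs:
        summary[lang] = {
            "pref": binding.get(f"pref{lang}", {}).get("value"),
            "alt": binding.get(f"alt{lang}", {}).get("value"),
        }
    return summary


for lang, labels in labels_by_language(row).items():
    print(lang, labels)
```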
At this point, we can go a step further and extend the query to retrieve conceptual relations by applying skos:broader, skos:narrower and skos:related. We can also ask for the preferred labels of those terms, as in Listing 6.
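A query over the three conceptual relations could be assembled as in the following Python sketch. It is a reconstruction under assumptions, not the chapter's Listing 6: the concept is fixed by its URI through a VALUES clause, and ?naprefEN is a hypothetical variable name chosen by analogy with ?brprefEN and ?reprefEN.

```python
CONCEPT = "http://eurovoc.europa.eu/102"  # URI reported in the text

# SKOS relation -> prefix of its label variable ("na" is a hypothetical name).
RELATIONS = {"broader": "br", "narrower": "na", "related": "re"}


def relations_query(concept_uri=CONCEPT, lang="en"):
    """Build a query for related concepts and their preferred labels."""
    variables, optionals = [], []
    for relation, prefix in RELATIONS.items():
        target = f"?{relation}"
        label = f"?{prefix}pref{lang.upper()}"
        variables += [target, label]
        # Each relation is optional, so a missing one leaves its variables
        # unbound instead of removing the whole result row.
        optionals.append(
            f"  OPTIONAL {{\n"
            f"    ?concept skos:{relation} {target} .\n"
            f"    {target} skos:prefLabel {label} .\n"
            f'    FILTER (lang({label}) = "{lang}")\n'
            f"  }}"
        )
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        f"SELECT ?concept {' '.join(variables)} WHERE {{\n"
        f"  VALUES ?concept {{ <{concept_uri}> }}\n"
        + "\n".join(optionals)
        + "\n}"
    )


print(relations_query())
```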
The query in Listing 6 adds two more pieces of information: a broader term and a related term. In EuroVoc, this concept does not seem to have a narrower relation; therefore, the variable ?narrower does not return any value. For the other conceptual relations, we retrieve the following data:
- ?broader = http://eurovoc.europa.eu/108
- ?brprefEN = leave on social grounds
- ?related = http://eurovoc.europa.eu/175
- ?reprefEN = illness
To summarize, with a single query such as the one in Listing 6 we can retrieve different kinds of information about a given concept, as listed below:
- ?concept = http://eurovoc.europa.eu/102
- ?prefDE = Erwerbsunfähigkeit
- ?prefES = baja por enfermedad
- ?prefFR = congé de maladie
- ?altDE = Krankheitsurlaub
- ?altES = licencia por enfermedad
- ?broader = http://eurovoc.europa.eu/108
- ?brprefEN = leave on social grounds
- ?related = http://eurovoc.europa.eu/1754
- ?reprefEN = illness
We can also filter concepts by the concept scheme they belong to. For instance, suppose we are interested in all the concepts that fall under the scheme Social Protection, which we know has the URI http://eurovoc.europa.eu/100214. We would ask for all the terms within that scheme, as shown in Listing 7. The first five results of this query are shown in Table 4.
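Filtering by scheme amounts to one skos:inScheme pattern. The Python sketch below builds such a query; it is an assumption-laden reconstruction rather than the chapter's Listing 7, and the ORDER BY and optional LIMIT are our additions so that a truncated result is deterministic.

```python
SOCIAL_PROTECTION = "http://eurovoc.europa.eu/100214"  # scheme URI from the text


def scheme_query(scheme_uri, lang="en", limit=None):
    """Build a query listing the concepts of a scheme with their labels."""
    label = f"?pref{lang.upper()}"
    query = (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        f"SELECT ?concept {label} WHERE {{\n"
        f"  ?concept skos:inScheme <{scheme_uri}> ;\n"
        f"           skos:prefLabel {label} .\n"
        f'  FILTER (lang({label}) = "{lang}")\n'
        "}\n"
        "ORDER BY ?concept"
    )
    if limit is not None:
        # LIMIT truncates the ordered result, e.g. to preview a few rows.
        query += f"\nLIMIT {int(limit)}"
    return query


print(scheme_query(SOCIAL_PROTECTION, limit=5))
```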
?concept | ?prefEN |
---|---|
http://eurovoc.europa.eu/1004 | welfare |
http://eurovoc.europa.eu/2605 | social-security benefit |
http://eurovoc.europa.eu/3751 | pension scheme |
http://eurovoc.europa.eu/4028 | social security harmonisation |
http://eurovoc.europa.eu/4050 | social security |
… | … |
Thanks to the links to other datasets, we can check whether a given concept appears in other resources. For instance, we can take as an example the term social security, from Table 4, and check if it has matches in other thesauri with the query shown in Listing 8. The results are listed in Table 5.
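One way to look for such matches is to follow the SKOS mapping properties from the concept and then group the returned URIs by host, which gives a rough idea of which external resources are linked. The sketch below is not the chapter's Listing 8: the choice of skos:exactMatch and skos:closeMatch is an assumption, and the sample URIs are hypothetical placeholders rather than actual matches.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Follow either mapping property from the "social security" concept.
MATCH_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?match WHERE {
  <http://eurovoc.europa.eu/4050> skos:exactMatch|skos:closeMatch ?match .
}
"""


def group_by_host(uris):
    """Group matched URIs by host name, a rough proxy for the target resource."""
    groups = defaultdict(list)
    for uri in uris:
        groups[urlparse(uri).netloc].append(uri)
    return dict(groups)


# Hypothetical match URIs, used only to show the grouping step.
sample = [
    "http://thesaurus-a.example.org/concept/12",
    "http://thesaurus-b.example.org/id/social-security",
    "http://thesaurus-a.example.org/concept/98",
]
print(group_by_host(sample))
```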
EuroVoc also offers, on some occasions, definitions for the terms. However, they are not very frequent. Therefore, if a definition is an important requirement for our terms, we can add that condition to the query, as in Listing 9. The results of this query are shown in Table 6.
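The difference between asking for a definition and requiring one comes down to whether the pattern is wrapped in OPTIONAL. The Python sketch below toggles between the two; it is illustrative rather than the chapter's Listing 9, and both the restriction to a concept scheme and the use of skos:definition with an English language tag are assumptions.

```python
def definition_clause(required):
    """Return the definition pattern, mandatory or wrapped in OPTIONAL."""
    pattern = (
        "?concept skos:definition ?defEN . "
        'FILTER (lang(?defEN) = "en")'
    )
    return f"  {pattern}" if required else f"  OPTIONAL {{ {pattern} }}"


def concepts_with_definitions(scheme_uri, required=True):
    """Build a query for the labels and definitions of a scheme's concepts."""
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?concept ?prefEN ?defEN WHERE {\n"
        f"  ?concept skos:inScheme <{scheme_uri}> ;\n"
        "           skos:prefLabel ?prefEN .\n"
        '  FILTER (lang(?prefEN) = "en")\n'
        f"{definition_clause(required)}\n"
        "}"
    )


# Mandatory pattern: concepts without a definition drop out of the results.
print(concepts_with_definitions("http://eurovoc.europa.eu/100214"))
```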
?concept | ?prefEN + ?defEN |
---|---|
http://eurovoc.europa.eu/6233 | care of the elderly: care that is designed to meet the needs and requirements of senior citizens at various stages. |
http://eurovoc.europa.eu/c_16e35fe6 | foster parent: adult that provides the care of a child without being the child’s parent or relative, or having parental responsibility for him. |
http://eurovoc.europa.eu/c_f5622f5f | active and assisted living: people living independently in their homes with the support of ICT-based solutions. |
8.Conclusions
This chapter has examined the use of Semantic Web techniques in the legal terminology domain. These techniques may seem little more than a way of formatting and publishing data, but they initiate a profound transformation. Resources are no longer locked in to a particular technology provider, and the same terminological asset can be used by different computer programs. Vast amounts of linked open data are ready to be used, constantly growing and adapting to a changing world. The Semantic Web opens a new universe to be explored, and the extra effort needed to adopt the W3C specifications pays off well.
The Semantic Web universe is not only a dream of academia. Public institutions have widely adopted these technologies with enthusiasm. European legislation is already identified with URIs (ELI – European Legislation Identifier), as are judgements (ECLI – European Case Law Identifier). Documents in the Eur-Lex portal are described with metadata represented in RDF, and the ELI Ontology is the basis for legal document description in most of the EU member states. Moreover, the very fabric of legislation is being transformed: Akoma Ntoso for EU (AKN4EU; https://op.europa.eu/es/web/eu-vocabularies/akn4eu/), a machine-readable structured format for the exchange of legal documents in the EU, has been adopted by the EU Publications Office to describe certain rules in a machine-readable form.
Terminological resources are no exception to this semantic evolution of the web. The EU Vocabularies main site (https://op.europa.eu/en/web/eu-vocabularies/) has adopted the Semantic Web by design, representing every resource as RDF, serving a SPARQL endpoint and publishing information according to the Linked Data principles. A vast number of documents in the Eur-Lex portal are described with descriptors from the EuroVoc thesaurus, which is a SKOS Concept Scheme. This chapter is both a description of Semantic Web technologies and an invitation to adopt them. Joining the virtuous circle of adoption brings benefits to you, but also to the community.
Acknowledgements
This work is framed within the COST Action NexusLinguarum: European network for Web-centered linguistic data science (CA18209), supported by COST (European Cooperation in Science and Technology).