Multilingual legal terminology databases: Workflows and roles
1.Introduction
Multilingual legal terminology databases (MLTDBs) are terminology databases (TDBs) containing legal terminology in more than one language. This chapter focuses on the specific features related to the legal domain and their influence on user expectations and usage scenarios, database structure, workflows, roles, and quality aspects of MLTDBs.
We provide a brief introduction to TDBs (Section 2) and address the specific features influencing the structure of MLTDBs, i.e., the number of legal systems catered for, different usage scenarios, target users, and user-database interaction (Section 3). Several activities and processes are required to create and populate MLTDBs. These are presented in a typical workflow: needs analysis, design and implementation of MLTDBs, documentation, term extraction, compilation of terminological entries, revision and quality assurance, maintenance, and dissemination (Section 4). As terminology work and the creation of MLTDBs are typically collaborative practices, the following section outlines the roles involved in terminology work in the legal domain: terminologists, revisers, terminology coordinators, legal experts, IT experts, and users (Section 5). The role of machines is discussed not only as tools supporting various steps in the workflow (e.g., term extraction, maintenance, dissemination) but also as tools that exploit terminological data for other purposes. The different users, their needs and requirements are also reflected in the quality aspects of MLTDBs. The concepts of quality and quality management are discussed in Section 6, outlining quality planning, quality assurance (QA), and quality control (QC), including relevant standards. The chapter concludes with an outlook on how tools might be further integrated into workflows and on synergies with natural language processing (NLP) and machine learning.
2.Terminology databases
A terminology database or termbase is “a database comprising a terminological data collection” (ISO 26162–1 2019, Clause 3.2.1; the same definition appears in ISO 30042 (2019), Clause 3.28, from which it was taken), i.e., comprising “a resource consisting of concept entries with associated metadata and documentary information” (ISO 26162–1 2019, Clause 3.2.4). According to Melby (2012, 8), a termbase is “a computer database consisting primarily of information about domain-specific concepts and the terms that designate them”. It provides “a structured repository of linguistic data, enriched with metadata and structured according to particular classification schemes and concept based analysis” (Steurs et al. 2015, 224). Unlike lexicographic products that follow a semasiological approach (e.g., legal dictionaries), terminographic products follow an onomasiological approach and are concept-oriented. The basic units of TDBs are terminological entries, which contain a set of related terminological data elements on a specific concept (Drewer, and Schmitz 2017, 128). This is reflected in the definition by Tamás and Sermann (2019, 113), who see TDBs as a collection of electronically stored terminological data that was created following an onomasiological approach by mapping the conceptual system of a subject field and which contains terms and their definitions relating to one or several subject fields in one or more languages.
TDBs are essential for managing terminology, as they are used for collecting, handling, structuring, sharing, publishing, and re-using terminology. They may serve different purposes, such as supporting the drafting of texts and technical writing, translation and localization, terminology planning and standardization, knowledge representation and management. According to their purpose and intended user group, they will be conceived differently and present different functionalities and user interfaces (Steurs et al. 2015, 227). We may distinguish between ad hoc personalized “termbases” that do not strictly follow the onomasiological approach, and which are generally created for translation purposes and integrated into translation environments, vs. large-scale, mainly thematic, strictly controlled stand-alone resources maintained by large institutions that are often known as “term banks” (Bowker 2015, 306–307). The complexity of TDB structure may vary from quite basic to very complex, with additional features such as links to texts or visualization of concept maps (Tamás, and Sermann 2019, 28). The more sophisticated TDBs are developing into terminological knowledge bases (TKBs), i.e., “knowledge repositories represented in a formal language that can be accessed by users via an expert system based on terminological units, which are organized into a conceptual network containing various types of relations” (Cabré Castellví 2006, 98). TKBs provide direct access to corpus data and ontologies, and often make data readable by machines in addition to human users.
Today, there is a wide choice of commercial or proprietary solutions to manage terminology in databases (Drewer, and Schmitz 2017, 142–143), including cloud-based solutions (Varga 2013), which may partly influence or constrain database structure (Kageura, and Marshman 2020, 66–67). Some allow smooth integration with authoring tools, CAT-tools, localization tools, knowledge management tools, etc.
TDBs may cover one or more languages, including language varieties and dialects (Matteucci 2006), thus being monolingual, bilingual, or multilingual (Melby 2012, 8). They may collect diverse types of concepts and terms, either focusing on one specific subject or domain (e.g., law, industrial automation) or on a wider range of content relevant to a given organization (e.g., EU terminology, company terminology) or activity (e.g., translation of legal documents, terminology development in a minority language, protection of intellectual property).
3.Multilingual Legal Terminology Databases
This chapter addresses multilingual – including bilingual – TDBs containing legal terminology. The salient characteristics of legal terminology, which influence (multilingual) terminology management in the legal domain, such as indeterminacy, variation, the relation to general language and other LSPs have been addressed in other chapters in this publication (see Biel, Jopek-Bosiacka, Mouritsen). In this section, we will focus on the specific features of MLTDBs. These may have specific characteristics according to the number of legal systems considered, their purpose, and target users.
3.1MLTDBs dealing with one legal system
A single legal system may be multilingual (e.g., the Swiss, South African, and EU legal systems). In this case, the legal concepts stored and organized in the TDB are expressed by designations in two or more languages. The conceptual level, i.e., legal institutions, rules, bodies, and the relations between them, is the same but is conveyed through more than one language (Gambaro, and Sacco 2009, 10). Conceptual characteristics and concept relations are shared across languages so that, for example, there will be only one reference concept system.
As all designations in a terminological entry refer to the same concept, in principle, most challenges reside at the language level (e.g., differences in designation length, complexity, transparency, connotation, etc.). For example, legal concepts may be first designated – or borrowed from another legal system – in one official language of a multilingual system and finding a designation in other official languages may not be straightforward. This may apply to EU concepts adapted from originally national concepts that need designations in the languages of all member states (e.g., “Advocate General” in EU law was originally taken from French law (Gombos 2014, 1)). It is also common in minority languages which must express legal concepts and rules usually developed in the majority language (Chiocchetti, and Ralli 2016). Challenges may also reside in managing or reducing synonymy and polysemy to foster unambiguous communication and in coping with intrinsic linguistic and structural differences between the languages of a legal system (e.g., while custodia, custody, in Italian law of obligations, can refer to both animate and inanimate entities, the official German designations in the province of Bolzano, where Italian and German are co-official languages, distinguish Verwahrung, referring to objects, from Beaufsichtigung, referring to living creatures such as livestock).
3.2MLTDBs dealing with two or more legal systems
MLTDBs may cover more than one legal system. In this case, the conceptual level will necessarily diverge to a smaller or greater extent, according to how close the legal systems are to each other and to possible harmonization efforts. A concept existing in one legal system may be totally unknown in another system. Concepts from different legal systems may have more or less comparable characteristics. The further apart the legal systems dealt with, the more challenging it is to find equivalent legal concepts (Cao 2007, 30–31; Pommer 2006, 43). There are systemic, linguistic, and cultural differences (Cao 2007, 23).
This is a consequence of every national and supranational legal system having its own specific set of rules and conceptual structures developed over time. Legal language and terminology express and reflect such specificities and are therefore system-bound (Cao 2007, 23–24; de Groot 1999, 12–17; 2002, 222; Pommer 2006, 18–19; Šarčević 1997, 13). “[D]ifferent legal languages have their own unique legal vocabulary” (Cao 2007, 20). While issues of missing or partial equivalence between concepts are not rare in the terminologies pertaining to the hard sciences, which are essentially universal, every country has its own legal system that differs to a lesser or greater extent from that of other countries (Arntz et al. 2014, 162; Drewer, and Schmitz 2017, 20). Consequently, full equivalence between concepts from distinct legal systems cannot be taken for granted (Cao 2007, 29; de Groot 1999, 21; Pommer 2006, 147; Sandrini 2014, 147; Šarčević 1997, 232). Therefore, when working with legal terminology, the notion of conceptual equivalence must necessarily be relativized (Arntz et al. 2014, 162; Pommer 2006, 147; Sandrini 2014, 148).
This implies that the legal conceptualization behind terms from two legal systems may differ even when it is expressed in the same natural language (Cao 2007, 33; de Groot 1999, 12; Gambaro, and Sacco 2009, 8; Sandrini 2014, 144). Arabic, English, French, German, Spanish are examples of languages that are used by more than one legal system in the world and thus have developed a set of legal languages where terminologies may vary or where the same terms may even designate completely distinct concepts. There are as many legal languages as there are legal systems using a specific natural language (de Groot 2002, 225–226) and the relation of legal terminology with its legal system is more crucial than that with its language (Sandrini 2014, 143). It is possible to create monolingual legal TDBs addressing more than one legal system, but the geographical and jurisdictional constraints of legal terminology must be specified even when working with a single natural language (see Section 4.2).
Consequently, terminology work for MLTDBs encompassing more than one legal system poses not only linguistic but primarily conceptual challenges. Working with legal terminologies from various legal systems is not just a linguistic task but also a legal task (Arntz et al. 2014, 163–170; de Groot 1999, 20) and requires a comparative legal and linguistic analysis (Prieto Ramos 2014, 125). MLTDBs must highlight not only linguistic but also conceptual similarities and differences, so that these can be considered in international communication (Sandrini 2009, 163) and translators have enough information to decide whether any differences are relevant for their specific target readership (de Groot 2002, 230–231; Sandrini 2009, 153), to name just two examples. This is generally achieved by applying methods borrowed from legal comparison to terminology work (inter alia Chiocchetti, and Ralli 2016; Künnecke 2013; Peruzzo 2014; Pontrandolfo 2018), especially micro-comparison (Ajani et al. 2018, 4; Del Giudice 2014, 19; Pommer 2006, 85; Zweigert and Kötz 1996, 4–5).
Micro-comparison enables us to spot analogies and discrepancies between legal norms or concepts from different legal systems (Ajani et al. 2018, 8; Del Giudice 2014, 19) and thereby supports the process of finding equivalents, i.e., concepts with the same conceptual characteristics or – more commonly in legal and terminological practice – narrower, broader, or overlapping concepts (Arntz et al. 2014, 145–148). The comparison may be systematic, i.e., concerning the concepts of an entire legal subdomain (e.g., contract law, family law), or ad hoc, i.e., focusing on a specific legal concept in a given communicative situation or text (Sandrini 2009, 158–159). The more legal systems are considered, the more complex the task of finding a common conceptual core for the legal concepts from each legal system becomes, so that termbase content might have to be limited to a relatively narrow set of wider or rather generic concepts. Furthermore, to reduce the number of comparative analyses, one legal system may be defined as the reference or source legal system in an MLTDB and others as target systems.
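The distinction between full equivalents and narrower, broader, or overlapping concepts can be illustrated with a minimal sketch in which each concept is reduced to a set of conceptual characteristics. This is a deliberate simplification: the names `Equivalence` and `compare_concepts` are hypothetical, and actual micro-comparison relies on expert legal analysis rather than on set operations.

```python
from enum import Enum


class Equivalence(Enum):
    """Hypothetical picklist for a 'degree of equivalence' data category."""
    FULL = "full"
    NARROWER = "narrower"
    BROADER = "broader"
    OVERLAPPING = "overlapping"
    NONE = "none"


def compare_concepts(source: set[str], target: set[str]) -> Equivalence:
    """Classify the target concept relative to the source concept.

    Each concept is modelled as the set of its conceptual characteristics
    (its intension). A concept with more characteristics is narrower.
    """
    if source == target:
        return Equivalence.FULL
    if not source & target:
        # No shared characteristics: no usable common conceptual core.
        return Equivalence.NONE
    if source < target:
        # Target has every source characteristic plus further ones.
        return Equivalence.NARROWER
    if target < source:
        # Target lacks some of the source characteristics.
        return Equivalence.BROADER
    return Equivalence.OVERLAPPING
```

In this sketch, a target concept that shares all characteristics of the source concept and adds further ones has a larger intension and is therefore classified as narrower; the result could be stored in a data category such as degree of equivalence.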
Terminological entries in an MLTDB show the overlapping areas between legal concepts, possible differences at conceptual level (e.g., broader or narrower intension) and at linguistic level (e.g., differences in register or use) and also warn against false friends, i.e., designations that only superficially seem to refer to comparable concepts. One particular challenge is dealing with the multidimensionality of legal terms that may be embedded in several national and supranational legal systems and show significant variation at the designation level (Peruzzo 2014, 262). Another challenge is the lack of conceptual equivalence. When there are no sufficiently overlapping concepts in the target legal system or when there is no designation for the concept under analysis, a loan word is used or a new term is coined and proposed (Cao 2007, 55–56; de Groot 1999, 27). In this context, it is important that MLTDBs collect established terms (Biel 2008, 26; Molina, and Hurtado Albir 2002, 510) to foster unambiguous communication by presenting terminology that is already routinely used and understood by the speech community.
Owing to these specific challenges, terminological entries in MLTDBs may contain data categories such as geographical usage or legal system, degree of equivalence, and notes with comparative information, as well as picklist values such as proposed term or translingual borrowing, that do not feature in TDBs in other domains or are used less extensively there. There may also be more than one reference concept system or ontology, one for each legal system considered (Id-Youss 2016).
3.3Usage scenarios
MLTDBs can be employed in different scenarios and for a range of purposes. They make practitioners of the law and language mediators aware of the differences between legal systems at linguistic and/or conceptual level and keep them up to date with changes, thereby informing (multilingual) text drafting and translation. They support the growing harmonization efforts in a globalized world (Ajani et al. 2018, 10; Gambaro, and Sacco 2009, 27–28; Grass 2014, 104–105) and an increasingly globalized legal discourse (Gotti 2009). They can also be employed to structure legal knowledge and optimize monolingual and multilingual content management and information retrieval or serve as a knowledge repository for Artificial Intelligence (AI) applications or the Semantic Web (see Section 4.8). They may represent a reference point for language standardization, e.g., as a support tool for developing a minority language and/or for disseminating the results of official standardization bodies.
As a consequence, the structure and content of MLTDBs may vary according to their intended purpose. For example, legal phraseology is considered important in translation-oriented termbases, as it is a major aspect of difficulty for language mediators (Biel 2014, 182; Chromá 2014, 134; Grass 2014, 108–109) but is likely to be absent from termbases intended mainly for knowledge structuring and knowledge representation. Data categories such as grammatical gender may not be needed in MLTDBs focusing on concept harmonization, while such information may be desirable for work on some minority languages. Information on term frequency may be relevant for translation or language standardization work but not for knowledge structuring, while concept relations will be essential for the latter and additional information for the former. There might be one definition or reference concept system per each legal system considered or just a single one when aiming, for example, to standardize a minority language within a single legal system.
3.4Target users
The target users of MLTDBs influence the choice and structuring of termbase content. Therefore, detailed analyses of expectations and requirements should be performed for every user group (Nielsen 2014, 154–160; see Section 4.1). Traditionally, there are three main target user groups: language mediators, legal experts, and the general public (Sandrini 2014, 144). Further non-human applications as users of termbase content have emerged in recent years (see Section 5.7).
Language mediators like translators and interpreters have different needs and expectations compared to legal experts in terms of the content and functions of an MLTDB (Chromá 2014, 130; Peruzzo 2018). Professional language mediators mainly need to understand the meaning of a term, look for equivalents, check the adequacy of presumed equivalents, or look for alternative translations (Nord 2002, 133–134). For them, essential features of TDB content are the presence of (clear) definitions, possibly in all languages; examples of use, possibly from real texts; phraseology; and abbreviations and acronyms. Domain labels, semantic information (e.g., synonyms), usage labels, images, and a range of equivalents with related explanations are desirable features (Durán Muñoz 2012, 144). They also consider lookup speed, the number of precise hits, and a wide range of entries to be top features (Vasiljevs et al. 2010).
By contrast, an analysis of what legal experts value most in MLTDBs lists information on reliability, the presence of one or more definition(s) and context(s), the possibility of selecting a specific (sub)domain, the level of specialization and precision of information, a clear layout, the presence of links to reliable and official sources, and information on the last update (Peruzzo 2018, 97). Legal experts also appreciate defining contexts that contain additional conceptual information. Regular checks on whether contents are still valid as well as many references to legal sources are of paramount importance for them (Peruzzo 2018, 98–99).
Legal experts may expect linguistic information that a linguist would not need, e.g., information on the use of loan words and on pronunciation (Peruzzo 2018, 100). Conversely, information on language register might be superfluous for a domain expert but essential for a translator. Finally, the type and depth of definitions required by linguists and legal experts may diverge. While legal experts value legal definitions from normative texts or official sources, language mediators would probably opt for more explanatory and less technical or less obscure definitions (Peruzzo 2018, 102–103; Vanden Bulcke, and De Groote 2016, 27).
Consequently, Bestué (2019, 140–141) proposes an entry structure targeted at legal translators based on the functions of the target text, while Peruzzo (2018, 101–102) describes one better suited for legal experts. When MLTDBs address both language and domain experts, one of the challenges is striking a balance between the needs of diverse user groups. This may be solved, as the example of TDBs in other domains shows (Vezzani et al. 2018), by granting access to targeted types of data and a growing depth of information through different interfaces.
3.5Interaction with users
A good way of ensuring that MLTDBs cater for users’ needs (see Section 4.1) is encouraging their direct involvement in terminology work or regular input and feedback. A traditional approach to user involvement is providing a terminology query service that answers user questions and may exploit these to direct terminology work to specific subject fields or just update existing entries after ad hoc searches triggered by user questions, a clear “win-win situation” (Dobrina 2010, 93) for both parties. Feedback and input forms on published entries provide another way of interacting with users (Ralli, and Andreatta 2018, 30) and may be a way of collecting information or terms that need to be inserted in the termbase (see Section 4.4). Collaborative terminology work between language and legal experts (Chiocchetti, and Wissik 2018) or wiki-style collaborative content creation enabling multiple users to create, edit, search, and consult term entries (Kageura, and Marshman 2020, 72) are more direct approaches to user involvement.
Collaboration in terminology work (see Section 5) reflects the growing popularity of peer-to-peer resources in all domains, including translation. For example, forums on websites like ProZ (https://www.proz.com/ask, accessed July 13, 2022) or Translators Café (https://www.translatorscafe.com/tcterms, accessed July 13, 2022) are widely used by language mediators (Biel 2008, 32). Gathering contributions from users has significant advantages, as it potentially allows large-scale content creation and may reach otherwise inaccessible experts of a very specific domain. There are disadvantages, however, as the quality, consistency, and coverage of the contributed data may not be constant or sufficient (Kageura, and Marshman 2020, 73), while traditional approaches leave terminologists in full control of the data. There are also some crowdsourcing approaches in terminology (e.g., Cauna 2018, 51; Karsch 2015). Overall, user participation and interaction are considered to be insufficiently integrated into the design of TDBs in general (Vasiljevs et al. 2010). In other words:
We are facing […] a general dilemma in terminology management. On the one hand, we need mechanisms to catch up with the rapid growth of terminologies. Manual and/or in-house elaboration is not sufficient. On the other hand, we have not yet established proper quality control in terminology management that can work in large-scale automatic or collaborative environments. (Kageura, and Marshman 2020, 73)
Finally, the growing role played by machines as new types of terminology users (see Section 5.7) implies that MLTDBs must also fulfill specific technical requirements to ensure data interoperability and reusability.
4.Workflows
Terminology work is complex and consists of a series of steps that may be carried out in sequence or in various loops. What we call workflow in this chapter – a “specified way to carry out an activity or a process” – is also known as procedure (ISO 9000 2015, Clause 3.4.5).
There are many ways to systematize the different activities and processes involved in terminology work because they depend highly on factors such as the type of terminology work and the setting where they are performed (institution vs. company). Consequently, there are many models for terminology workflows described in the literature, from very generic ones to specific ones (inter alia Arndt et al. 2020; COTSOES 2002; Lušicky, and Wissik 2015; Popiołek 2015). Since legal terminology work has its own particular features (see Sections 3.1 and 3.2), there are also specific workflow models describing this type of terminology work (Chiocchetti et al. 2013, 2017). A prototypical terminology workflow in the legal domain comprises the following steps: needs analysis, design, documentation, term extraction, compilation of terminological entries (with contrastive analysis and micro-comparison), revision and QA, maintenance, and dissemination (Chiocchetti et al. 2013, 2017).
Most workflow descriptions focus on collecting (legal) terminology and do not include the conceptualization and creation of the termbase. This step happens prior to the terminology workflows described above (inter alia Drewer, and Schmitz 2017, 99ff; Simonnæs 2018, 126ff; Schmitz 2020, 11) or after the needs analysis step. Several decisions regarding the design, data model, and data categories have to be made (Schmitz 2020, 11–17). Most of them concern all types of termbases (e.g., applying the concept-oriented approach and the principle of granularity of data categories, testing the prototype, and revising the data model), but some features, which are addressed in Section 3, are specific to legal termbases. These are reflected in the data model of MLTDBs and the data categories used. In Section 4.2 we describe the steps that are specific to MLTDBs.
4.1Needs analysis
The first step in the workflow is the needs analysis. Needs analyses can be described as the systematic process of identifying and assessing needs in a certain community or situation. A need can be defined as a gap between the current situation and the desired situation. The following actions are essential for assessing the needs in legal terminology work: describing the current situation or problem, determining the desired situation, defining one or more possible approaches to the problem, and implementing one or more solutions (Chiocchetti et al. 2013, 14–16). During this step, the type of terminology work (e.g., ad hoc, proactive, systematic) and the specific activities within the terminology workflow required to solve the current problem must be defined. The time frame must also be set. A possible need for systematic multilingual legal terminology work is the translation of the EU’s acquis by an accession candidate. Another possible need triggering the revision of multilingual legal terminology is a legal reform in a specific subdomain, e.g., criminal procedure law, within a bilingual country. The needs analysis can also include an analysis of the requirements (see Section 6) regarding the design and implementation of an MLTDB, as described in Section 4.2, if the database does not exist yet.
4.2Design and implementation of MLTDBs
Before a termbase can be filled with legal terms, decisions on the design and data model have to be taken according to the results from the requirements analysis in the previous step. The first decision is choosing the legal systems and languages considered in the MLTDB. Since different legal systems may use the same official language (e.g., German is used in Germany, Austria, Switzerland, Luxembourg, Liechtenstein, Belgium, and Italy), it is important that MLTDBs specify their geographical and jurisdictional constraints (Nielsen 2014, 161). This is often realized through the data category geographical usage or legal system at term level.
The next decision consists in selecting the relevant data categories. Information in an MLTDB is recorded in terminological entries, which are subdivided into data categories. Data categories, e.g., definition, source, context, etc., can be seen as a “generalization of the notion of a field in a database” (ISO 12620-1 2022, Clause 3.2). Data category specifications for terminological resources are standardized according to ISO 12620-1 (2022). Due to the specific challenges outlined in Sections 3.1 and 3.2, an MLTDB might need specific data categories that do not feature in other termbases, such as legal system, degree of equivalence, or notes with comparative information. After the data categories have been decided on, their type has to be established (e.g., open or closed data category) and they must be associated with one of the three levels (concept level, language level, and term level). According to terminological principles, the definition should be at the concept level, since it describes the concept. However, owing to the specific features of legal terminology, definitions are often present at language level in MLTDBs. There can be one definition for each legal system with the associated source.
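The three-level structure described above can be sketched as nested records. The following is a minimal illustration, not a standard-conformant data model: all class names, field names, and picklist values are assumptions made for the example, with definitions placed at language level and tagged with their legal system, as is often the case in MLTDBs.

```python
from dataclasses import dataclass, field

# Hypothetical picklist values for a closed data category at term level.
TERM_STATUS = {"established term", "proposed term", "translingual borrowing"}


@dataclass
class Term:
    """Term level: one designation and its term-related data."""
    designation: str
    legal_system: str  # geographical usage / legal system, e.g. "IT"
    status: str = "established term"

    def __post_init__(self) -> None:
        # A closed data category only accepts values from its picklist.
        if self.status not in TERM_STATUS:
            raise ValueError(f"unknown picklist value: {self.status!r}")


@dataclass
class Definition:
    """Open data category (free text) with source and legal system."""
    text: str
    source: str
    legal_system: str


@dataclass
class LanguageSection:
    """Language level: definitions (one per legal system) and terms."""
    language: str
    definitions: list[Definition] = field(default_factory=list)
    terms: list[Term] = field(default_factory=list)


@dataclass
class ConceptEntry:
    """Concept level: data shared by all languages in the entry."""
    entry_id: str
    subject_field: str
    comparative_note: str = ""
    sections: dict[str, LanguageSection] = field(default_factory=dict)

    def add_term(self, language: str, term: Term) -> None:
        """Attach a term to its language section, creating it if needed."""
        section = self.sections.setdefault(language, LanguageSection(language))
        section.terms.append(term)

    def terms_for(self, legal_system: str) -> list[str]:
        """Return all designations recorded for one legal system."""
        return [
            term.designation
            for section in self.sections.values()
            for term in section.terms
            if term.legal_system == legal_system
        ]
```

Recording the legal system on every term and definition makes it possible to filter an entry by jurisdiction even when several legal systems share a natural language.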
Once the data model has been created and implemented, the prototype can be tested with real data and real users and, if necessary, the data model can be reviewed and adapted.
4.3Documentation
Terminology work is mainly document-based (if sources are available as digital corpora, it is called corpus-based terminology work), even though domain experts – legal experts in the case of legal terminology work – can also be consulted as a source of information. In legal terminology work, the rules applying to collecting sources might deviate from standard rules in terminology work in other domains. First, the legal hierarchy of sources, which is country dependent, must be considered separately for every domain in question, as not all types of sources might be equally relevant in all domains (cf., e.g., the importance of international treaties for international trade terminology vs. local legislation for childcare facilities terminology). In addition, texts with different positions in the legal hierarchy (e.g., constitution vs. codes vs. decrees) might use different terminology to regulate the same issue. Therefore, the position of a document in the legal hierarchy is not necessarily the only crucial aspect when deciding whether it should be included in the source collection; its relevance for the specific aim of the terminological project should also be considered. Even though the terminology project might concern recent terminology, in legal terminology work texts written several decades ago might still be fundamental as a terminological reference (e.g., the ABGB, i.e., the Austrian Civil Code, dates from 1812 and is still in force). Additionally, translated texts with an official status such as international treaties or EU legislation might need to be included in the source collection (Chiocchetti et al. 2013, 17–20).
4.4Term extraction
A crucial task in terminology work is to identify and record the potential terms, i.e., candidate terms, for later input into the terminological resource in question. Term extraction can be done in different ways: manually, by reading the source material and excerpting candidate terms, or via (semi-)automatic term extraction from electronic documents, including corpora and translation memories, for example (on automatic term extraction, see also Marín Pérez in this volume).
Different approaches are used for (semi-)automatic term extraction: statistical, linguistic, and hybrid. The statistical approach applies statistical criteria to define the degree of termhood of candidate terms; the linguistic approach applies linguistic filtering techniques to identify specific syntactic term patterns (Heylen, and De Hertog 2015; Pazienza et al. 2005). A prerequisite for the linguistic approach is a part of speech tagged corpus. Hybrid approaches combine different methods to recognize terms (Pazienza et al. 2005). There are several term extraction tools – open source or commercial – that require different degrees of computational skills. Some tools can be used without programming skills (inter alia Kilgarriff et al. 2014).
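To make the statistical approach more concrete, the following minimal sketch scores single-word candidate terms by comparing their relative frequency in a specialized corpus with that in a general-language corpus. It is a simplified illustration under stated assumptions (a naive tokenizer, add-one smoothing, single words only) and not the method of any particular tool; the function names `tokenize` and `termhood_scores` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def termhood_scores(specialized: str, general: str) -> dict[str, float]:
    """Score each word of the specialized corpus by a frequency ratio.

    The score is the smoothed relative frequency in the specialized
    corpus divided by the smoothed relative frequency in the general
    corpus. Scores above 1 mark words that are more typical of the
    specialized corpus.
    """
    spec = Counter(tokenize(specialized))
    gen = Counter(tokenize(general))
    spec_total = sum(spec.values())
    gen_total = sum(gen.values())
    vocabulary = len(set(spec) | set(gen))

    scores: dict[str, float] = {}
    for word, count in spec.items():
        # Add-one smoothing avoids division by zero for unseen words.
        p_spec = (count + 1) / (spec_total + vocabulary)
        p_gen = (gen[word] + 1) / (gen_total + vocabulary)
        scores[word] = p_spec / p_gen
    return scores
```

Multi-word terms, which are frequent in legal language, would additionally require the syntactic patterns of the linguistic approach, which is why hybrid methods combine both.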
Extracting legal terminology poses a number of challenges that derive from the nature of the process itself. One challenge is that some legal terms might not occur very frequently in a legal text. Sometimes they only occur once in the title of a law, or in one paragraph, but they are still key terms in the specific domain (Wissik 2014, 127–129). Usually, statistical term extraction methods have problems extracting such rare terms. Another challenge consists in disambiguating legal terms from general language words, as some terms may be used in their ordinary meaning or with a legal meaning (Cao 2007, 21; Mattila 2012, 31). If manual term extraction is not done by a legal expert, some key terms might not be identified as legal terms. Likewise, automatic term extraction methods based on a comparison between specialized and general language corpora might fail to extract relevant terminology. If term extraction is (semi-)automatic, the lists of candidate terms have to be validated by terminologists or legal experts. Furthermore, information on terms to be included in a terminological resource might also come from other input, e.g., automatic captures of unsuccessful searches in the MLTDB or requests and feedback from users through a user feedback system (Arndt et al. 2020, 13; Chiocchetti et al. 2013, 21–22; Ralli, and Andreatta 2018, 30; see Section 3.5).
4.5Compilation of terminological entries
During this step, all the collected information, the terms, the definitions, the equivalents in other languages, etc., are inserted into their respective fields for the different data categories in the terminological entry as defined in the data model. In this phase, the methods borrowed from legal comparison as previously described (see Section 3.2) play an important role.
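As a rough sketch of what such a concept-oriented entry could look like as a data structure, the following example groups terms from different languages and legal systems under one concept. All class names, field names, and the sample data are hypothetical; an actual data model would follow the data categories defined for the specific MLTDB.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    text: str
    language: str      # language code, e.g., "de"
    legal_system: str  # hypothetical data category, e.g., "AT" for Austria
    definition: str = ""


@dataclass
class ConceptEntry:
    concept_id: str
    subject_field: str
    terms: list = field(default_factory=list)

    def add_term(self, term):
        self.terms.append(term)

    def terms_for(self, language, legal_system=None):
        """Return term strings for a language, optionally for one legal system."""
        return [
            t.text for t in self.terms
            if t.language == language
            and (legal_system is None or t.legal_system == legal_system)
        ]


# Illustrative data only; equivalence would have to be established
# through legal comparison.
entry = ConceptEntry(concept_id="c-001", subject_field="tenancy law")
entry.add_term(Term("Mietvertrag", "de", "AT"))
entry.add_term(Term("Mietvertrag", "de", "DE"))
entry.add_term(Term("contratto di locazione", "it", "IT"))
```

Keeping the legal system as a separate field next to the language makes it possible to record that the same designation in the same language belongs to different legal systems.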
4.6Revision and QA
Revision can be divided into formal revision (e.g., check whether the entry is complete, all the information is in the appropriate data category, hyperlinks are working), linguistic revision (e.g., spelling, new term suggestions), and content revision (e.g., correctness of the definition, equivalence). Formal revision is usually performed by terminologists or terminology database managers, linguistic revision by native speakers, and content revision by legal experts (Chiocchetti et al. 2013, 28–30). When developing new legal terms (e.g., when translating the acquis communautaire into the language of an accession candidate), legal experts are also often involved in linguistic revision, as they validate new term suggestions (Chiocchetti, and Wissik 2018, 144).
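Parts of formal revision lend themselves to automation. The sketch below assumes that entries are available as simple dictionaries and that the listed fields are mandatory; both are illustrative assumptions. The hyperlink check is purely syntactic and does not test whether a link actually resolves.

```python
from urllib.parse import urlparse

# Hypothetical set of mandatory data categories.
REQUIRED_FIELDS = ("term", "language", "definition", "source")


def formal_issues(entry):
    """Return a list of formal problems found in one entry (a dict)."""
    issues = []
    for name in REQUIRED_FIELDS:
        if not str(entry.get(name, "")).strip():
            issues.append(f"missing {name}")
    url = entry.get("source_url")
    if url:
        parsed = urlparse(url)
        # Syntactic check only: scheme and host must be present.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            issues.append("malformed source_url")
    return issues


draft = {
    "term": "legal aid",
    "language": "en",
    "definition": "   ",
    "source": "Example Act, Art. 1",
    "source_url": "www.example.org/law",
}
issues = formal_issues(draft)
```

Linguistic and content revision, by contrast, depend on human judgment and are not covered by checks of this kind.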
The quality assurance framework for MLTDBs considers the QA of the following four aspects: persons, processes, products, and services supported by dedicated technology (Chiocchetti et al. 2017, 168–179). For more detailed information on QA see Section 6.4.
4.7Maintenance
To keep an MLTDB serving its purpose, a set of proactive activities has to be performed. These activities can be classified as IT-related and content-related. IT-related activities include software updates, server updates, user interface enhancements, bug fixes, data back-ups, etc. Content-related activities include improving single terminological entries, deleting duplicate terms, merging terminological entries, performing global changes (e.g., after a spelling reform or a legal reform), and reorganizing terminological resources (e.g., adding new data categories) (Chiocchetti et al. 2013, 30–31).
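One content-related maintenance task, finding duplicate terms, can be sketched as follows. The example assumes entries stored as dictionaries with hypothetical keys and treats two terms as formal duplicates if they match after case folding and whitespace normalization within the same language.

```python
from collections import defaultdict


def find_formal_duplicates(entries):
    """Group entry ids by (language, normalized term); keep groups with more than one id."""
    groups = defaultdict(list)
    for entry in entries:
        normalized = " ".join(entry["term"].casefold().split())
        groups[(entry["language"], normalized)].append(entry["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


entries = [
    {"id": 1, "language": "en", "term": "Legal aid"},
    {"id": 2, "language": "en", "term": "legal  aid"},
    {"id": 3, "language": "it", "term": "legal aid"},
    {"id": 4, "language": "en", "term": "usufruct"},
]
duplicates = find_formal_duplicates(entries)
```

The result is only a list of candidates for merging. The same designation may legitimately refer to different concepts, for instance in different legal systems, so the decision to merge or delete remains with the terminologist.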
4.8Dissemination
Terminological data can be disseminated in different ways. Typically, terminological data are published in databases that are either publicly available on the internet or accessible only to a restricted audience, e.g., in an intranet. Furthermore, terminological data are also published as dictionaries or glossaries, online as well as in print format. Dissemination should take the different usage scenarios and target users of MLTDBs into consideration (see Sections 3.3 and 3.4 respectively), so that terminological data can be structured according to the needs of specific user categories and shared in line with target user expectations.
With these forms of publication, the data can be accessed and looked up by human users, but they often have the disadvantage that the data cannot be further processed. Since collecting, compiling, and maintaining terminological resources is very resource-intensive work, already existing resources should also be (re)used for applications in other domains, e.g., Natural Language Processing (NLP). An obstacle in reusing existing terminological data is that they are not findable, accessible, or interoperable. A way of enhancing the findability and interoperability of language resources such as terminological resources is publishing terminological datasets in standardized formats, e.g., TermBase eXchange (TBX) (ISO 30042 2019), Simple Knowledge Organization System (SKOS; https://www.w3.org/TR/skos-primer/, accessed July 13, 2022), or OntoLex-Lemon (https://www.w3.org/2016/05/ontolex/, accessed July 13, 2022), with their respective metadata in repositories and catalogs (inter alia CLARIN ERIC 2021; Lušicky, and Wissik 2019, 330–332).
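To give an impression of what publishing in a standardized format can involve, the following sketch serializes one concept as SKOS in Turtle syntax using plain string formatting. The concept URI, the labels, and the definition are invented; the sketch only handles single-line literals and uses only the SKOS properties skos:prefLabel and skos:definition. A production export would normally rely on a dedicated RDF library and a fuller mapping of data categories.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def turtle_literal(text, lang):
    """Return a language-tagged Turtle literal (single-line text only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def concept_to_turtle(uri, pref_labels, definition, definition_lang):
    """Serialize one concept with one preferred label per language."""
    lines = [f"<{uri}> a skos:Concept ;"]
    for lang, label in sorted(pref_labels.items()):
        lines.append(f"    skos:prefLabel {turtle_literal(label, lang)} ;")
    lines.append(
        f"    skos:definition {turtle_literal(definition, definition_lang)} ."
    )
    return "\n".join([SKOS_PREFIX, ""] + lines)


# Hypothetical URI and illustrative data.
turtle = concept_to_turtle(
    "https://example.org/mltdb/c-001",
    {"en": "lease contract", "de": "Mietvertrag"},
    'Contract granting the "use" of a thing for payment.',
    "en",
)
```

Because pref_labels is a dictionary keyed by language code, the sketch cannot emit more than one preferred label per language, which matches the SKOS expectation for skos:prefLabel.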
Another possibility is using methods from the field of Linked Data (LD) in the context of the Semantic Web to publish terminological resources (Cimiano et al. 2020; Martin-Chozas et al. in this volume). Linked Data “refers to interlinked collections of datasets published on the Web” (Cimiano et al. 2020, 4). The subset concerned with linguistic data is called Linguistic Linked Data (LLD). In recent years, a number of approaches have been proposed to publish terminological resources as LLD (Cimiano et al. 2015; Di Buono et al. 2020; McCrae et al. 2015; Montiel-Ponsoda et al. 2015; Rodriguez-Doncel et al. 2015). Applying the LD principles to publish (legal) terminological resources generally means using uniform resource identifiers (URIs) so that a particular terminological resource as well as a single term entry in the resource can be unambiguously identified. Consequently, people can look the resource up, get useful information about it, and also discover related resources (Cimiano et al. 2020, 4–5). Furthermore, the data can be processed and used in further applications; for example, terminological resources can be integrated into knowledge bases to provide the reader of a legal text directly with definitions of the terms used in that text, to annotate or classify legal texts, or to support question answering systems (Rehm et al. 2019).
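The idea of identifying a resource and each of its entries by URI can be sketched in a few lines. The namespace, the URI layout, and the "related" field are assumptions for illustration; a real deployment would define its own URI policy and serve the data over HTTP rather than from an in-memory dictionary.

```python
from urllib.parse import quote

BASE = "https://example.org/termbase/"  # hypothetical namespace


def entry_uri(resource_id, entry_id):
    """Mint a URI for one entry of one terminological resource."""
    return f"{BASE}{quote(resource_id, safe='')}/entry/{quote(entry_id, safe='')}"


def build_index(resource_id, entries):
    """Map each entry URI to its data so that it can be looked up by URI."""
    return {entry_uri(resource_id, e["id"]): e for e in entries}


entries = [
    {"id": "c-001", "label": "lease contract", "related": ["c-002"]},
    {"id": "c-002", "label": "tenant", "related": []},
]
index = build_index("tenancy law", entries)

# Looking up one entry and following its links to related entries.
start = entry_uri("tenancy law", "c-001")
related_labels = [
    index[entry_uri("tenancy law", rid)]["label"]
    for rid in index[start]["related"]
]
```

Percent-encoding the identifiers keeps the URIs valid even when resource or entry identifiers contain spaces or slashes.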
5.Roles
Legal terminology work includes cooperative as well as collaborative aspects. Cooperation focuses on reaching a common goal or creating a product through the division of labor, while collaboration implies a more intense and regular interaction in the form of group work (Chiocchetti, and Wissik 2018, 140). Thus, we can identify different roles in the workflow: those with terminology/language-related expertise, those with management-related expertise, those with legal expertise, and those with IT-related expertise (Chiocchetti et al. 2013, 40–53; Chiocchetti, and Wissik 2018; Lušicky, and Wissik 2015, 31–37). However, not only humans play a crucial role in the terminology workflow. In the last decade, machines (e.g., tools, machine learning algorithms and other AI applications) have become increasingly important, which is why this section looks at the roles of both humans and machines within the terminology workflow.
Regarding the roles of humans, a specific role is not necessarily bound to a single person but can be shared among several people in the team, and one person can be involved in terminology work in different roles (e.g., as terminologist and reviser).
5.1Terminologists
Terminologists are experts in compiling, maintaining, and disseminating monolingual and multilingual specialized vocabularies. They are familiar with terminology theory and practical terminology work and are involved in several steps in the terminology workflow. They compile monolingual or multilingual terminologies, they are involved in terminology planning activities, and they provide consulting and training activities. Furthermore, they define workflows in terminology work and evaluate terminology related tools and software products. They draft requirements and specifications for these tools and software products, and they can be involved in their further development as well as in the data modeling of MLTDBs.
Terminologists may perform the following tasks (inter alia Chiocchetti et al. 2013; RaDT 2004, 2020):
- Collecting and studying relevant source materials (in one or more languages);
- Creating concept systems;
- Extracting relevant designations from texts or corpora;
- Conducting contrastive analyses with other languages/legal systems to find equivalents;
- Compiling terminological entries (i.e., filling out all the required data categories and, if necessary, making translation proposals);
- Updating existing terminological entries;
- Reviewing terminological entries;
- Modeling and defining workflows;
- Cooperating in the planning, data modeling, and evaluation of terminology databases;
- Disseminating terminological resources;
- Assisting in standardization activities;
- Assessing the quality of terminological resources.
Furthermore, terminologists can work closely with legal experts in all or some of these activities.
Terminologists may also carry out the role of terminology (database) managers, who are responsible for the terminology management system and/or database, managing access rights, importing and exporting data, and backing up and archiving the data.
5.2Revisers
Revisers, also called reviewers, QA specialists, or QA evaluators, can be language experts (e.g., translators), legal experts, or terminologists, depending on the type of revision or quality assurance they are involved in. (According to ISO 17100 (2015), reviser and reviewer are not synonyms: a reviser is “a person who revises target language content against source language content” (i.e., comparing source and target) while a reviewer is “a person who reviews target language content” (i.e., monolingually). However, this distinction is not made systematically in all contexts.) They may perform formal quality checks, or revise the terminology linguistically or from a legal point of view. They may give feedback on the quality of terminological resources and document the revision or QA (inter alia Chiocchetti et al. 2013, 2017; Drewer et al. 2020, 5). For details on QA see Section 6.4.
5.3Terminology coordinators
Terminology coordinators, also called terminology project managers or terminology managers, have management-related expertise and project management skills and are familiar with terminology work. They are in charge of managing and coordinating terminology projects, terminology units, or language units. They are responsible for a team of people and oversee the whole terminology workflow. Furthermore, they coordinate all relevant activities throughout the entire workflow, facilitate the collaboration and cooperation between terminologists and legal experts, and are the main contact for external stakeholders (Chiocchetti et al. 2013, 44–46; Drewer et al. 2020, 5).
5.4Legal experts as domain experts
Legal experts, as domain experts, can have different roles in the terminology workflow (Chiocchetti et al. 2013, 46–48; Chiocchetti, and Wissik 2018, 142–146). They can act as consultants in various steps throughout the terminology workflow, participate as revisers in the review and QA phase, and be part of standardization committees; they are also often the end users of legal terminology products. Rarely do they act as terminologists, creating terminological entries on their own (Chiocchetti, and Ralli 2014).
5.5IT experts
IT experts have expertise in information technology and are responsible for administering, maintaining, developing, and enhancing tools for terminology work and related tasks (Chiocchetti et al. 2013, 49–51; Drewer et al. 2020, 5). Increasingly they are not only database experts but also come from the fields of NLP, machine learning, Semantic Web, and AI. They develop new tools and methods to enhance the terminology workflow, to enrich terminological resources automatically, and to explore new ways to disseminate terminological data.
5.6Users of legal terminology
Besides traditional users like translators, interpreters, legal experts, legal drafters, people working in public administrations or international organizations, and the general public (see Section 3.4), IT experts from the fields of NLP, machine learning, Semantic Web, and AI are increasingly becoming users of terminology with the aim to enhance algorithms and develop new tools and applications in different domains (e.g., text mining, machine translation, question answering, robotics).
5.7Machines
AI and machine learning methods play an increasing role in different areas of the terminology workflow, e.g., enhancing automatic term extraction (Marín Pérez in this volume), automatic definition extraction, or automatic enrichment of terminological resources. They also facilitate the automatic creation of ontologies and concept systems, as implemented, for example, in the WIPO Pearl termbase (Reininghaus 2018, 15–16). They can also play a role in streamlining revision, e.g., by predicting whether a candidate term will be approved based on previous data (Fleischmann 2021). Furthermore, terminological data can be used by machines in other areas such as machine translation, text mining, and robotics. In this context it has to be stressed that terminological data used by machines might have to meet different requirements (e.g., data in machine readable form, data elementarity, explicit data disambiguation) and other quality criteria (e.g., veracity of metadata) than terminological data used by humans.
6.Quality and MLTDBs
Since MLTDBs are embedded in multilingual legal communication with the intention to help harmonize terminology and to avoid misunderstandings that may result from the different terms and their interpretations by various stakeholders in these processes, the quality of MLTDBs is of utmost importance. Quality is neither an absolute nor an entirely objective variable but is ultimately determined by the stakeholders, users, and applications of MLTDBs. Quality refers to the “degree to which a set of inherent characteristics […] of an object […] fulfils requirements” (ISO 9000 2015, Clause 3.6.2). Requirements are expectations that can be generated by different stakeholders and can be a combination of implied, stated, and obligatory requirements. The requirements in the scope of the quality of an MLTDB are informed by the general methods of terminology management, quality management, translation quality management, and data quality management.
Multilingual legal communication is usually supported by translation services that can be provided internally or outsourced to external contractors. In both cases, MLTDBs play a crucial role in assuring the quality of the translation services. By adequately deploying an MLTDB, it can be ensured that terminology is used correctly and consistently in translations. The terminology rendered in the scope of translation services is often fed back into MLTDBs and therefore depends on translation quality management. Given that MLTDBs are data-comprised products, data quality management principles should be considered as well.
6.1Quality management
Quality management refers to the development of policies, goals, and processes to achieve these objectives (ISO 9000 2015, Clause 3.3.4). In practice, MLTDBs are embedded into multilingual legal communication and are therefore influenced by the quality management deployed by the organization in which they take place. This means that the quality policy of MLTDBs should be consistent with the overall quality policy of the organization.
Quality objectives are usually established for relevant functions or processes. MLTDBs are outputs of processes of an organization or several organizations (e.g., in the case of IATE, the European Union’s terminology database). On the one hand, they are products (databases) that implement requirements formulated along the lines of process quality, database data quality, and data model quality. On the other hand, they are also services, allowing for example querying, filtering, collaborative work, etc. Quality objectives ideally address both functions of MLTDBs.
The quality objectives regarding MLTDBs can be defined either as part of another product or service (e.g., translation service) or as specificities of an MLTDB. In both cases, the quality objectives are achieved through numerous sub-processes that follow the quality management framework outlined in ISO 9000 (2015, Clause 3.3.4): quality planning, quality assurance, quality control, and quality improvement.
6.2MLTDB and international standards
The primary purpose of standards is to define transparent and widely acknowledged conformity requirements. They increase the reliability, effectiveness, consistency, and efficacy of the product or service. It should be noted that there are neither international nor national standards that explicitly standardize legal terminology databases.
However, international collaboration on the standardization of terminology and terminology work in general enjoys a long history and has produced several international standards that are applicable to MLTDBs. At the international level, ISO Technical Committee ISO/TC 37 – Language and terminology oversees terminology standards. In the following, we give a short overview of the most relevant standards (for more information, see Kockaert, and Steurs 2015).
The core principles and methods of terminology and terminology work are specified in ISO 704 (2022) Terminology work – Principles and methods, and in ISO 860 (2007) Terminology work – Harmonization of concepts and terms. Data categories for terminological entries are specified in ISO 12620-1 (2022) Management of terminology resources – Data categories – Part 1: Specifications, and more in detail in ISO 12616–1 (2021) Terminology work in support of multilingual communication – Part 1: Fundamentals of translation-oriented terminography, including data categories relevant for the translation process and also for the deployment of MLTDBs. Specific requirements regarding terminology products and services are addressed in ISO 22128 (2008) Terminology products and services. Another useful standard dealing with terminology standardization is ISO 15188 (2001) Project management guidelines for terminology standardization. The requirements regarding data modeling and the realization of terminological data interoperability are outlined in ISO 30042 (2019) Management of terminology resources – TermBase eXchange (TBX). All published standards and standards under development are listed in the ISO catalog (https://www.iso.org/committee/48104/x/catalogue/p/0/u/1/w/0/d/0, accessed July 13, 2022).
6.3Quality planning
Quality planning entails setting the quality objectives and processes that are needed to achieve these objectives. Quality planning can be strictly focused on MLTDBs (e.g., a certain number of entries that will be checked by legal experts), but can also be implemented in the vertical quality planning of the organization (e.g., the technical infrastructure that will be available). Quality planning is ideally performed simultaneously and in combination with the overall terminology planning process. The planning should be adapted to the requirements of legal terminology work and workflows (see Section 4.2) and to specific MLTDB requirements.
The scope and depth of quality planning generally depend on the complexity and number of requirements, as well as on the outputs of other processes. Requirements of preceding outputs should be formulated and planned in consideration of subsequent processes. For example, if terminological ontologies have been deployed as a means for concept clarification and conceiving concept definitions (Madsen, and Erdman Thomsen 2015, 269), the conformity requirements and the quality characteristics of terminological ontologies may influence the quality of MLTDBs.
The output of the quality planning process is a quality plan that includes a quality manual with detailed specifications, procedures (e.g., how to carry out a terminological activity), and qualitative and quantitative characteristics of an MLTDB, etc. A quality manual is often used or realized as a user manual, e.g., IATE’s User’s Handbook.
The characteristics of an MLTDB can be organized along the following categories into quality criteria (Arndt et al. 2020, 22–23; Chiocchetti et al. 2013, 28–29; Lušicky, and Wissik 2015, 69–71):
- Linguistic (e.g., correctness of term creation, appropriateness of the terms in the given context or domain, misspellings);
- Data-related (e.g., veracity, correctness of definitions, reliability, concept duplicates);
- Formal (e.g., formal duplicates, number of entries, completeness of entries, language attribution, correct legal system attribution (see Section 3.2));
- Data model-related (e.g., granularity of data categories, elementary nature of data categories, closed and open data categories);
- Temporal (e.g., updates after a legal reform, availability of entries, timeliness of data entries);
- Functional (e.g., query responsiveness, role management, querying, and filtering).
The documentation (e.g., records) of activities and results achieved should be part of quality planning. The documentation can be used later to monitor the effectiveness and efficiency of the quality measures and activities and to ensure traceability (e.g., the information provided by legal experts), verification of specific requirements met (e.g., the termbase covers specific legal systems and subdomains, provides for collaborative work, etc.), implementation of preventive actions to avoid nonconformity with the requirements, and corrective actions in case of nonconformity.
6.4Quality assurance and quality control
Quality assurance is “focused on providing confidence that the quality requirements will be fulfilled” (ISO 9000 2015, Clause 3.3.6). It entails the operations taking place before, during and after devising, setting up, populating, and using an MLTDB. Quality control is focused on fulfilling quality requirements (ISO 9000 2015, Clause 3.6.5), making sure that the termbase complies with the requirements for the intended use.
Some quality requirements can be objectively and quantitatively fulfilled (e.g., completeness of entries). On the other hand, several quality requirements, e.g., data-related criteria, such as the correctness of definitions, may be characterized by a degree of vagueness and interpretation in the scope of legal terminology.
QA should be conducted routinely in and across all production cycles, but it can also be defined as a stand-alone project for larger amounts of terminological entries. Routine QA can focus on all new entries, on already existing entries for which feedback or requests have been received, or on random samples. Routine maintenance of older entries is recommended, in particular to ensure that they do not contain obsolete concepts or terms, that hyperlinks are still working, etc. (Arndt et al. 2020, 21). Terminologists should also regularly check for duplicates at the concept and term level.
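The three ways of selecting entries for routine QA (new entries, entries with feedback, random samples) can be combined, as in the following sketch. The dictionary keys, the status values, and the fixed seed are assumptions for illustration.

```python
import random


def select_for_qa(entries, sample_size, seed=0):
    """Return all new or flagged entries plus a random sample of the remaining ones."""
    priority = [e for e in entries if e["status"] == "new" or e["feedback"]]
    priority_ids = {e["id"] for e in priority}
    rest = [e for e in entries if e["id"] not in priority_ids]
    # A seeded generator makes the sample reproducible, which helps when
    # the selection has to be documented.
    rng = random.Random(seed)
    sample = rng.sample(rest, min(sample_size, len(rest)))
    return priority, sample


entries = [
    {"id": 1, "status": "new", "feedback": []},
    {"id": 2, "status": "approved", "feedback": ["definition outdated?"]},
    {"id": 3, "status": "approved", "feedback": []},
    {"id": 4, "status": "approved", "feedback": []},
    {"id": 5, "status": "approved", "feedback": []},
    {"id": 6, "status": "approved", "feedback": []},
]
priority, sample = select_for_qa(entries, sample_size=2)
```

How large the random sample should be is a quality planning decision and is not addressed by the sketch.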
Ideally, each entry and data category is checked by a terminologist and a reviser (see Sections 5.1 and 5.2 respectively) against predefined quality criteria. Depending on the set of quality criteria that are being checked, several revisers may be involved: a legal expert may check the quality of data in terms of veracity and reliability, while a language expert may check the linguistic correctness of a term and its linguistic attributes. QA is supported by role and workflow management components of the terminology management tools which streamline the terminology processes and keep track of ongoing or finished tasks, thereby also documenting the process.
The operationalization and the precise moment at which QA is performed also depend on the setting and the purpose of the terminology work (descriptive, prescriptive, normative). For example, if an MLTDB is used as part of a translation process, it may be deployed as a prescriptive requirement of translation quality, for instance as a quality gate. Quality gates are milestones that require that predefined criteria be met before proceeding to the next step. In such settings, an MLTDB is used before and during translation as well as in the revision stage as an instrument of translation quality assurance (ISO 17100 2015). This means that the QA of the termbase or the relevant entries may need to be finalized before the translation process.
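A quality gate of this kind can be expressed as a simple check that blocks the next step until its criteria are met. In the sketch below, the criterion is that every term needed for a translation job has an approved entry in the termbase; the status values and the data layout are hypothetical.

```python
def quality_gate(required_terms, termbase, accepted_status="approved"):
    """Return (passed, blockers): the gate passes only if no term is blocking."""
    blockers = sorted(
        term for term in required_terms
        if termbase.get(term) != accepted_status
    )
    return len(blockers) == 0, blockers


# Hypothetical termbase excerpt: term -> QA status of its entry.
termbase = {
    "lease contract": "approved",
    "tenant": "in revision",
}
passed, blockers = quality_gate({"lease contract", "tenant", "sublease"}, termbase)
```

Here the gate does not pass: "tenant" is still in revision and "sublease" has no entry at all, so both would have to be finalized before translation starts.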
7.Conclusions and outlook
In this chapter, we have described the specific features of MLTDBs and illustrated the steps within a typical workflow for creating MLTDBs, i.e., needs analysis, design, documentation, term extraction, compilation of terminological entries, revision and QA, maintenance, and dissemination, the roles involved, and the users of MLTDBs. We have shown that users, workflow, roles, and QA in MLTDBs partly differ from those in TDBs in other domains. We have also argued that there is a need for more flexibility within MLTDBs in consideration of different and/or new users and usage scenarios (e.g., different access to the same data by humans and machines, different data visualizations for various user categories, navigation via concept systems/ontologies and not just through designations, etc.). Moreover, we have covered the topic of quality, which is of utmost importance to avoid misunderstanding in a multilingual communication setting, and discussed how to implement quality planning, assurance, and control in MLTDBs. Database quality is a particularly challenging aim when dealing with the legal domain, which is also difficult to define, due to the unique linguistic and system-bound aspects of this domain. Furthermore, we raised the issue that terminological data used by machines might have different requirements and also other quality criteria than terminological data used by humans.
We argue that some of the activities within the workflow and some of the roles have changed following the development of recent technologies and methods. For example, the role of IT experts changed from that of pure database experts, who are involved in implementing the database and maintaining it, to that of experts in the fields of NLP and machine learning, who are involved in developing new methods to enhance the terminology workflow, to enrich terminological resources automatically, and to explore new ways of disseminating terminology.
Finally, we have touched on the issue of machines being increasingly involved in the legal terminology workflow. They can support quite labor-intensive terminology work (e.g., term extraction, automatic ontology creation) as well as streamline the revision process. Furthermore, terminological data can have application scenarios in areas such as machine translation, text mining, or smart systems.