EU phraseological verbal patterns in the PETIMOD 2.0 corpus: A NER-enhanced approach

Gloria Corpas Pastor, Fernando Sánchez Rodas
1. Introduction

As Prieto Ramos (2021, 175) puts it, “terminology and phraseology are key features of legal discourses, and central aspects of professional practice and research in legal translation”. In the context of legal and institutional EU settings, document drafting and mediation (translation/interpreting) are characterized by a high degree of formulaicity, which seems to be particularly notable in the case of translations (Biel 2018).

Recent advances in corpus-based methodologies have allowed researchers to establish the contribution of single-word terms, multi-word terms, lexical phraseological patterns and verbal phraseological patterns to the legal flavour and genre adherence of those texts. One of the biggest advantages of using corpora for terminology work is that concordances can be studied to identify specialized phraseology, including clusters (alternatively known as lexical bundles). As a matter of fact, specialized phraseology tends to cluster around terms, forming a link between such terms and the text (Pontrandolfo 2015, 148). For legal and institutional translators, for example, translation is not only a question of terminology, but also a problem of phraseological conventions. Beyond lexical and terminological equivalence, translators have to tackle the additional difficulty of acquiring familiarity with the genre structures or routine formulae if they want to produce a text that is accurate from the point of view of discourse and register (Pontrandolfo 2015, 137–138).

This chapter deals with the phraseological verbal patterns associated with named entities in an intermodal corpus of EU petitions in English and Spanish (PETIMOD v. 2.0). After a brief overview of terminology and phraseology research in EU legal discourse (Section 2), Section 3 presents the main goals of the study and covers data collection, data extraction and methodology of analysis. Section 4 offers the main findings of our study and a discussion of results, followed by some concluding remarks on the new avenues opened by our research for contrastive analysis, translation, and interpreting (Section 5).

2. Related work

From its very beginning, the study of EU legal terminology has been strongly marked by the peculiar nature of EU law and culture. The first publications already strove to differentiate between new and foreign institutional terms, which actually led to the identification of a distinct linguistic variety in EU texts, known as Eurolect (Goffin 1994, 641). The existence of Eurolect has already been empirically demonstrated in many official languages, such as English (Sandrelli 2018) and Spanish (Blini 2018). As Biel, Biernacka, and Jopek-Bosiacka (2018, 257) state, “as a result of non-native influences on EU English and an increased need to create neologisms, EU texts are marked by some unnatural word combination, including untypical collocations and collocational distortions”. Furthermore, it is observed that:

Eurolects have developed a distinct supranational terminology, as well as stylistic and grammatical features, which depart from certain conventions of national languages. With the advent of corpus methods, it has recently become possible to explore the nature of Eurolects empirically on a large scale. (Biel 2021, 1)

Despite the promising avenues of research, so far there have been relatively few empirical studies of word combinations in the domain of law and in the many different contexts where legal discourse is used. Contrastive and comparative studies also remain relatively scarce, possibly due to the absence of systematic, publicly available corpora for the study of legal language (Biel 2021, 2). At a monolingual level, seminal contributions were made by Goźdź-Roszkowski (2011, 2012), both on phraseology in US legal English. The publication of a special issue of Fachsprache (Goźdź-Roszkowski and Pontrandolfo 2015b) contributed to filling the existing gap in the literature with regard to contrastive and translational multilingual studies (Goźdź-Roszkowski and Pontrandolfo 2015a, 134), followed by a volume with invited contributions and workshop papers (Goźdź-Roszkowski and Pontrandolfo 2018).

Regarding EU language, formulaicity has attracted the most attention from researchers, especially in written genres. Empirical corpus-based research has been carried out to discern how phraseology behaves in legal translation and whether it is likely to retain the same level of formulaicity as non-translated law, as stated by Biel (2014), one of the most prolific scholars in the field. The hypothesis that translations are less patterned and less formulaic than originals and corresponding non-translated texts in the target language was tested by Biel (2018). The author examined to what extent it is possible to recreate or prime the typical patterning of legal language in translation. In her study, the hypothesis was not confirmed. In fact, translations tend to exhibit their own bundles (n-grams) due to interference from their source texts.

Another study by Biel, Koźbiał, and Wasilewska (2019) showed a strong correlation between formulaicity and genres, as well as multiple facets of formulaicity (e.g. tokens vs. types), confirming the increased aggregate formulaicity of translations as regards bundle tokens for all EU genres, except for judgments (Biel, Koźbiał, and Wasilewska 2019). This study also argued that translations develop their own formulaic profiles which are levelled out compared to EU English corpora and which minimally overlap with formulaic profiles of domestic genres, in line with Biel (2018).

Contrasting the formulaicity of translated and non-translated texts is one of the most researched topics in the literature. Given space limitations, suffice it to mention relevant publications that focus on the analysis of multi-word terms and collocations in various types of EU legal, judicial, and regulatory documents. Most authors adopt a monolingual, supranational or transnational perspective (Biel, Biernacka, and Jopek-Bosiacka 2018; Biel and Doczekalska 2020; Biel, Koźbiał, and Wasilewska 2019; Biel and Pytel 2021; Hrežo 2020; Pontrandolfo 2011, 2021), although there are studies that also take a cross-lingual stance, including translation (Dobrić Basaneže 2017; Klabal 2019; Pontrandolfo 2015; Seracini 2020; Trklja 2018; Vigier-Moreno and Sánchez Ramos 2017), or even present an intermodal approach (Ferraresi and Miličević 2017; Ferraresi et al. 2017; Santandrea 2014), among others.

On the other hand, studies on formulaicity in EU interpreting are scarce, though not entirely absent. Apart from the studies already mentioned comparing the phraseological patterns in both modalities of mediated discourse (translation and interpreting), some studies are devoted specifically to interpreting. For instance, Henriksen (2007) conducted an experimental study on the Danish booth in the Joint Interpretation Service of the European Commission. Her findings suggest that formulaic language production contributes to the overall creation of EU discourse and is characterized by an increased homogeneity of simultaneous interpreting in general and with regard to specific booths, as interpreters tend to borrow formulaic phraseologies from their colleagues. In the same vein, Aston (2018) analysed a corpus of European Parliament transcripts and established that interpreters tend to use recurrent formulaic patterns as a way to enhance their fluency, manage discourse functions (turn-taking, justification, etc.), and reduce the cognitive load of interpreting. Some recent intermodal studies of formulaicity in constrained communication were also contributed by Kajzer-Wietrzny and Grabowski (2021).

3. Study goals and methodology

This chapter builds on our previous work and intends to delve further into the study of named entities and their phraseological patterns in an enlarged, intermodal corpus of EU petitions in English and Spanish.

Named entities (NEs) tend to be particularly ubiquitous in Eurolects. Surprisingly, this is an under-researched and almost unexplored topic. In Corpas Pastor and Sánchez Rodas (2022), we conducted an NLP-enhanced analysis of the translation and interpreting shifts of NEs in a former version of the PETIMOD corpus (v. 1.0), an EN<>ES intermodal corpus of documents and speeches rendered at the Committee on Petitions of the European Parliament. In this study we claim that institutional texts exhibit an argument-structure text-organizing pattern centred around named entities and their phraseology, in the same way as has been demonstrated in the field of data mining, in which sentiment has been analysed by looking at opinions towards entities (Steinberger et al. 2011). To this end, we have established three main objectives:

  1. Identifying, extracting and classifying NEs in the corpus;

  2. Identifying, extracting and classifying NEs’ phraseological verbal patterns in the corpus;

  3. Comparing the quantitative and qualitative nature of NE-based phraseology in each of the Eurolect variants (English/Spanish, translated/non-translated, interpreted/non-interpreted) and its degree of formulaicity (cf. Biel, Koźbiał, and Wasilewska 2019).

This study will use a corpus-based and NLP-enhanced methodology, as extraction of NEs and their phraseological patterns will be computer-assisted.
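As a purely illustrative aside on objective 3, the studies cited in Section 2 operationalise formulaicity through lexical bundles (recurrent n-grams), counted as bundle types and bundle tokens. The following is a minimal sketch of that idea, not the procedure used in this chapter: the function name `lexical_bundles`, the 4-gram length, the frequency threshold and the sample sentence are all assumptions introduced here for illustration.

```python
from collections import Counter

def lexical_bundles(tokens, n=4, min_freq=2):
    """Return recurrent n-grams (bundle types) mapped to their frequencies."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Keep only n-grams that recur at least `min_freq` times (assumed cut-off).
    return {gram: freq for gram, freq in ngrams.items() if freq >= min_freq}

# Invented sample text, lower-cased and whitespace-tokenized for simplicity.
sample = ("the committee on petitions decided to close the petition "
          "the committee on petitions decided to keep the petition open").split()

bundles = lexical_bundles(sample, n=4, min_freq=2)
bundle_types = len(bundles)            # number of distinct recurrent 4-grams
bundle_tokens = sum(bundles.values())  # total occurrences of those 4-grams
```

Comparing bundle types and bundle tokens across subcorpora of different sizes would additionally require normalization (e.g. per thousand words), which this sketch leaves out.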

3.1 Units of analysis

For this study we have selected named entities as our units of analysis. NEs are special types of terms that identify real-world “objects”, understood in a broad sense to encompass persons (Ursula von der Leyen), geopolitical locations, such as cities, countries, and states (Brussels, Belgium, UK), non-geopolitical locations, such as mountain ranges, seas, rivers, etc. (Danube), organizations and governing bodies (the International Court of Justice, the European Medicines Agency, EMA), buildings and other infrastructure (Westminster Abbey), and products (AstraZeneca), as well as dates (13 April 2021), times (hour), numbers (third, 40.5), quantities and percentages (kilowatt, 3%), currency (pound sterling, £) and even languages (English), among others.

Named entities, especially those referring to persons, locations and organizations, behave exactly like terms [1] and present similar challenges for their systematic study: homographic pairs, variant spellings, morphological inflection, phraseological patterns, and the possibility of being referred to as multi-word expressions or acronyms (Jacquet et al. 2019).

[1] In fact, named entities have also been referred to as “system-bound” or “culture-bound” terms in the legal translation literature (Vigier-Moreno and Sánchez Ramos 2017). However, despite drawing attention to common problems which professionals could face when translating these entities, the literature does not explore the possible relations between single or multi-word entities, collocations, and formulaic structures in legal and institutional discourse.

For the formulaicity analysis, we will use the classification proposed by Biel (2014, 178–181), which depicts a phraseological continuum with fuzzy boundaries between five categories, ranging from the global textual level to the local microlevel. Examples (in italics) illustrate various phraseological patterns (in bold) with NEs (underlined).

  • Text-organizing patterns are repetitive global textual sequences which are often prescribed in drafting guidelines. They form a matrix of a legal text, emphasizing its ritualized nature. Typical text-organizing patterns include the title of the document, citations, transitions between sections, enacting formulas, amending formulas, and closing formulas, e.g.: The meeting opened at 10.06 on Wednesday, 19 February 2020, with Ms Yana Toom, (2nd Vice – Chair) presiding.

  • Grammatical patterns are genre-specific and recurrent. They express deontic modality, if-then mental models of legal reasoning and other conditional clauses, purpose clauses, the passive voice, and other impersonal structures, e.g.: PETI should not draft an opinion to the AFET; El artículo 29 de la CDPD debe entenderse junto con el artículo 9 (Accesibilidad) [Article 29 of the CRPD should be read in conjunction with Article 9 (Accessibility)].

  • Term-forming patterns are collocates of a generic term which form more specific multi-word terms of varying degrees of terminologicality. The typical, most productive term-forming patterns tend to be Adj + N and N + N, but in practice multi-word terms may be structurally very complex, e.g.: European Commission; cuenca del Mar Menor [(The) Mar Menor basin].

  • Term-embedding collocations [2] are collocates of terms which embed terms in cognitive scripts and the text, evidencing combinatory properties of terms. N + V term-embedding collocations can be deemed prototypical in this category and provide important conceptual domain information. They denote what one can typically do with (or to) the object denoted by the base noun, e.g.: The EU Commission launched an infringement procedure; La Comisión es consciente de las preocupaciones que señalan los peticionarios [The Commission is aware of the concerns raised by the petitioners].

    [2] The term collocation used in this paper follows Biel’s approach (2014), which partially deviates from mainstream postulates of phraseological studies (see Corpas Pastor 2017 for an overview).

  • Finally, lexical collocations are routine formulae at the microstructural level which are not built around terms. They include inter/intratextual referential patterns, such as collocates of editing units, other recurrent patterns referred to as qualifications and non-terminological lexical bundles. In contrast to term-embedding collocations and multi-word terms, recurrence is an important criterion in their identification, e.g.: according to the Water Framework Directive; propuesta de resolución conforme al artículo 227 punto 2 [motion for a resolution pursuant to Rule 227 (2)].

Term-forming patterns and lexical collocations are outside the scope of this paper on verbal patterns with NEs as subjects or complements.
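To make the notion of an N + V term-embedding collocation with an NE more concrete, the sketch below pairs each NE span with the first verb lemma that follows it within a short window. It is a simplified illustration under stated assumptions, not the extraction pipeline used in this study: the `Token` structure, the `ne_verb_pairs` function, the window size and the hand-annotated tags are all hypothetical, and only NEs preceding the verb (typically subjects) are covered. In practice the part-of-speech tags and NE labels would come from an NLP tool.

```python
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    lemma: str
    pos: str      # coarse part-of-speech tag, e.g. "VERB"
    is_ne: bool   # True if the token belongs to a named entity

def ne_verb_pairs(sentence, window=3):
    """Pair each NE span with the first verb lemma within `window` tokens after it."""
    pairs = []
    i = 0
    while i < len(sentence):
        if sentence[i].is_ne:
            start = i
            # Consume the whole multi-word entity.
            while i < len(sentence) and sentence[i].is_ne:
                i += 1
            entity = " ".join(t.text for t in sentence[start:i])
            # Look for the first verb right after the entity.
            for tok in sentence[i:i + window]:
                if tok.pos == "VERB":
                    pairs.append((entity, tok.lemma))
                    break
        else:
            i += 1
    return pairs

# Hand-annotated version of the English example quoted in the list above.
example = [
    Token("The", "the", "DET", False),
    Token("EU", "EU", "PROPN", True),
    Token("Commission", "Commission", "PROPN", True),
    Token("launched", "launch", "VERB", False),
    Token("an", "a", "DET", False),
    Token("infringement", "infringement", "NOUN", False),
    Token("procedure", "procedure", "NOUN", False),
]
pairs = ne_verb_pairs(example)
```

A linear window is a crude stand-in for syntactic analysis; capturing NEs as complements, or subjects separated from their verb by long insertions, would call for dependency parsing instead.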

3.2 Choice of corpus

For this study we will use an enlarged version of the PETIMOD corpus (described in Corpas Pastor and Sánchez Rodas 2022). PETIMOD is a parallel intermodal corpus composed of citizens’ petitions and other documents related to the European Parliament’s Committee on Petitions (PETI). It comprises two subcorpora: (a) original texts and speeches in English and Spanish (PETIMOD_ORIG), and (b) their corresponding translations into Spanish and interpretations into English (PETIMOD_MEDIATED). The genres included are basically notices to members, speeches by MEPs and speakers invited to the Committee on Petitions’ sessions, and non-petitional public documents discussed in the sessions (e.g. reports, opinions).

3.2.1 Corpus size

The initial version of the corpus (PETIMOD 1.0) consisted of all petitions discussed during the sessions of February 2020, and all original Spanish (ES) speeches and their English (EN) interpretations from the session of 19th February 2020. Corpus compilation involved collecting both spoken and written documents. Data collection of written documents was rather straightforward: notices to members, reports and opinions were accessed through the eMeeting portal [3] of the European Parliament, downloaded, reformatted (from .pdf to .txt), coded and stored. In comparison, the collection of oral data and transcriptions was more complex and time-consuming. It involved: (1) accessing oral data via the Webstreaming section of the EP Committees site; (2) downloading and storing recordings in HQ .mp4 format; (3) carrying out automatic speech recognition (ASR) plus automatic text transcription (ATT) of audio files with YouTube; (4) manually checking and revising the transcripts in line with EPTIC conventions (Bernardini et al. 2018) and the EU Interinstitutional Style Guide (European Union 2012a, 2012b), [4] among others; and (5) coding and storing.

[3] https://emeeting.europarl.europa.eu/emeeting/committee/agenda/202002/PETI?meeting=PETI-2020-0219_1P&session=02-19-10-00

[4] The transcriptions employed the English and Spanish ISG PDF editions of the year 2012, together with the latest modifications of the ISG website up to the submission date, that is, March 2021 (see the news archive at http://publications.europa.eu/code/en/en-000300.htm). However, a new PDF edition for all EU languages was released in 2022 (https://data.europa.eu/doi/10.2830/215072).

For this study, the corpus has been enlarged to include the English and Spanish draft agendas [5] and minutes [6] from the February 2020 meeting, as well as the original Spanish speeches from the 20th February session and their interpretations into English. This session was chosen because it had similar features to that of 19th February already compiled (original Spanish speeches on related environmental topics). Draft agendas and minutes are very relevant documents in the context of committee meetings. They were previously used for guidance in the first steps of PETIMOD compilation, especially for identifying and codifying the speakers’ interventions. This time they were also included in the corpus itself. In an NER-oriented study, these two text genres can be very fruitful, as they provide additional NE samples apart from the people and topics intervening in each session. Section B of draft agendas, for example, lists petitions which are proposed for closure in the light of the Commission’s written reply or other documents received. In accordance with the committee’s Guidelines, these items are not discussed during the meeting, but any PETI Member may ask before the end of the meeting for an item in section B to be kept open (European Parliament 2018). Minutes also include additional information in the form of requests on certain petitions not necessarily reflected in the agenda, and an attendance list of each meeting with MEPs, commissioners, guests, journalists, etc. Table 1 summarizes the size of PETIMOD 2.0 (in total, per component and per language). The total number of documents, running words (tokens) and word types (types) have been calculated using ReCor. [7] The corpus size was increased by 86 documents, 3,544 types and 23,608 tokens.

[5] https://emeeting.europarl.europa.eu/emeeting/committee/en/agenda/202002/PETI?meeting=PETI-2020-0219_1P&session=02-19-10-00

[6] https://emeeting.europarl.europa.eu/emeeting/committee/en/agenda/202004/PETI

[7] ReCor is a solution to determine the minimum size of a corpus or a textual collection, regardless of the language or textual genre of the collection. It establishes the minimum threshold for representativeness by means of an algorithm (N-Cor) and analyses lexical density as the corpus grows incrementally (http://www.lexytrad.es/en/resources/recor-3/).

Table 1. PETIMOD 2.0

PETIMOD 2.0                      Documents    Types     Tokens
PETIMOD2_ORIG_EN                        21    5,330     52,421
PETIMOD2_ORIG_ES                        81    3,072     18,409
PETIMOD2_MEDIATED_EN                    81    2,025     15,709
PETIMOD2_MEDIATED_ES                    21    6,262     61,377
PETIMOD2_EN (ORIG + mediated)          102    7,355     68,130
PETIMOD2_ES (ORIG + mediated)          102    9,334     79,786
PETIMOD2                               204   16,689    147,916
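For readers less familiar with the distinction between tokens and types reported in Table 1, the following minimal sketch counts both for a plain-text string. It assumes a simple regex tokenization with case-folding, which is our own simplification and not necessarily how ReCor computes its figures; the function name `count_tokens_and_types` and the sample sentence are hypothetical.

```python
import re

def count_tokens_and_types(text):
    """Return (tokens, types) for a text, using a naive word tokenizer."""
    # \w is Unicode-aware in Python 3, so accented Spanish letters are kept;
    # internal hyphens and apostrophes are treated as part of the word.
    words = re.findall(r"\w+(?:[-'’]\w+)*", text.lower())
    return len(words), len(set(words))

# Invented sample sentence.
sample = "The Commission replied. The petitioner thanked the Commission."
tokens, types = count_tokens_and_types(sample)
```

Counts obtained this way will differ from tool-reported figures whenever tokenization rules differ (e.g. for numbers, acronyms or transcription codes), so they are only comparable within one procedure.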

3.2.2 Transcription conventions and revisions

The initial set of transcription conventions was based on Bernardini et al. (2018, 26–27). For this study, features and codes used in the first compilation process were also revised. The purpose was to better accommodate transcription conventions to our NER-based methodology, removing oral features which could hinder the recognition of named entities or complex phraseology, a task that proved problematic in our previous research (cf. Corpas Pastor and Sánchez Rodas 2022). Thus, a series of non-relevant or problematic items were removed from the initial set, namely filled pauses (ehm), mid-word pauses (proposal /pro_posal), non-verbalized noises ([applause]), non-standard pronunciation (sun /su:n/) and truncated words (admin-). Our proposal for transcription conventions also includes a new feature (borrowings/denominations), as illustrated in Table 2.

Our transcription method follows a minimalistic line, ignoring the representation of purely oral features such as filled and mid-word pauses, non-verbalized noises, non-standard pronunciation, and truncated words. Mispronunciations are relegated to lapses affecting NEs. In these cases, the NE is corrected in transcription and the complete mistaken form written between single slashes (/) right after; this allows NE recognition for each case without compromising error analysis later on. Single-word repetitions are not conveyed, and neither are clarifications from interpreters. On the other hand, more flexibility is given to the use of ellipses (…). Apart from silent pauses, ellipses are also used for signalling relevant reformulations (that is, when an entire word or phrase is uttered and then completely amended) and sub-sentence segments (e.g. appositions or additional discourse markers at the beginning or the middle of a sentence). An addition of our own is the use of quotation marks to identify borrowings and denominations in plain-text format. [Note 8: English quotation marks are used for both languages.]

Table 2.Simplified transcription conventions
Feature Code
Silent pause / reformulations / sub-sentence segments …
Rising intonation ?
Inaudible segment #
Mispronunciation Parlamento /parlo’mento/
Ambiguity NA
Overlapping talk NA
Sentence-like segments //
Borrowings/denominations “fitness check” de las Directivas
[Note 9: Double slashes (//), used in EPIC, were preferred from the start over the original full stop (.) for aesthetic reasons.]
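The kind of clean-up implied by these simplified conventions can be sketched with a few regular expressions. The snippet below is only an illustration, assuming a legacy transcript that still carries the removed oral features. The function name `simplify_transcript` and the regular expressions are hypothetical and are not part of the PETIMOD tooling. Forms written between single slashes are returned separately so that they remain available for error analysis.

```python
import re

# Filled pauses: only "ehm" is attested in the text; extend as needed (assumption).
FILLED_PAUSES = re.compile(r"\behm\b", re.IGNORECASE)
# Non-verbalized noises written between square brackets, e.g. [applause].
NOISES = re.compile(r"\[[^\]]*\]")
# Truncated words: a word ending in a hyphen before whitespace or end of text, e.g. admin-.
TRUNCATED = re.compile(r"\b\w+-(?=\s|$)")
# Forms between single slashes, e.g. /parlo’mento/; double slashes (//) are left alone.
SLASHED = re.compile(r"(?<!/)/([^/\s]+)/(?!/)")


def simplify_transcript(text):
    """Return (cleaned text, list of forms found between single slashes)."""
    slashed_forms = SLASHED.findall(text)
    for pattern in (FILLED_PAUSES, NOISES, TRUNCATED, SLASHED):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the deletions.
    return " ".join(text.split()), slashed_forms


cleaned, forms = simplify_transcript(
    "ehm the Parlamento /parlo’mento/ [applause] admin- voted // next"
)
```

In this toy input the segment marker (//) survives, while the oral features are dropped before the text would be passed to a NER component.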

As in the first version of the corpus, spelling and capitalization were revised using a number of related resources for each language. This time, however, the process was not two-phased: segmentation and spelling revision could be performed simultaneously, taking advantage of our simplified approach to transcription. The English [Note 10: https://publications.europa.eu/code/en/en-000100.htm] and Spanish [Note 11: https://publications.europa.eu/code/es/es-000100.htm] versions of the Interinstitutional Style Guide (ISG) were used as reference, from which a selection of additional spelling resources stemmed. [Note 12: These resources are listed as “Reference Works” for the English publications in the Official Journal and “Obras de consulta” for Spanish-specific conventions, respectively.] In those cases in which the language allows different correct usages, we follow the indications of the Spanish version of the ISG (European Union 2012b, 151), which states that the prevailing criteria will be those of EU translation services (e.g. for capitalization) and the Publications Office agreed norms for all languages (e.g. for acronyms). The PETIMOD corpus (version 1.0) was also consulted throughout the revision process for the sake of coherence.

3.3Named entity recognition

Parallel to the enlargement of the corpus and the simplification of the transcription conventions, the named entity recognition (NER) procedure was also revised. In Natural Language Processing (NLP), NER is a subtask of information extraction consisting in the location and classification of unambiguous objects or items encountered in a document or in a corpus (for a general overview, see Nasar, Jaffry, and Malik 2021 and Nouvel, Ehrmann, and Rosset 2016). Examples of real-world “objects” or NEs found in our corpus are Czech Republic, European Social Fund, Angel Dzhambazki, the Greens, the Ombudsman for Persons with Disabilities, etc.

3.3.1Extraction of entities and system performance

In Corpas Pastor and Sánchez Rodas (2022), the chunking, detection and extraction of NEs were carried out in two phases. First, entities were retrieved automatically with the VIP NER module [Note 13: For a brief description of the VIP system, see Corpas Pastor (2021).] and exported to an Excel file. The NER module integrated SpaCy [Note 14: SpaCy is a free open-source library in Python (https://spacy.io/).] and pre-trained models for English and Spanish. [Note 15: The VIP NER annotation schemes distinguish at least four basic entity types: named persons or families (PER, e.g. Dorthe Christensen, Ádám Kósa); names of politically or geographically defined locations (LOC, e.g. Mar Menor Coastal Lagoon, Salinas y Arenales de San Pedro del Pinatar); names of corporate, governmental or other organizational entities (ORG, e.g. EU, Tribunal de Cuentas Europeo); and miscellaneous entities (MISC, e.g. Tihange 2, AAE UE-Japón).] In order to assess the system performance (and, therefore, the accuracy of results), precision and recall were calculated. [Note 16: Precision refers to the fraction of relevant NEs (i.e. the total number of correctly retrieved NEs minus errors) among all retrieved instances, whereas recall refers to the fraction of retrieved instances among all relevant instances found in the corpus.] For precision, two levels of analysis were established: (1) relevant NEs, i.e. entities that had been correctly extracted, and (2) relevant NEs that had been correctly ascribed to four categories: person (PER), organization (ORG), location (LOC) and miscellaneous (MISC). Precision results were 0.697/0.603 for English and 0.513/0.385 for Spanish. These results showed a better performance for English, especially when relevant and correctly classified NEs were considered. During the second phase of the analysis, the list of automatically retrieved NEs was supplemented with entities extracted manually with SketchEngine. [Note 17: https://www.sketchengine.eu/] Recall of our NER system was then calculated (relevant NEs / relevant and correctly classified NEs): 0.773/0.969 for English, and 0.677/0.612 for Spanish.
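The two levels of analysis can be illustrated with a small set-based computation. The sketch below is not the evaluation script used in the study: the function `precision_recall` and the predicted items are invented for illustration, and only the gold entities and their categories are borrowed from the examples quoted above.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: (entity, category) pairs. The predictions are hypothetical.
gold = {("EU", "ORG"), ("Ádám Kósa", "PER"), ("Tihange 2", "MISC")}
predicted = {("EU", "ORG"), ("Ádám Kósa", "ORG"), ("the petition", "MISC")}

# Level 1: was the entity extracted at all, whatever its category?
level1 = precision_recall({e for e, _ in predicted}, {e for e, _ in gold})

# Level 2: was it extracted and also ascribed to the right category?
level2 = precision_recall(predicted, gold)
```

In this toy case the second level is stricter than the first, because Ádám Kósa is found but labelled ORG instead of PER.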

Low precision/recall results were attributed to various issues related to transcription conventions, the lack of finer-grained named entity categories, or different pre-trained language models. In order to overcome those limitations, transcription conventions were revised and simplified (cf. 3.2). In addition, we adopted a new finer-grained classification of NEs by replacing the initial four categories with a longer list of pretrained categories that can accommodate up to 18 entity types for English and 14 for Spanish (people, organizations, places, money, time, date, laws, languages, products, numbers, etc.).

For NER we used DeepPavlov [Note 18: https://deeppavlov.ai/], an open-source framework for deep learning tasks in Python (Burtsev et al. 2018). The 18 categories have been trained on OntoNotes 5.0 (Weischedel et al. 2013), and BERT word embeddings have been used (Devlin et al. 2019). Unlike SpaCy-based models, which employ convolutional neural networks, BERT-based models have been shown to improve performance on a wide range of NLP tasks. Table 3 compares the performance of the SpaCy and DeepPavlov NER models on PETIMOD 2.0.
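Taggers trained on OntoNotes typically label each token with a BIO tag (B-ORG, I-ORG, O, etc.). Assuming token-level output of that kind, the grouping of tagged tokens into multi-word entities can be sketched as follows. The helper `bio_to_entities` is hypothetical, and the tagged sentence is a hand-made example rather than actual system output.

```python
def bio_to_entities(tokens, tags):
    """Group parallel lists of tokens and BIO tags into (entity, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        label = tag[2:] if tag[:2] in ("B-", "I-") else None
        if tag.startswith("I-") and label == current_label:
            current_tokens.append(token)       # continue the open entity
        elif label is not None:
            flush()                            # B- tag (or stray I- tag) opens a new entity
            current_tokens, current_label = [token], label
        else:
            flush()                            # O tag closes any open entity
            current_tokens, current_label = [], None
    flush()
    return entities


tokens = ["The", "meeting", "resumed", "at", "14:33", "with", "Tatjana", "Ždanoka"]
tags = ["O", "O", "O", "O", "B-TIME", "O", "B-PERSON", "I-PERSON"]
entities = bio_to_entities(tokens, tags)
```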

Table 3.SpaCy and DeepPavlov performance compared (precision)
PETIMOD 2.0 (NER) SpaCy (EN) SpaCy (ES) DeepPavlov (EN) DeepPavlov (ES)
Errors   322   820   309   383
Relevant NEs retrieved 2,256 1,261 1,616 2,185
Precision 0.875 0.605 0.839 0.851
Mean precision (EN and ES) SpaCy: 0.74 DeepPavlov: 0.87
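The per-language precision figures in Table 3 can be retraced from the two count rows. The sketch below assumes that “relevant NEs retrieved” excludes the errors, so that the total number of retrieved items is the sum of both rows; this reading is an assumption, although it agrees with the reported values. The names `TABLE_3` and `precision` are illustrative only.

```python
# Counts copied from Table 3.
TABLE_3 = {
    ("SpaCy", "EN"): {"errors": 322, "relevant": 2256},
    ("SpaCy", "ES"): {"errors": 820, "relevant": 1261},
    ("DeepPavlov", "EN"): {"errors": 309, "relevant": 1616},
    ("DeepPavlov", "ES"): {"errors": 383, "relevant": 2185},
}


def precision(errors, relevant):
    """Fraction of relevant NEs among all retrieved items (relevant + errors)."""
    return relevant / (relevant + errors)


scores = {key: precision(**counts) for key, counts in TABLE_3.items()}

# Absolute difference between the two languages for each system.
gaps = {
    system: abs(scores[(system, "EN")] - scores[(system, "ES")])
    for system in ("SpaCy", "DeepPavlov")
}
```

Under this assumption the computed values stay within rounding distance of the per-language precision row of the table.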

Precision results with SpaCy improve slightly (+0.178 for English, and +0.002 for Spanish; mean precision rate: 0.74; difference rate among languages: 0.27). This could be due to the NER-friendly simplified transcription conventions used in PETIMOD 2.0. By contrast, NER based on DeepPavlov shows slightly lower precision results for English (−0.036), but it exhibits better performance for Spanish (+0.246), with a higher mean precision rate for both languages (0.87) and a smaller difference rate (0.012).

3.3.2Phraseological Pattern Extraction for NEs

The VIP Query Corpus (QC) module and the NER module have been used to study formulaicity. First, the NER module was used to extract NEs for manual pattern extraction. Secondly, the QC Pattern functionality was used to combine part of speech (PoS) tags with the named entity (ENT) tag for automatic pattern extraction. Thirdly, the QC concordance functionality was used to perform manual KWIC searches. Since the QC Pattern functionality is programmed in SpaCy, it was necessary to complement the automated retrieval with manual KWIC searches in order to locate phraseological patterns for the remaining NEs that had only been recognized with DeepPavlov. This was also necessary at times in order to retrieve contextualized examples of the patterns from the corpus. Targeted searches using key verbs and/or NEs were also performed in order to retrieve candidate patterns and/or NEs spotted in the texts which had been identified by neither DeepPavlov nor SpaCy. Finally, results were imported automatically into Excel files or entered manually when needed.

Our analysis focuses on verbal patterns, divided into (i) entity-as-subject patterns and (ii) entity-as-complement patterns. Verbal patterns convey formulaicity at the level of text organization, grammatical relations, and collocational networks. Table 4 shows the list of all phraseological patterns used to query the PETIMOD 2.0 corpus.

Table 4.List of total verbal patterns searched for in the corpus
verbal patterns
Entity-as-subject patterns Entity-as-complement patterns
ENT + V V + ENT
ENT + * + V V + * ENT
ENT + V + * V + * + ENT
V + ENT + *
V + * + * + ENT
V + * + * + * + ENT
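How a query such as V + * + ENT selects candidate sequences can be illustrated with a simple sliding window over tagged tokens. This is a minimal sketch and not the VIP QC implementation. It assumes that each NE has already been merged into a single token tagged ENT, that verbs are tagged V, and that each wildcard (*) stands for exactly one token of any kind. The function `find_pattern` and the tag names other than ENT and V are hypothetical.

```python
def find_pattern(tagged_tokens, pattern):
    """Return the word sequences whose tags match a pattern such as 'V + * + ENT'."""
    slots = [slot.strip() for slot in pattern.split("+")]
    hits = []
    for start in range(len(tagged_tokens) - len(slots) + 1):
        window = tagged_tokens[start:start + len(slots)]
        # A wildcard accepts any tag; other slots must equal the token's tag.
        if all(slot == "*" or slot == tag for slot, (_, tag) in zip(slots, window)):
            hits.append(" ".join(word for word, _ in window))
    return hits


# Hand-tagged toy sentences based on examples from the corpus.
reply = [
    ("Commission", "ENT"), ("reply", "N"), (",", "PUNCT"),
    ("received", "V"), ("on", "PREP"), ("30 August 2017", "ENT"),
]
conclusion = [("The Commission", "ENT"), ("will", "V"), ("continue", "V")]

complement_hits = find_pattern(reply, "V + * + ENT")
subject_hits = find_pattern(conclusion, "ENT + V + *")
```

Every hit returned this way is only a candidate: as in the study, frequency counts and manual inspection would still be needed to decide which sequences qualify as phraseological patterns.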

4.Results and discussion

Additional Excel tables were created to collate and organize all NER results and corpus data obtained in our study. Concrete patterns and subpatterns have been selected according to frequency criteria from the Excel files generated by the VIP NER module and/or the VIP QC Patterns functionality. Automatic pattern search and retrieval have been customized by using the verb (V) tag and wildcards (*), together with the ENT tag (for NEs). Figure 1 shows an example of VIP Corpus query-pattern extraction [prep + * + V + * + ENT] and results per NE type in the PETIMOD_EN subcorpus. Figure 2 displays the results obtained for the same query pattern with ORG filtering. Results can also be exported in .xls format.

Numbers of relevant hits [Note 19: In this context, “relevant hits” means corpus-based examples of patterns containing the named entities (e.g. ENT + V).] have been obtained in various ways, depending on the retrieval mode in place. If it was automatic (A), the column shows the number of hits manually counted with the help of the Excel search function. If it was semi-automatic (S), it shows the number of relevant hits automatically counted by the pattern search function of VIP. If it was manual (M), it shows the number of relevant hits manually counted in the concordance window of VIP.

While the VIP NER module enables sorting by NE type, and the QC Pattern functionality enables sorting both by type and by selected pattern, none of the three retrieval modes allows for automatic counting of the NE types involved. For this reason, the number of NE types has been calculated manually.

Figure 1.VIP pattern extraction interface and query results (PETIMOD_EN)
Figure 2.Filtering results by entity type (PETIMOD_EN)

The tables also include a concordance line containing the pattern instance, which is marked in bold. These ‘good’ examples have been selected according to their frequency, relevance, and context-defining potential. Whenever possible, the corresponding translation or interpretation was searched for in the mediated subcorpora, in order to analyse possible translation and interpretation shifts and/or translationese features.

The results of our analysis are displayed by means of tables that include general results (Table 5 in Section 4.1), and patterns with verbs as pivoting elements (Tables 6–17 in Sections 4.2–4.4), representing the most relevant findings for the three categories in all subcorpora. For reasons of space and scope, term-embedding collocations have been illustrated with prototypical NEs (Comisión/Comisión Europea and Commission/European Commission, respectively) deemed representative of the possible microscopic mapping of entities in these textual genres.

A detailed analysis of the similarities and differences across subcorpora, brief descriptions of observed translation and interpreting shifts, discussion of main findings, including translation traits, etc., are provided for each table. But, first, a word of caution is necessary due to the limitations of our study, namely the size of the corpus and some processing errors in the pattern extraction phase.

4.1Distribution of NEs

Table 5 shows the general distribution of NEs across subcorpora. Figures for absolute and normalized frequencies are provided. The most represented categories in non-mediated English are PER (4.17), DATE (3.99), ORG (3.75), LAW (3.37) and MISC (2.18). However, translations into Spanish reflect a slightly different distribution as regards those main categories: DATE (5.46), ORG (4.06), LAW (3.46), PER (3.43), and MISC (3.18). These differences could be due to the impact of non-translated Spanish (as DATE is higher in the normalized frequency rank, while ORG is 0.1 more frequent than PER), or else be a direct consequence of various translation shifts. The fact that almost all main categories (except PER) present more instances in translated Spanish than in their original English texts also points to the existence of explicitation. In any case, the lower number of instances of PER found in translated Spanish is not in contradiction to explicitation, as original PER might have been translated as ORG.

Regarding interpreted English, the actual figures seem to point to simplification instead, as this subcorpus systematically includes fewer NEs than the corresponding original Spanish speeches. For instance, there are 36 instances of PER in original Spanish as opposed to 17 in interpreted English (the same applies to most categories). The lower number of dates (15) and cardinals (14) in the interpreted corpus compared to the higher number of instances found in the Spanish original speeches (43 and 36, respectively) could also be a sign of cognitive overload or difficulties experienced by the interpreters. The total numbers of NEs seem to support those two opposing general tendencies: the total number of NEs in the translated Spanish subcorpus (1,631) is higher than in the original English texts (1,239), whereas interpreted English contains fewer NEs (145) than the original speeches in Spanish (294).

Table 5.Distribution of NEs across subcorpora
NE type ORIG_EN (Non-T) MED_EN (I) ORIG_ES (Non-I) MED_ES (T)
(for each subcorpus: absolute frequency, then normalized frequency; - marks an empty cell)
PER 219 4.17  17 1.08  36 1.95 211 3.43
NORP  30 0.57  13 0.82  27 1.46  51 0.82
FAC  13 0.24   1 0.06   0 -   8 0.05
ORG 197 3.75  15 0.95  38 2.05 250 4.06
GPE  90 1.71  20 1.27  32 1.73  79 1.28
LOC  21 0.39   7 0.44  14 0.75  26 0.42
PRODUCT   3 0.05   0 -   0 -   1 0.01
EVENT   0 -   0 -   0 -   0 -
WORK_OF_ART   0 -   0 -   0 -   0 -
LAW 177 3.37  11 0.69  23 1.24 213 3.46
LANGUAGE  12 0.22   0 -   0 -  10 0.16
DATE 210 3.99  15 0.69  43 2.32 336 5.46
TIME  30 0.57   2 0.12   5 0.27  28 0.45
PERCENT  13 0.24   2 0.12   2 0.10  15 0.24
MONEY   9 0.17   7 0.44   7 0.37   9 0.14
QUANTITY   8 0.00   4 0.25   6 0.32  10 0.16
ORDINAL  14 0.26   8 0.50   9 0.48  15 0.24
CARDINAL  78 1.48  14 0.88  36 1.94 173 2.81
MISC 115 2.18   9 0.57  16 0.86 196 3.18
TOTAL 1,239 23.59 145 9.21 294 15.89 1,631 26.51
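The normalization base for Table 5 is not stated explicitly. Most of the normalized figures are close to a rate per 1,000 tokens computed on the subcorpus sizes given in Table 1, and the sketch below works under that assumption only. The names `TOKENS` and `per_thousand` are illustrative, and small deviations from the published values are to be expected (for example, because of rounding or a slightly different token count).

```python
# Token counts per subcorpus, copied from Table 1.
TOKENS = {
    "ORIG_EN": 52421,
    "MED_EN": 15709,
    "ORIG_ES": 18409,
    "MED_ES": 61377,
}


def per_thousand(count, subcorpus):
    """Absolute frequency expressed per 1,000 tokens of the given subcorpus."""
    return count / TOKENS[subcorpus] * 1000


# PER row of Table 5: absolute frequencies for each subcorpus.
per_row = {"ORIG_EN": 219, "MED_EN": 17, "ORIG_ES": 36, "MED_ES": 211}
per_normalized = {name: per_thousand(count, name) for name, count in per_row.items()}
```

Normalizing in this way is what makes subcorpora of very different sizes comparable, for instance the 17 PER instances in interpreted English against the 219 in original English.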

4.2Text-organizing patterns

Text-organizing patterns (Table 6) are the most common category in ORIG_EN (94/139 subpatterns, 68%), and also the most varied one (11/24, 46%). 9/11 variations (82%) are built around entities in subject position. The most frequent associated NE types are DATE (42/139, 30%), ORG (36/139, 26%) and TIME (3/139, 2%). This is an indication of the importance of time organization in the activity of the Committee on Petitions, which is reflected in the different parts dividing the sessions (draft agendas and minutes), but also in the administrative processes summarized in the petitions themselves. Another common point is that practically all subpatterns (10/11, 91%) contain a past participle or past continuous form. This could indicate a certain delay between the time when petitions are received and the moment at which the Committee or the Commission takes action. Each pattern can also be related to specific text-organizing functions. Two out of three ENT + V + * subpatterns, for example, are associated with concluding remarks in the petition summaries, whereas the entity-as-complement subpatterns reflect more practical, administrative functions such as deadlines, committee meeting pauses and starts, etc.

Table 6.Text-organizing patterns (ORIG_EN)
Pattern Subpattern Retrieval Hits NE type Example
V + * + ENT received + * + ENT S 26 DATE: 20; ORG: 3; CARDINAL: 2; GPE: 1 Commission reply, received on 30 August 2017
V + * + * + ENT declared + * + * + ENT S 18 DATE: 17; LAW: 1 Admissibility Declared admissible on 4 March 2019
ENT + V + * ENT + will + * A 16 ORG: 14; GPE: 1; NORP: 1 Conclusion The Commission will continue to raise the issue in every possible forum
V + * + ENT requested + * + ENT A 16 ORG: 14; GPE: 1; NORP: 1 Information requested from Commission under Rule 227(6)
V + * + ENT closed + * + ENT S  4 DATE: 3; LAW: 1 The following petitions will be closed: 1512/2010, 1063/2018…
ENT + V + * ENT + decided + * A  3 ORG: 2; GPE: 1 5. Opinions (a) Coordinators decided that PETI should not draft an opinion to the AFET Annual report
V + * + ENT continued + * + ENT S  3 DATE: 1; CARDINAL: 1; TIME: 1 The meeting continued at 11:56 with Ryszard Czarnecki (3rd Vice – Chair) presiding.
V + * + ENT resumed + * + ENT S  3 TIME: 2; DATE: 1 The meeting resumed at 14:33, with Tatjana Ždanoka (1st Vice – Chair) presiding.
V + ENT + * received + ENT + * S  3 CARDINAL: 1; PERSON: 1; ORG: 1 c) Letters received -Poland’s Climate Ministry reply 1099 -
ENT + V + * ENT + concluded + * S  1 ORG: 1 On the basis of these data, and following consultation with the Group of Experts, the Commission concluded that the implementation of the project…
V + * + * + ENT adopted + * + * + ENT M  1 ORG: 1 INFORMATION REPORT Section for Employment, Social Affairs and Citizenship Real rights of persons with disabilities to vote in European Parliament elections Rapporteur: Krzysztof PATER Legal basis Rule 31 of the Rules of Procedure Section responsible Section for Employment, Social Affairs and Citizenship Adopted in section 06/03/2019 Adopted at plenary 20/03/2019
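The proportions reported for text-organizing patterns in ORIG_EN can be retraced from Table 6. The sketch below copies the NE-type counts of each subpattern from the table and aggregates them; the denominator of 139 is the overall number of ORIG_EN pattern hits cited in the text. The variable names are illustrative only.

```python
from collections import Counter

# NE-type counts per subpattern, copied from Table 6 (ORIG_EN).
TABLE_6 = {
    "received + * + ENT": {"DATE": 20, "ORG": 3, "CARDINAL": 2, "GPE": 1},
    "declared + * + * + ENT": {"DATE": 17, "LAW": 1},
    "ENT + will + *": {"ORG": 14, "GPE": 1, "NORP": 1},
    "requested + * + ENT": {"ORG": 14, "GPE": 1, "NORP": 1},
    "closed + * + ENT": {"DATE": 3, "LAW": 1},
    "ENT + decided + *": {"ORG": 2, "GPE": 1},
    "continued + * + ENT": {"DATE": 1, "CARDINAL": 1, "TIME": 1},
    "resumed + * + ENT": {"TIME": 2, "DATE": 1},
    "received + ENT + *": {"CARDINAL": 1, "PERSON": 1, "ORG": 1},
    "ENT + concluded + *": {"ORG": 1},
    "adopted + * + * + ENT": {"ORG": 1},
}
ALL_ORIG_EN_HITS = 139  # total across all pattern categories, as cited in the text

by_type = Counter()
for counts in TABLE_6.values():
    by_type.update(counts)

total_hits = sum(by_type.values())
share_of_category = round(100 * total_hits / ALL_ORIG_EN_HITS)
share_by_type = {ne: round(100 * n / ALL_ORIG_EN_HITS) for ne, n in by_type.items()}
```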

When translating English sequences with text-organizing patterns into Spanish (Table 7), translators depart considerably from non-mediated Spanish. Most patterns in translated Spanish present entities as complements (6/8, 75%) and the most frequent NE types are PER and DATE instead of ORG (cf. Table 4). This reveals that although translated texts generally transfer the usage of text-organizing patterns for the same organizing functions (deadlines, concluding remarks, etc.), normalization seems to operate simultaneously on a smaller scale, affecting syntax (and interestingly also NE types) by accommodation to the Spanish norms. Consider the subpattern presentada + * + ENT, which does not exist in ORIG_EN. Non-translated petitions opt for a sequence that omits any verbal form (e.g. Petition No 1106/2018 by Alexander ‎Edberg Thorén). The shift in translation (Petición n.º 1106/2018, presentada por Alexander Edberg Thorén) adheres better to the target language norms, which demand the introduction of a past participle before the preposition por.

Table 7.Text-organizing patterns (MED_ES)
Pattern Subpattern Retrieval Hits NE type Example
V + * + ENT presentada + * + ENT M 82 PER: 81; ORG: 1 Asunto: Petición n.º 1106/2018, presentada por Alexander Edberg Thorén, de nacionalidad sueca [Petition No 1106/2018 by Alexander Edberg Thorén (Swedish)]
V + * + ENT recibida + * + ENT M 20 DATE: 20 Respuesta de la Comisión, recibida el 30 de agosto de 2017 [Commission reply, ‎received on 30 August 2017]
V + * + * + * + ENT admitida + * + * + * + ENT M 17 DATE: 17 Admisibilidad Admitida a trámite el 4 de marzo de 2019 [Admissibility Declared admissible on 4 March 2019]
ENT + V + * ENT + concluyó + * A  3 ORG: 3 Sobre la base de estos datos, y previa consulta al Grupo de Expertos, la Comisión concluyó que la aplicación del proyecto… [On the basis of these data, and following consultation with the Group of Experts, the Commission concluded that the implementation of the project…]
V + * + ENT apruebe + * + ENT S  2 LOC: 2 SUGERENCIAS La Comisión de Peticiones pide a la Comisión de Libertades Civiles, Justicia y Asuntos de Interior, competente para el fondo, que incorpore las siguientes sugerencias en la propuesta de Resolución que apruebe: 1 [SUGGESTIONS The Committee on Petitions calls on the Committee on Civil Liberties, Justice and Home Affairs, as the committee responsible, to incorporate the following suggestions in its motion for a resolution: 1.]
V + * + * + ENT reanuda + * + * + ENT M  2 TIME: 2 La reunión se reanuda a las 14.33 horas bajo la presidencia de Tatjana Ždanoka (vicepresidenta primera) [The meeting resumed at 14:33, with Tatjana Ždanoka (1st Vice – Chair) presiding]
ENT + V + * ENT + seguirá + * A  1 ORG: 1 Conclusiones La Comisión seguirá planteando la cuestión en todos los foros posibles [Conclusion The Commission will continue to raise the issue in every possible forum]
V + * + * + * + ENT aprobado + * + * + * + ENT M  1 DATE: 1 DOCUMENTO INFORMATIVO Sección de Empleo, Asuntos Sociales y Ciudadanía El derecho real de voto en las elecciones al Parlamento Europeo de las personas con discapacidad Ponente: Krzysztof PATER Fundamento jurídico Artículo 31 del Reglamento interno Sección competente Sección de Empleo, Asuntos Sociales y Ciudadanía Aprobación en sección 06/03/2019 Aprobado en el pleno 20/03/2019 [INFORMATION REPORT Section for Employment, Social Affairs and Citizenship Real rights of persons with disabilities to vote in European Parliament elections Rapporteur: Krzysztof PATER Legal basis Rule 31 of the Rules of Procedure Section responsible Section for Employment, Social Affairs and Citizenship Adopted in section 06/03/2019 Adopted at plenary 20/03/2019]

Another difference with ORIG_ES is that these past forms (5/11 patterns, 45%) are not completely hegemonic, but coexist with present simple (2/11 patterns, 18%) and future simple forms (1/11 pattern, 9%). This could also be compatible with normalization, evidenced by a decreased use of passive forms in Spanish and some differences in textual conventions between languages (see minutes, where resumed + * + ENT shifts to reanuda + * + * + ENT). Finally, explicitation is also observed through translation shifts. For instance, the subpattern apruebe + * + ENT typically occurs in MED_ES. It refers to a future process in which a second EP committee would approve (or not) a motion for a resolution with the help of the opinion of the Committee on Petitions (incorpore las siguientes sugerencias en la propuesta de Resolución que apruebe). The non-translated version is less explicit and does not make reference to the autonomy of the neighbouring committee (incorporate the following suggestions in its motion for a resolution).

Table 8.Text-organizing patterns (ORIG_ES)
Pattern Subpattern Retrieval Hits NE type Example
V + * + * + * + ENT tiene + * + * + * + ENT M  7 PER: 5; TIME: 1; ORG: 1 // tiene la palabra por … bienvenidos // tiene la palabra por cinco minutos // [you have the floor for … welcome five minutes // you have the floor for five minutes]
V + * + ENT presentada + * + ENT M  6 PER: 6 Petición 827/2018 ‎presentada por Olga Daskali de nacionalidad griega [Petition 0827/2008 presented by Olga Daskali (Greek)]
V + ENT tiene + ENT M  5 TIME: 3; LOC: 1; ORG: 1 tiene cinco minutos para plantear su intervención // adelante // [you have five minutes to present your petition // go ahead //]
V + * + * + * + ENT pasaríamos + * + * + * + ENT M  2 CARDINAL: 2 pues damos por concluida esta petición y ‎ pasaríamos a la petición 23 [that concludes that item on our agenda and brings us to item 23]
V + * + * + * + ENT terminar + * + * + * + ENT M  1 CARDINAL: 1 para terminar // los puntos 8 9 y 10 del orden del día se realizarán mediante procedimiento escrito [to conclude items 8 9 and 10 of the agenda will be carried out via the written procedure]
ENT + V ENT + tomaron A  1 ORG: 1 en la reunión de 19 de febrero del 2020 los coordinadores de PETI tomaron las siguientes decisiones [at the meeting of 19 February 2020 the PETI coordinators decided the following]

Text-organizing patterns in the non-interpreted Spanish speeches (Table 8) are strongly associated with PER entities (11/22). Most of the ritualized, repeated forms are used either to present the initiators of the petitions (presentada por Olga Daskali) or to give the floor to them. In the first case, it is worth noting that the oral pattern seems to borrow the above-mentioned phraseme appearing in the title of Spanish petitions (presentada por). In the second case, the complex structure tiene la palabra is used not only to give the floor to certain participants, but also to remind them of the intervention time available, as in the example tiene la palabra por cinco minutos. This introduction, however, seems to have more frequent associations with simpler patterns, such as V + ENT (tiene cinco minutos).

Table 9.Text-organizing patterns (MED_EN)
Pattern Subpattern Retrieval Hits NE type Example
V + * + ENT presented + * + ENT S  6 PER: 5; NORP: 1 this is the petition 0827/2008 presented by Olga Daskali // Greek //
V + ENT have + ENT M  6 TIME: 6 // you have five minutes // tell us about your petition // go ahead //
V + * + * + * + ENT brings + * + * + * + ENT S  1 CARDINAL: 1 that concludes that item on our agenda // that brings us to item 23 //
V + * + * + * + ENT move + * + * + * + ENT S  1 CARDINAL: 1 thank you very much // we can ‎move on to point 12 //
V + * + ENT conclude + * + ENT S  1 CARDINAL: 1 to conclude items 8 9 and 10 of the agenda will be carried out via the written procedure

When interpreted into English, text-organizing patterns (Table 9) occur less frequently (34%) than in the original subcorpora of Spanish (55%) and English (68%). Variation is also smaller in mediated English when compared to both originals (ORIG_EN 46%, ORIG_ES 33%, MED_EN 31%), although in the case of ORIG_ES there is almost no difference. These figures point towards simplification in the interpreted discourse, which can be additionally supported by two shifts. The interpreted concordance you have five minutes // tell us about your petition has a simpler syntax than the non-interpreted Spanish (tiene cinco minutos para plantear su intervención). Another example of simplification is petition 0827/2008 presented by Olga Daskali // Greek //, in which the original Spanish de nacionalidad griega is condensed into a single word. Negative transfer (i.e. interference) can also be observed in the use of the subpattern presented + * + ENT.

4.3Grammatical patterns

Grammatical patterns (Table 10) are the second most frequent category in ORIG_EN (28/139, 20%). They are highly associated with ORG NEs (20/28, 71%) and mostly feature entity-as-subject patterns (4/6, 67%). Deontic modal verbs represent most findings in this category, [Note 20: It must also be noted that different modal verbs appear in different types of patterns. Consider for example will, which was previously associated with text-organizing patterns (see Table 6).] although important differences between them should be established. Weaker obligation patterns, such as should (7/28) and may (3/28), are more numerous than the stronger shall (6/28) and must (3/28). As an example, the opinion PETI should not draft is weaker than the sentence The European Parliament shall draw up a proposal, which is in fact a quotation from the EC Treaty. Two impersonal structures were also detected, with different forms: one is passive (provided + * + ENT) and the other is a third-person-singular present tense (calls on the Commission to…).

Table 10.Grammatical patterns (ORIG_EN)
Pattern Subpattern Retrieval Hits NE type Example
ENT + V + * ENT + should + * A  7 ORG: 6; LAW: 1 Opinions (a) Coordinators decided that PETI should not draft an opinion to the AFET Annual report on arms export
ENT + V + * ENT + shall + * A  6 ORG: 5; GPE: 1 The European Parliament shall draw up a proposal for elections by direct universal suffrage
V + * + ENT provided + * + ENT S  6 ORG: 3; CARDINAL: 1; DATE: 1; NORP: 1 According to information provided by the European Investment Bank, the promoter undertakes also to exclusively use biomass
ENT + V + * ENT + may + * S  3 ORG: 2; GPE: 1 correct transposition of the Directive, failing which the Commission may refer the case to the 1 European Nuclear Safety Regulators Group
ENT + V + * ENT + must + * A  3 GPE: 2; ORG: 1 According to this conditionality, Member States must have in place and implement a national strategic policy framework for poverty reduction
V + * + ENT calls + * + ENT A  3 ORG: 3 calls on the Commission to respect the commitments made in its 2019 communication

Grammatical patterns in MED_ES (Table 11) are more numerous than in ORIG_EN (27% vs. 20%), which could be an indicator of aggregate transfer. Another indicator of transfer would be the stability of the relative percentage of entity-as-subject patterns, which remains practically the same as in ORIG_EN (25% vs. 29%).

Together with transfer, normalization is hypothesized at the microtextual level on the basis of three shifts. Deontic modal verbs are not always calqued into Spanish, even when possible, which reflects the preference of the target language for alternative forms of expressing modality. This is the case for two example patterns which share the same verb (elaborar). [Note 21: This may be regarded as an additional hint of simplification and/or convergence.] In the first, Coordinators decided that PETI should not draft changes to Los coordinadores deciden que la Comisión PETI no elabore (subjunctive). [Note 22: Also, the use of the subjunctive with the verb decidir conveys a higher degree of assertiveness than the alternatives debería elaborar and debe elaborar.] In the second example, The European Parliament shall draw up a proposal changes to el Parlamento Europeo elaborará un proyecto (future simple). Even when equivalent periphrastic structures are used, involuntary errors occur due to the influence of target language rules. In the example the Commission may refer the case to the European Nuclear Safety Regulators Group → la Comisión puede remitir el asunto al Tribunal de Justicia de la Unión Europea, the second NE is a human mistake, probably caused by the usual association of the term case with the Court of Justice of the European Union.

Table 11. Grammatical patterns (MED_ES)

| Pattern | Subpattern | Retrieval | Hits | NE type | Example |
| --- | --- | --- | --- | --- | --- |
| V + * + * + ENT | pide + * + * + ENT | M | 36 | ORG: 35; MISC: 1 | pide a la Comisión que respete los compromisos asumidos en su Comunicación de 2019 [calls on the Commission to respect the commitments made in its 2019 communication] |
| ENT + V + * | ENT + deben + * | M | 7 | GPE: 5; ORG: 2 | Según esta condición, los Estados miembros deben crear y aplicar un marco estratégico nacional para la reducción de la pobreza [According to this conditionality, Member States must have in place and implement a national strategic policy framework for poverty reduction] |
| ENT + V + * | ENT + puede + * | M | 3 | ORG: 3 | correcta transposición de la Directiva; en su defecto, la Comisión puede remitir el asunto al Tribunal de Justicia de la Unión Europea [correct transposition of the Directive, failing which the Commission may refer the case to the 1 European Nuclear Safety Regulators] |
| ENT + * + V | ENT + * + elabore | S | 2 | ORG: 2 | Los coordinadores deciden que la Comisión PETI no elabore una opinión sobre el Informe anual de la Comisión AFET sobre la exportación de armas [Coordinators decided that PETI should not draft an opinion to the AFET Annual report on arms export] |
| ENT + V | ENT + elaborará | M | 2 | ORG: 2 | el Parlamento Europeo elaborará un proyecto encaminado a hacer posible su elección por sufragio universal directo [The European Parliament shall draw up a proposal for elections by direct universal suffrage] |
| V + * + ENT | facilitada + * + ENT | M | 1 | ORG: 1 | Según la información facilitada por el Banco Europeo de Inversiones, el promotor se compromete, asimismo, a emplear en el proyecto exclusivamente biomasa [According to information provided by the European Investment Bank, the promoter undertakes also to exclusively use biomass] |

Grammatical patterns associated with entities in the non-mediated Spanish subcorpus (Table 12) are less numerous in absolute terms than in the mediated subcorpus (11 vs. 51 hits). Occurrence percentages of grammatical patterns, however, grow steadily from ORIG_EN (20%) and MED_ES (26%) to ORIG_ES (28%). As in non-translated English (71%) and translated Spanish (88%), NEs are mostly ORG (82%). As with text-organizing patterns, these progressively increasing percentages can be read as a sign of transposition. Additional evidence might be that three of the five verbs were already present in MED_ES (pedir, facilitar, deber) and that such patterns present two subtle shifts apparently caused by a remote, indirect influence of the source language. Consider the NE associated with the pattern pedimos + * + ENT (Comité PETI), literally borrowed from PETI Committee, and the subpattern pedir + * + ENT, a grammatical mistake in standard Spanish possibly influenced by English.
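
The ORG shares quoted above for the two Spanish subcorpora can be re-derived from the NE type column of Tables 11 and 12. The short Python sketch below does so; the per-row counts are transcribed from those tables, while the function name ne_type_share is a hypothetical helper and rounding to the nearest integer is assumed.

```python
from collections import Counter

# NE-type counts per table row, transcribed from Tables 11 and 12.
MED_ES_ROWS = [
    {"ORG": 35, "MISC": 1},
    {"GPE": 5, "ORG": 2},
    {"ORG": 3},
    {"ORG": 2},
    {"ORG": 2},
    {"ORG": 1},
]
ORIG_ES_ROWS = [
    {"ORG": 4},
    {"ORG": 4},
    {"ORG": 1},
    {"LOC": 1},
    {"DATE": 1},
]


def ne_type_share(rows, ne_type):
    """Return (hits for ne_type, total hits, percentage rounded to an integer)."""
    totals = Counter()
    for row in rows:
        totals.update(row)  # adds the counts of each row to the running totals
    total_hits = sum(totals.values())
    return totals[ne_type], total_hits, round(100 * totals[ne_type] / total_hits)


print(ne_type_share(MED_ES_ROWS, "ORG"))   # (45, 51, 88)
print(ne_type_share(ORIG_ES_ROWS, "ORG"))  # (9, 11, 82)
```

The totals (51 and 11 hits) also show why the non-mediated subcorpus has fewer grammatical patterns in absolute terms even though their relative share is higher.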

In the interpreted English subcorpus, two new grammatical patterns (Table 13) are introduced. The first one is ask + * + ENT, which functions as a register-down alternative to call + * + ENT. In fact, it presents more occurrences (nine against six), which again confirms the informality of the interpreted English subcorpus. The second new pattern is ENT + would + *. The PER entity functioning as subject (Sánchez) is an informal alternative to Pedro Sánchez. In a further example, the subpattern ENT + should + * reveals another register-down shift which additionally decreases the number of words in the interpreted corpus (“el Mar Menor debe ser una cuestión de Estado” → “Mar Menor should be tackled by the State”). As in MED_ES, shifts in modal verbs also imply shifts of meaning and register in the main verb.

Table 12. Grammatical patterns (ORIG_ES)

| Pattern | Subpattern | Retrieval | Hits | NE type | Example |
| --- | --- | --- | --- | --- | --- |
| V + * + ENT | pedimos + * + ENT | M | 4 | ORG: 4 | pedimos que el Comité PETI solicite a la Comisión Europea que asuma su responsabilidad [we call on the PETI Committee to ask the European Commission to assume its responsibilities] |
| V + * + ENT | pedir + * + ENT | M | 4 | ORG: 4 | y cómo no… también pedir a la Comisión Europea que informe sobre los motivos [and of course… also call on the European commission to inform us about the reasons…] |
| ENT + V + * | ENT + tiene + * | S | 1 | ORG: 1 | eso es un recurso normal presentado por unos ciudadanos y por lo tanto el Supremo tiene que decidir sobre esa petición de nulidad [this is an ordinary appeal filed by citizens and therefore the Supreme Court has to decide on this petition for annulment] |
| ENT + V | ENT + debe | M | 1 | LOC: 1 | el Mar Menor debe ser un asunto de Estado [The Mar Menor should be tackled by the State] |
| V + * + ENT | facilitado + * + ENT | M | 1 | DATE: 1 | según la información que se nos ha facilitado en el año 2012 siendo el peticionario alcalde [according to the information provided to us in 2012 when the petitioner was the lord mayor] |

Table 13. Grammatical patterns (MED_EN)

| Pattern | Subpattern | Retrieval | Hits | NE type | Example |
| --- | --- | --- | --- | --- | --- |
| V + * + ENT | ask + * + ENT | M | 9 | ORG: 7; GPE: 1; LOC: 1 | we will ask the Belgian government as the Petitions Committee to provide information about the file |
| V + * + ENT | call + * + ENT | M | 6 | NORP: 4; ORG: 2 | we would also call on the Commission to draw up some guidelines on the appropriate use of lighting so it causes less damage |
| ENT + V + * | ENT + should + * | A | 4 | ORG: 2; LOC: 2 | Mar Menor should be tackled by the State |
| ENT + V + * | ENT + would + * | M | 1 | PER: 1 | Sánchez wouldn’t loan this to the region and prevented the recovery plans for the Mar Menor |

4.4 Term-embedding collocations

Term-embedding collocations display similar numbers in all subcorpora because prototypical NEs (Commission/European Commission) and similar patterns (N + V) were deliberately searched for by way of exemplification. All occurrences are associated with ORG NEs and involve an entity in subject position. The remaining indicators, however, show certain differences between components (see below). Although these numbers may be biased by our conscious choice, the findings might also suggest transfer in translated and interpreted subcorpora.

This category is the least frequent (17/139, 12%) in non-translated English (Table 14). Even though the subpatterns are distributed between ENT + V (43%) and ENT + V + * (57%), personification is a common feature of all term-embedding collocations found in ORIG_EN. The European Commission is construed as a human being with the ability to perform both mental and physical tasks. The first group (12/17, 71%) describes the often diplomatic postures maintained by this body in relation to the citizens’ petitions (Commission + is + aware, Commission + considers + that, Commission + would + like, Commission + maintains, Commission + understands).²³ In the second group (5/17, 29%), the auxiliary can is introduced as a mechanism to limit more direct actions available to this entity. The main verb take can be spotted in both examples (The Commission cannot take on the role… and The Commission can take legal action…).
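
Two different denominators are at work in the figures above: the group shares (71% and 29%) are computed over hits, whereas the pattern-shape shares (43% and 57%) are computed over the number of subpatterns. The Python sketch below re-derives both from Table 14; the group labels "stance" and "can" are ad hoc names chosen here for the two groups described in the text, not categories used in the study, and the helper pct is hypothetical.

```python
# (pattern shape, subpattern, hits, group), transcribed from Table 14.
# The group labels are ad hoc names for the two groups discussed in the text.
TABLE_14 = [
    ("ENT + V + *", "Commission + is + aware", 4, "stance"),
    ("ENT + V + *", "Commission + can + not", 4, "can"),
    ("ENT + V + *", "Commission + considers + that", 3, "stance"),
    ("ENT + V + *", "Commission + would + like", 2, "stance"),
    ("ENT + V", "Commission + maintains", 2, "stance"),
    ("ENT + V", "Commission + can", 1, "can"),
    ("ENT + V", "Commission + understands", 1, "stance"),
]


def pct(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Shares by hits (denominator: 17 occurrences).
total_hits = sum(row[2] for row in TABLE_14)
stance_hits = sum(row[2] for row in TABLE_14 if row[3] == "stance")
can_hits = total_hits - stance_hits

# Shares by subpattern (denominator: 7 subpatterns).
n_subpatterns = len(TABLE_14)
n_ent_v = sum(1 for row in TABLE_14 if row[0] == "ENT + V")
n_ent_v_star = n_subpatterns - n_ent_v

print(f"stance: {stance_hits}/{total_hits} = {pct(stance_hits, total_hits)}%")
print(f"can:    {can_hits}/{total_hits} = {pct(can_hits, total_hits)}%")
print(f"ENT + V:     {n_ent_v}/{n_subpatterns} = {pct(n_ent_v, n_subpatterns)}%")
print(f"ENT + V + *: {n_ent_v_star}/{n_subpatterns} = {pct(n_ent_v_star, n_subpatterns)}%")
```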

Table 14. Term-embedding collocations (ORIG_EN) – Commission/European Commission

| Pattern | Subpattern | Retrieval | Hits | NE type | Example |
| --- | --- | --- | --- | --- | --- |
| ENT + V + * | Commission + is + aware | M | 4 | ORG: 4 | The Commission is aware of the concerns raised by the petitioners on the threats and problems affecting the Mar Menor lagoon |
| ENT + V + * | Commission + can + not | A | 4 | ORG: 4 | the Commission can not take on the role of an independent monitoring mechanism to ensure the implementation of the Convention in the EU |
| ENT + V + * | Commission + considers + that | A | 3 | ORG: 3 | The Commission considers that the difficult situation concerning the citrus market in Spain was caused by the particular conditions of production |
| ENT + V + * | Commission + would + like | A | 2 | ORG: 2 | The Commission would like to reiterate that, in line with the division of responsibilities under EU law, the decision to operate a nuclear power plant remains with the Member State |
| ENT + V | Commission + maintains | A | 2 | ORG: 2 | Conclusion The Commission maintains its previous conclusions in relation to this Petition |
| ENT + V | Commission + can | A | 1 | ORG: 1 | the Commission can take legal action against Member States failing to comply with the new requirements. |
| ENT + V | Commission + understands | A | 1 | ORG: 1 | Conclusion Based on the information provided, the Commission understands that the situation has not been normalised yet |

Collocates of Commission in translated Spanish (Table 15) are proportionally fewer than in the non-translated English corpus (7% vs. 12%), and equally reflect personification. Nevertheless, some patterns are redistributed because of alternative verb choices. Comisión + entiende, for example, presents one more hit than Commission + understands because the verb entender has been used to translate both understand and consider. Similarly, the pattern Commission + would + like is split into two (Comisión + desea and Comisión + querría). The example La Comisión desea reiterar que… is considerably less distant than The Commission would like to reiterate, adding an affective nuance not found in the English personifications.

Table 15. Term-embedding collocations (MED_ES) – Commission/European Commission

| Pattern | Subpattern | Retrieval | Hits | NE type | Example |
| --- | --- | --- | --- | --- | --- |
| ENT + V + * | Comisión + es + consciente | M | 4 | ORG: 4 | La Comisión es consciente de las preocupaciones que señalan los peticionarios sobre los problemas y amenazas que afectan a la laguna del Mar Menor [The Commission is aware of the concerns raised by the petitioners on the threats and problems affecting the Mar Menor lagoon] |
| ENT + * + V | Comisión + no + puede | M | 3 | ORG: 3 | la Comisión no puede asumir el papel de un mecanismo de supervisión independiente para garantizar la aplicación de la Convención en la Unión [the Commission cannot take on the role of an independent monitoring mechanism to ensure the implementation of the Convention in the EU] |
| ENT + V + * | Comisión + considera + que | S | 2 | ORG: 2 | La Comisión considera que la información contenida en la petición no requiere, en principio, la adopción de medida adicional alguna en el caso en cuestión por lo que respecta a la aplicación del Tratado Euratom y del Derecho derivado de este [The Commission considers that the information contained in the petition does not, in principle, require any further action to be taken in the case in question as regards the application of the Euratom Treaty and secondary legislation deriving from it] |
| ENT + V | Comisión + entiende | A | 2 | ORG: 2 | La Comisión entiende que la difícil situación del mercado de los cítricos en España se produjo por las condiciones particulares de la producción [The Commission considers that the difficult situation concerning the citrus market in Spain was caused by the particular conditions of production] |
| ENT + V | Comisión + desea | A | 1 | ORG: 1 | La Comisión desea reiterar que, en consonancia con el reparto de responsabilidades en virtud de la legislación de la Unión, la decisión de explotar una central nuclear incumbe al Estado miembro [The Commission would like to reiterate that, in line with the division of responsibilities under EU law, the decision to operate a nuclear power plant remains with the Member State] |
| ENT + V | Comisión + querría | S | 1 | ORG: 1 | la Comisión querría indicar que la Directiva 2014/52/UE8 (Directiva sobre la evaluación de impacto ambiental (EIA)) exige [the Commission would like to point out that Directive 2014/52/EU8 (the Environmental Impact Assessment (EIA) Directive) requires] |
| ENT + V | Comisión + puede | M | 1 | ORG: 1 | la Comisión puede emprender acciones legales contra los Estados miembros que no cumplan los nuevos requisitos. [the Commission can take legal action against Member States failing to comply with the new requirements] |

Non-mediated Spanish term-embedding collocations (Table 16) are more frequent (17%) than in the corresponding mediated subcorpus (7%), which supports the hypothesis of simplification of mediated discourse. Other translationese traits are reflected in the percentage of subpatterns: only 29% (two out of seven) are shared between original and mediated Spanish. Despite the smaller number of occurrences (7), collocations also reflect personification phenomena in the non-interpreted Spanish corpus. Indeed, metaphors acquire more diverse forms than in the written corpora. The affective verb desear is repeated in this subcorpus and is linked to two perception verbs which embody (and thus personify) the institution (la Comisión desea siempre escuchar a todas las partes… conocer su visión //). Additionally, new affective meanings are introduced (la Comisión comparte la preocupación…), together with verbs of speech (la Comisión Europea afirma…) and others by which the institution performs tasks humans would (la Comisión Europea incoó un procedimiento… and la Comisión Europea publicó…). This could be a hedging mechanism, designed to hide the real human subject of such procedures. As can be observed, ENT + V patterns with Comisión Europea can be found in this subcorpus, but not in the written collections. It is possible that this terminological variant is introduced in non-interpreted Spanish to better distinguish the executive Commission from the PETI Committee.
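
One possible way of arriving at the two-out-of-seven overlap is sketched below in Python. It assumes that a subpattern counts as shared when both the entity form and the verb form coincide, ignoring negation (no puede) and trailing elements (considera + que); this matching criterion is an assumption of the sketch, not a procedure stated in the study. The subpatterns are transcribed from Tables 15 and 16.

```python
# Subpatterns reduced to (entity form, verb form), transcribed from Table 15.
# "no puede" and "puede" collapse into one pair under this reduction.
MED_ES = {
    ("Comisión", "es"),
    ("Comisión", "puede"),
    ("Comisión", "considera"),
    ("Comisión", "entiende"),
    ("Comisión", "desea"),
    ("Comisión", "querría"),
}

# The seven subpatterns of Table 16, in table order.
ORIG_ES = [
    ("Comisión", "desea"),
    ("Comisión", "comparte"),
    ("Comisión", "considera"),
    ("Comisión Europea", "incoó"),
    ("Comisión Europea", "considera"),
    ("Comisión Europea", "afirma"),
    ("Comisión Europea", "publicó"),
]

# A subpattern is "shared" if the same (entity, verb) pair occurs in MED_ES.
shared = [pair for pair in ORIG_ES if pair in MED_ES]
share = round(100 * len(shared) / len(ORIG_ES))

print(shared)
print(f"{len(shared)}/{len(ORIG_ES)} = {share}%")
```

Under this criterion the shared subpatterns are Comisión + desea and Comisión + considera. The variant Comisión Europea + considera is not counted, because the longer entity form does not occur in the written subpatterns of Table 15.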

Table 16. Term-embedding collocations (ORIG_ES) – Commission/European Commission

| Pattern | Subpattern | Retrieval | Hits | NE type | Example |
| --- | --- | --- | --- | --- | --- |
| ENT + V | Comisión + desea | A | 1 | ORG: 1 | la Comisión desea siempre escuchar a todas las partes // conocer su visión // [the Commission is always willing to listen to all parties // to hear their views //] |
| ENT + V | Comisión + comparte | A | 1 | ORG: 1 | la Comisión comparte la preocupación existente por las distintas presiones que el Mar Menor está sufriendo [The Commission shares the concern about the different pressures that the Mar Menor is undergoing] |
| ENT + V | Comisión + considera | A | 1 | ORG: 1 | Comisión considera en general que España no monitoriza de manera adecuada sus aguas [The Commission considers that, in general, Spain is not adequately surveying its waters] |
| ENT + V | Comisión Europea + incoó | A | 1 | ORG: 1 | La Comisión Europea incoó un procedimiento de infracción contra España en el año 2015 [The European Commission opened an infringement proceeding against Spain in 2015] |
| ENT + V | Comisión Europea + considera | A | 1 | ORG: 1 | la Comisión Europea considera en efecto que la aplicación efectiva de la política medioambiental de la Unión Europea… [the European Commission considers that the effective implementation of the EU environmental policy…] |
| ENT + V | Comisión Europea + afirma | A | 1 | ORG: 1 | la Comisión Europea afirma que las autoridades regionales han emprendido una serie de medidas jurídicas y técnicas [European Commission states that the regional authorities have undertaken a number of legal and technical measures] |
| ENT + V | Comisión Europea + publicó | M | 1 | ORG: 1 | Comisión Europea publicó el año pasado en 2019 su… en un informe que remitió al Consejo… […European Commission published last year in 2019 its… in a report to the Council…] |

In the interpreted English subcorpus (Table 17), the total number of term-embedding collocations (9) remains quite similar to that of non-interpreted Spanish. Even though a new mental verb is introduced (believe), the rest of the examples are associated with more physical meanings, even with an idea of movement and urgency which cannot be found in non-interpreted Spanish (the Commission needs to inform us, measures the Commission takes, the Parliament but also the Commission act to find a solution…).

Table 17. Term-embedding collocations (MED_EN) – Commission/European Commission

| Pattern | Subpattern | Retrieval | Hits | NE type | Example |
| --- | --- | --- | --- | --- | --- |
| ENT + V | Commission + said | A | 2 | ORG: 2 | there were three parliamentary groups present here // socialist ECR and EPP // and they too heard what the Commission said // |
| ENT + V | Commission + needs | A | 2 | ORG: 2 | the Commission needs to inform us about what steps are being taken |
| ENT + V | European Commission + needs | A | 1 | ORG: 1 | the European Commission needs to guarantee the reciprocity clauses |
| ENT + V | European Commission + remains | A | 1 | ORG: 1 | the infringement process continues its path and the European Commission remains alert |
| ENT + V | European Commission + believes | A | 1 | ORG: 1 | the European Commission believes that effective application of EU environmental law |
| ENT + V | Commission + takes | A | 1 | ORG: 1 | we ‘ll look to see what measures the Commission takes in the future on the safety of asbestos |
| ENT + V | Commission + act | M | 1 | ORG: 1 | to keep this petition open so that the Parliament but also the Commission act to find a solution |

Finally, these data are in line with our previous findings (Corpas Pastor and Sánchez Rodas 2022). Simplification appears to be a contingent feature which depends on the mediation mode and the source languages involved (and also on the topic of the source text). Another relevant finding of this study is that there are clear differences in shifts between EN-ES translations and ES-EN interpretations of NEs in the Petitions Committee, as well as normalisation of specialized phraseology, especially in interpreted English (from Spanish).

5. Conclusion

EP institutional texts exhibit an argument-structure text-organizing pattern centred around named entities (NEs) and their phraseology. This study is one of the first computer-assisted analyses of named entities in an English<>Spanish intermodal corpus derived from the activities of the PETI Committee at the European Parliament. To uncover their formulaic patterns, we have used an NLP-enhanced methodology which combines corpus work and NER. The most represented NE categories are PER, DATE, ORG, LAW and MISC, although their frequency ranks vary across subcorpora.

In terms of NE frequency and distribution, translations (mediated Spanish) tend to show explicitation traits, whereas interpretations (mediated English) appear to be more prone to simplification. Regarding NE formulaicity, text-organizing patterns are the most frequent type and present higher numbers of patterns and subpatterns. In translated Spanish, text-organizing patterns tend to conform to Spanish norms and exhibit some degree of explicitation. By contrast, interpreted English exhibits fewer patterns and less variation, which also points to simplification. Grammatical patterns reveal differences between mediated and non-mediated discourse. Our findings suggest the existence of negative transfer in translated Spanish, whereas interpreted English reveals register-down shifts (simplification). On the other hand, term-embedding collocations are heavily associated with personification across subcorpora. In mediated Spanish, they instantiate cases of transfer and/or simplification. In mediated English, however, term-embedding collocations are a clear case of source language shining through (Teich 2003).

Finally, NER-enhanced corpus analysis has proved to be a powerful methodology to discover translationese traits through NEs’ phraseological patterns. In this line, several cases of negative transfer have been found to operate mostly from non-translated English to translated Spanish. Further studies are needed to establish whether this is only typical of the PETIMOD corpus or whether it can also be a feature of other types of Eurolects.

Notes

1. In fact, named entities have also been referred to as “system-bound” or “culture-bound” terms in the legal translation literature (Vigier-Moreno and Sánchez Ramos 2017). However, despite drawing attention to common problems which professionals could face when translating these entities, the literature does not explore the possible relations between single or multi-word entities, collocations, and formulaic structures in the legal and institutional discourse.
2. The term collocation used in this paper follows Biel’s approach (2014), which partially deviates from mainstream postulates of phraseological studies (see Corpas Pastor 2017 for an overview).
4. The transcriptions employed the English and Spanish ISG PDF editions of the year 2012, together with the latest modifications of the ISG website up to the submission date, that is, March 2021 (see the news archive at http://publications.europa.eu/code/en/en-000300.htm). However, a new PDF edition for all EU languages was released in 2022 (https://data.europa.eu/doi/10.2830/215072).
7. ReCor is a solution to determine the minimum size of a corpus or a textual collection, regardless of language or textual genre of the collection, establishing therefore the minimum threshold for representation by an algorithm (N-Cor) and analyzing lexical density according to the incremental increase in the corpus (http://www.lexytrad.es/en/resources/recor-3/).
8. English quotation marks are used for both languages.
9. Double slashes (//), used in EPIC, were preferred over the original full stop (.) from the start for aesthetic reasons.
12. These resources are listed as “Reference Works” for the English publications in the Official Journal and “Obras de consulta” for Spanish-specific conventions, respectively. In those cases in which language offers different correct usages, we follow the indications of the Spanish version of the ISG (European Union 2012b, 151), which states that the prevailing criteria will be those of EU translation services (e.g. for capitalization) and the Publications Office agreed norms for all languages (e.g. for acronyms).
13. For a brief description of the VIP system, see Corpas Pastor (2021).
14. SpaCy is a free open-source library in Python (https://spacy.io/).
15. The VIP NER annotation schemes distinguish at least four basic entity types: named persons or families (PER, e.g. Dorthe Christensen, Ádám Kósa); names of politically or geographically defined locations (LOC, e.g. Mar Menor Coastal Lagoon, Salinas y Arenales de San Pedro del Pinatar); names of corporate, governmental or other organizational entities (ORG, e.g. EU, Tribunal de Cuentas Europeo); and miscellaneous entities (MISC, e.g. Tihange 2, AAE UE-Japón).
16. Precision refers to the fraction of relevant NEs (i.e. a total number of correctly retrieved NEs minus errors) among all retrieved instances whereas recall refers to the fraction of retrieved instances among all relevant instances found in the corpus.
19. In this context, “relevant hits” means corpus-based examples of patterns containing the named entities (e.g. ENT + V).
20. It must also be noted that different modal verbs appear in different types of patterns. Consider for example will, which was previously associated with text-organizing patterns (see Table 10).
21. This may be regarded as an additional hint of simplification and/or convergence.
22. Also, the use of subjunctive with the verb decidir conveys a higher degree of assertiveness than the alternatives debería elaborar and debe elaborar.
23. A further note on these patterns would be their similarity with other categories. The example The Commission would like to reiterate contains a grammatical pattern, but it conveys a specific idiomatic meaning (“The Commission sends a friendly reminder”). Conversely, the patterns Commission + maintains and Commission + understands could be also classified as text-organizing, given that they are embedded in the concluding section of petitions.

References

Aston, Guy
2018 “Acquiring the Language of Interpreters: A Corpus-based Approach.” In Making Way in Corpus-based Interpreting Studies, edited by Mariachiara Russo, Claudio Bendazzoli & Bart Defrancq, 83–96. Singapore: Springer.
Bernardini, Silvia, Adriano Ferraresi, Mariachiara Russo, Camille Collard, and Bart Defrancq
2018 “Building Interpreting and Intermodal Corpora: A How-to for a Formidable Task.” In Making Way in Corpus-based Interpreting Studies, edited by Mariachiara Russo, Claudio Bendazzoli & Bart Defrancq, 21–42. Singapore: Springer.
Biel, Łucja
2014 “Phraseology in legal translation: A corpus-based analysis of textual mapping in EU law.” In The Ashgate Handbook of Legal Translation, edited by Le Cheng, King Kui Sin & Anne Wagner, 177–192.
2018 “Lexical bundles in EU law: The impact of translation process on the patterning of legal language.” In Phraseology in Legal and Institutional Settings: A Corpus-Based Interdisciplinary Perspective, edited by Stanisław Goźdź-Roszkowski and Gianluca Pontrandolfo, 11–26. London: Routledge.
2021 “Eurolects and EU Legal Translation.” In The Oxford Handbook of Translation and Social Practices, edited by Meng Ji and Sara Laviosa, 477–500. Online: Oxford University Press.
Biel, Łucja, Agnieszka Biernacka, and Anna Jopek-Bosiacka
2018 “Collocations of Terms in EU Competition Law: A Corpus Analysis of EU English Collocations.” In Language and Law: The Role of Language and Translation in EU Competition Law, edited by Silvia Marino, Łucja Biel, Martina Bajčić and Vilelmini Sosoni, 249–274. Cham: Springer International Publishing.
Biel, Łucja and Agnieszka Doczekalska
2020 “How do supranational terms transfer into national legal systems?” Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 26(2):184–212.
Biel, Łucja, Dariusz Koźbiał, and Katarzyna Wasilewska
2019 “The formulaicity of translations across EU institutional genres: A corpus-driven analysis of lexical bundles in translated and non-translated language.” Translation Spaces 8(1):67–92.
Biel, Łucja and Izabela Pytel
2021 “Corrigenda of EU Legislative Acts as an Indicator of Quality Assurance Failures.” In Institutional Translation and Interpreting, edited by Fernando Prieto Ramos, 150–173. New York: Routledge.
Blini, Lorenzo
2018 “Observing Eurolects: The case of Spanish.” In Observing Eurolects: Corpus analysis of linguistic variation in EU law, edited by Laura Mori, 329–367.
Burtsev, Mikhail, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara Baymurzina, Nickolay Bushkov, and Marat Zaynutdinov
2018 “Deeppavlov: Open-source library for dialogue systems.” In Proceedings of ACL 2018, System Demonstrations, 122–127. Melbourne: Association for Computational Linguistics.
Corpas Pastor, Gloria
2017 “Collocations in E-Bilingual Dictionaries: From Underlying Theoretical Assumptions to Practical Lexicography and Translation Issues.” In Collocations and Other Lexical Combinations in Spanish. Theoretical and Applied Approaches, edited by Sergi Torner Castells and Elisenda Bernal, 139–160. London: Routledge.
2021 “Technology Solutions for Interpreters: The VIP System.” Hermēneus. Revista de Traducción e Interpretación 23:91–123.
Corpas Pastor, Gloria & Fernando Sánchez Rodas
2022 “NLP-enhanced Shift Analysis of Named Entities in an English Spanish Intermodal Corpus of European Petitions.” In Mediated discourse at the European Parliament: Empirical investigations, edited by Marta Kajzer-Wietrzny, Adriano Ferraresi, Ilmari Ivaska & Silvia Bernardini, 219–251. Berlin: Language Science Press. https://langsci-press.org/catalog/book/343
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova
2019 “Bert: Pre-training of deep bidirectional transformers for language understanding.” https://arxiv.org/abs/1810.04805v2
Dobrić Basaneže, Katja
2017 “Interpreting Phraseological Units in Contracts: The Case of Extended Term–Embedding Collocation.” Suvremena Lingvistika 43(84):199–216.
European Parliament
European Union
2012a Interinstitutional Style Guide 2011. Luxembourg: Publications Office of the European Union.
2012b Libro de estilo interinstitucional 2011. Luxembourg: Publications Office of the European Union.
Ferraresi, Adriano, Silvia Bernardini, Marie-Aude Lefer, and Maja Miličević
2017 “Investigating the language of written translation and simultaneous interpretation: Simplification in EPTIC.” In Congrès Mondial de Traductologie (Université de Paris-Nanterre, du 10/04/2017 au 14/04/2017). http://hdl.handle.net/2078.1/185346
Ferraresi, Adriano and Maja Miličević
2017 “Phraseological patterns in interpreting and translation: similar or different?” In Empirical Translation Studies: New Methodological and Theoretical Traditions, edited by Gert De Sutter, Marie-Aude Lefer and Isabelle Delaere, 157–182. Berlin: De Gruyter Mouton.
Goffin, Roger
1994 “L’eurolecte : oui, jargon communautaire : non.” Meta 39(4):636–642.
Goźdź-Roszkowski, Stanisław
2011 Patterns of Linguistic Variation in American Legal English: A Corpus-Based Study. Frankfurt am Main: Peter Lang.
2012 “Discovering Patterns and Meanings: Corpus Perspectives on Phraseology in Legal Discourse.” Roczniki Humanistyczne 60(8):47–70. https://www.ceeol.com/search/article-detail?id=129241
Goźdź-Roszkowski, Stanisław and Gianluca Pontrandolfo
eds. 2018 Phraseology in legal and institutional settings: A corpus-based interdisciplinary perspective. London: Routledge.
2015a “Legal Phraseology Today: Corpus-based Applications Across Legal Languages and Genres.” Fachsprache 37(3–4):130–138. DOI: 10.24989/fs.v37i3-4.1287.
eds. 2015b “Legal Phraseology Today. A Corpus-based View.” Fachsprache 37(3–4).
Henriksen, Line
2007 “The song in the booth: Formulaic interpreting and oral textualisation.” Interpreting 9(1):1–20. DOI logoGoogle Scholar
Hrežo, Vladimir
2020 “Exploring Phraseology in EU Legal Discourse.” Language – Culture – Politics 1:29–52.Google Scholar
Jacquet, Guillaume, Maud Ehrmann, Jakub Piskorski, Hristo Tanev, and Ralf Steinberger
2019 “Cross-lingual linking of multi-word entities and language-dependent learning of multi-word entity patterns.” In Representation and Parsing of Multiword Expressions: Current trends, edited by Yannick Parmentier and Jakub Waszczuk, 269–297. Berlin: Language Science Press. DOI logoGoogle Scholar
Kajzer-Wietrzny, Marta and Łukasz Grabowski
2021 “Formulaicity in Constrained Communication: An Intermodal Approach.” MonTI. Monografías de Traducción e Interpretación 13:148–83.
Klabal, Ondřej
2019 “Corpora in Legal Translation: Overcoming Terminological and Phraseological Assymetries between Czech and English.” CLINA: Revista Interdisciplinaria de Traducción, Interpretación y Comunicación Intercultural 5(2):165–86.
Nasar, Zara, Syed Waqar Jaffry, and Muhammad Kamran Malik
2021 “Named Entity Recognition and Relation Extraction: State-of-the-Art.” ACM Computing Surveys 54(1):1–39.
Nouvel, Damien, Maud Ehrmann, and Sophie Rosset
eds. 2016 Named Entities for Computational Linguistics. Hoboken, NJ: John Wiley & Sons, Inc.
Pontrandolfo, Gianluca
2011 “Phraseology in criminal judgments: A corpus study of original vs. translated Italian.” Sendebar 22:209–234.
2015 “Investigating Judicial Phraseology with COSPE: A contrastive Corpus-based Study.” In New directions in corpus-based translation studies, edited by Claudio Fantinuoli and Federico Zanettin, 137–159. Berlin: Language Science Press.
2021 “National and EU judicial phraseology under the magnifying glass: a corpus-assisted analysis of complex prepositions in Spanish.” Perspectives 29(2):260–277.
Prieto Ramos, Fernando
2021 “Translating legal terminology and phraseology: between inter-systemic incongruity and multilingual harmonization.” Perspectives 29(2):175–183.
Sandrelli, Annalisa
2018 “Observing Eurolects: The case of English.” In Observing Eurolects: Corpus analysis of linguistic variation in EU law, edited by Laura Mori, 63–92.
Santandrea, Manuela
2014 Le collocazioni in traduzione e interpretazione tra italiano e inglese: uno studio su EPTIC_01_2011. Università di Bologna. https://amslaurea.unibo.it/cgi/users/home?screen=EPrint%3A%3AView&eprintid=7839
Seracini, Francesca L.
2020 “Phraseology in multilingual EU legislation: a corpus-based study of translated multi-word terms.” Perspectives 29:245–259.
Steinberger, Josef, Polina Lenkova, Mijail Kabadjov, Ralf Steinberger, and Erik Van Der Goot
2011 “Multilingual entity-centered sentiment analysis evaluated by parallel corpora.” In International Conference Recent Advances in Natural Language Processing, RANLP, edited by Galia Angelova, Kalina Bontcheva, Ruslan Mitkov and Nikolai Nikolov, 770–775. Hissar: Association for Computational Linguistics.
Teich, Elke
2003 Cross-Linguistic Variation in System and Text: A Methodology for the Investigation of Translations and Comparable Texts. Berlin: Mouton de Gruyter.
Trklja, Aleksandar
2018 “A corpus investigation of formulaicity and hybridity in legal language: A case of EU case law texts.” In Phraseology in Legal and Institutional Settings: A Corpus-Based Interdisciplinary Perspective, edited by Stanisław Goźdź-Roszkowski and Gianluca Pontrandolfo, 89–108. London: Routledge.
Vigier-Moreno, Francisco Javier and María del Mar Sánchez Ramos
2017 “Using parallel corpora to study the translation of legal system-bound terms: The case of names of English and Spanish courts.” In Computational and Corpus-Based Phraseology. Second International Conference, Europhras 2017 London, UK, November 13–14, 2017 Proceedings, edited by Ruslan Mitkov, 260–273. Cham: Springer.
Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, and Ann Houston
2013 “OntoNotes Release 5.0 LDC2013T19.” Philadelphia: Linguistic Data Consortium.