<TitleType>01</TitleType> <TitleText textformat="02">Parallel Corpora for Contrastive and Translation Studies</TitleText> <Subtitle textformat="02">New resources and applications</Subtitle>

219-7677 10 7500817 John Benjamins Publishing Company Marketing Department / Karin Plijnaar, Pieter Lamers onix@benjamins.nl 201903221314 ONIX title feed eng 01 EUR

943019008 03 01 01 JB John Benjamins Publishing Company 01 JB code SCL 90 Eb 15 9789027262844 06 10.1075/scl.90 13 2019001810 DG 002 02 01 SCL 02 1388-0373 Studies in Corpus Linguistics 90 <TitleType>01</TitleType> <TitleText textformat="02">Parallel Corpora for Contrastive and Translation Studies</TitleText> <Subtitle textformat="02">New resources and applications</Subtitle> 01 scl.90 01 https://benjamins.com 02 https://benjamins.com/catalog/scl.90 1 B01 Irene Doval Doval, Irene Irene Doval University of Santiago de Compostela 2 B01 M. Teresa Sánchez Nieto Sánchez Nieto, M. Teresa M. Teresa Sánchez Nieto University of Valladolid 01 eng 311 ix 301 LAN009000 v.2006 CFX 2 24 JB Subject Scheme LIN.COMP Comparative linguistics 24 JB Subject Scheme LIN.COMPUT Computational & corpus linguistics 24 JB Subject Scheme LIN.CORP Corpus linguistics 24 JB Subject Scheme TRAN.TRANSL Translation Studies 06 01 This volume assesses the state of the art of parallel corpus research as a whole, reporting on advances both in the recent development of parallel corpora (with some reference to comparable corpora as well) and in ways of exploiting them for a variety of purposes. The first part of the book is devoted to new roles that parallel corpora can and should assume in translation studies and in contrastive linguistics, to the usefulness and usability of parallel corpora, and to advances in parallel corpus alignment, annotation and retrieval. There follows an up-to-date presentation of a number of parallel corpus projects currently being carried out in Europe, some of them multimodal, with certain chapters illustrating case studies developed on the basis of the corpora at hand. In most of these chapters, attention is paid to specific technical issues of corpus building. The third part of the book reflects on specific applications and on the creation of bilingual resources from parallel corpora. 
This volume will be welcomed by scholars, postgraduate and PhD students in the fields of contrastive linguistics, translation studies, lexicography, language teaching and learning, machine translation, and natural language processing. 04 09 01 https://benjamins.com/covers/475/scl.90.png 04 03 01 https://benjamins.com/covers/475_jpg/9789027202345.jpg 04 03 01 https://benjamins.com/covers/475_tif/9789027202345.tif 06 09 01 https://benjamins.com/covers/1200_front/scl.90.hb.png 07 09 01 https://benjamins.com/covers/125/scl.90.png 25 09 01 https://benjamins.com/covers/1200_back/scl.90.hb.png 27 09 01 https://benjamins.com/covers/3d_web/scl.90.hb.png 10 01 JB code scl.90.prelim i iv 4 Prelim pages -1 <TitleType>01</TitleType> <TitleText textformat="02">Prelim pages</TitleText> 10 01 JB code scl.90.toc v viii 4 Table of contents 0 <TitleType>01</TitleType> <TitleText textformat="02">Table of contents</TitleText> 10 01 JB code scl.90.ack ix 1 Acknowledgments 1 <TitleType>01</TitleType> <TitleText textformat="02">Acknowledgments</TitleText> 10 01 JB code scl.90.01dov 1 15 15 Introduction 2 <TitleType>01</TitleType> <TitleText textformat="02">Parallel corpora in focus</TitleText> <Subtitle textformat="02">An account of current achievements and challenges</Subtitle> 1 A01 Irene Doval Doval, Irene Irene Doval 2 A01 M. Teresa Sánchez Nieto Sánchez Nieto, M. Teresa M. Teresa Sánchez Nieto 10 01 JB code scl.90.p1 17 90 74 Section header 3 <TitleType>01</TitleType> <TitleText textformat="02">Part I. 
Parallel corpora</TitleText> <Subtitle textformat="02">Background and processing</Subtitle> 10 01 JB code scl.90.02har 19 38 20 Chapter 4 <TitleType>01</TitleType> <TitleText textformat="02">Comparable parallel corpora</TitleText> <Subtitle textformat="02">A critical review of current practices in corpus-based translation studies</Subtitle> 1 A01 Lidun Hareide Hareide, Lidun Lidun Hareide Møreforsking Molde Norway 20 comparable parallel corpora 20 the Gravitational Pull Hypothesis 20 unique items 01 Are papers presented in corpus-based translation studies truly scientific? The studies they report are normally carried out on only one language pair, often on purpose-made parallel corpora, and cannot normally be replicated. Their value is therefore limited in a strictly scientific sense. The use of comparable parallel corpora allows both for the replication of studies and for the testing of complex hypotheses such as Halverson’s Gravitational Pull Hypothesis. This chapter defines and discusses the concept of comparable parallel corpora, and exemplifies their value by illustrating their use. The chapter also presents hopes for the future, as new groundbreaking technology that will allow the linguist to create her own parallel corpora without the aid of computer scientists is currently being launched at the University of León in Spain. 10 01 JB code scl.90.03mar 39 56 18 Chapter 5 <TitleType>01</TitleType> <TitleText textformat="02">Living with parallel corpora</TitleText> <Subtitle textformat="02">The potentials and limitations of their use in translation research</Subtitle> 1 A01 Josep Marco Marco, Josep Josep Marco University Jaume I 20 comparable corpora 20 COVALT 20 main source of data 20 parallel corpora 20 supplementary source of data 01 Parallel corpora can be used in translation research in at least two ways: as the main source of data or as a supplement to data retrieved from a comparable corpus, enabling data triangulation. 
In the former scenario, they may throw light on contrastive aspects or on translator techniques and methods. In the latter they will tend to be searched to account for differences perceived between the two components of a comparable corpus. Two case studies will be put forward in order to illustrate these two uses of parallel corpora. Both draw on the English-Catalan subcorpus of COVALT (Valencian Corpus of Translated Literature). The first analyses the translation of meal names whereas the second focuses on the -ment adverb + adjective construction. 10 01 JB code scl.90.04rab 57 78 22 Chapter 6 <TitleType>01</TitleType> <TitleText textformat="02">Working with parallel corpora</TitleText> <Subtitle textformat="02">Usefulness and usability</Subtitle> 1 A01 Rosa Rabadán Rabadán, Rosa Rosa Rabadán University of Leon 20 parallel corpora applications 20 parallel corpora reusability 20 parallel corpora uses 01 Although parallel corpora are vital for cross-linguistic and natural language processing (NLP) research, most have been designed for just one particular purpose, which may unnecessarily restrict their usefulness and usability. My argument is that the usefulness of existing parallel corpora increases exponentially when data so obtained are combined with those yielded by comparable and/or monolingual corpora. Usability criteria such as the choice of processing tools and adherence to international standards, among others, also have an impact on corpus usefulness. This chapter proposes courses of action that serve to improve the recycling and reprocessing of available resources. It also presents a corpus-based, post-editing and quality assessment application as an illustration of the multifarious uses parallel corpora may serve. 
10 01 JB code scl.90.05vol 79 90 12 Chapter 7 <TitleType>01</TitleType> <TitleText textformat="02">Innovations in parallel corpus alignment and retrieval</TitleText> 1 A01 Martin Volk Volk, Martin Martin Volk University of Zurich 20 corpus annotation 20 corpus retrieval 20 multiparallel corpora 20 word alignment 01 In this chapter, we give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as Part-of-Speech tagging, lemmatization and dependency parsing, but we also introduce language-specific methods, for example for dealing with split verbs or truncated compounds in German. Our corpus annotation includes the identification of code-switching within sentences as a special case of language identification. We argue for careful sentence and word alignment for parallel corpora, and we explain how word alignment is the basis for a wide range of applications from translation variant ranking to lemma disambiguation. 10 01 JB code scl.90.p2 93 247 155 Section header 8 <TitleType>01</TitleType> <TitleText textformat="02">Part II. Parallel corpora</TitleText> <Subtitle textformat="02">Creation, annotation and access</Subtitle> 10 01 JB code scl.90.06cer 93 101 9 Chapter 9 <TitleType>01</TitleType> <TitleText textformat="02">InterCorp</TitleText> <Subtitle textformat="02">A parallel corpus of 40 languages</Subtitle> 1 A01 Petr Čermák Čermák, Petr Petr Čermák Charles University Prague 20 comparison of languages 20 Czech National Corpus 20 InterCorp 20 parallel corpus 20 Spanish 01 This chapter presents the current version of InterCorp, a parallel corpus created at the Faculty of Arts, Charles University in Prague. The corpus contains texts in Czech aligned with one or more foreign-language versions; in total it covers Czech and 39 other languages. 
The chapter analyses its structure and technical parameters, and describes some technical tools used with the corpus (Kontext, a corpus query interface, and InterText, a parallel text alignment editor created specifically for the project). Similarly, the contribution discusses Treq (Translation Equivalents Database), a collection of bilingual Czech-foreign language dictionaries built automatically from InterCorp. In the last section of the chapter, the possibilities for methodological and linguistic exploitation of the corpus are discussed. 10 01 JB code scl.90.07dov 103 121 19 Chapter 10 <TitleType>01</TitleType> <TitleText textformat="02">Corpus PaGeS</TitleText> <Subtitle textformat="02">A multifunctional resource for language learning, translation and cross-linguistic research</Subtitle> 1 A01 Irene Doval Doval, Irene Irene Doval University of Santiago de Compostela 2 A01 Santiago Fernández Lanza Lanza, Santiago Fernández Santiago Fernández Lanza University of Santiago de Compostela 3 A01 Tomás Jiménez Juliá Juliá, Tomás Jiménez Tomás Jiménez Juliá University of Santiago de Compostela 4 A01 Elsa Liste Lamas Liste Lamas, Elsa Elsa Liste Lamas University of Santiago de Compostela 5 A01 Barbara Lübke Lübke, Barbara Barbara Lübke University of Santiago de Compostela 20 corpus alignment 20 corpus visualization 20 German 20 parallel corpora 20 Spanish 01 This chapter presents the bilingual parallel corpus PaGeS, compiled by the research group SpatiAlEs from the University of Santiago de Compostela. PaGeS currently amounts to nearly 20 million tokens and consists of texts originally written in German and in Spanish and their corresponding translations into the other language, as well as a small portion of German and Spanish translations from third languages. The present contribution introduces the main characteristics of the PaGeS corpus, focusing on its design and compilation. 
It first explains the criteria for the selection of the texts and the details of text pre-processing, automatic alignment and manual review. It then addresses the search and display features describing the server architecture and indexing process. Finally, the intended development of the PaGeS corpus is briefly discussed. 10 01 JB code scl.90.08fer 123 139 17 Chapter 11 <TitleType>01</TitleType> <TitleText textformat="02">Building EPTIC</TitleText> <Subtitle textformat="02">A many-sided, multi-purpose corpus of EU parliament proceedings</Subtitle> 1 A01 Adriano Ferraresi Ferraresi, Adriano Adriano Ferraresi University of Bologna 2 A01 Silvia Bernardini Bernardini, Silvia Silvia Bernardini University of Bologna 20 corpus annotation 20 intermodal corpora 20 loan words 20 text-to-text alignment 20 text-to-video alignment 01 This chapter describes the steps involved in the construction of EPTIC, an intermodal corpus of European Parliament speeches. Despite its limited size, this corpus has features that justify its labour-intensive building process, in particular its multiple alignments. The text-to-text alignments allow users to compare interpretations and translations of source speeches and their written-up reports, while text-to-video alignments allow them to access the multimedia components from concordance lines. To illustrate the potential of EPTIC, a case study is presented of English loan words in original, translated and interpreted Italian and French. Results suggest that borrowing is more likely to occur in translated Italian than in any of the other corpus components. 
10 01 JB code scl.90.09gom 141 158 18 Chapter 12 <TitleType>01</TitleType> <TitleText textformat="02">Enriching parallel corpora with multimedia and lexical semantics</TitleText> <Subtitle textformat="02">From the CLUVI Corpus to WordNet and SemCor</Subtitle> 1 A01 Xavier Gomez Guinovart Gomez Guinovart, Xavier Xavier Gomez Guinovart University of Vigo 20 lexical semantics 20 multimedia 20 parallel corpora 20 SemCor 20 WordNet 01 In this chapter, I present the main characteristics of the CLUVI Corpus, an open collection of sentence-level aligned parallel corpora with over 44 million words in nine specialised domains (fiction, computing, popular science, biblical texts, law, consumer information, economy, tourism, and film subtitling) and different language combinations including Galician, Spanish, English, French, Portuguese, Catalan, Italian, Basque and Latin. Then, I present the methodology developed for extending the film subtitles section of the CLUVI Corpus with multimedia data. Finally, I discuss the resources and methods used to build the SensoGal Corpus, a SemCor-based English-Galician parallel corpus semantically annotated based on WordNet and aligned at the sentence and word levels. 10 01 JB code scl.90.10lop 159 182 24 Chapter 13 <TitleType>01</TitleType> <TitleText textformat="02">Discourse annotation in the MULTINOT corpus</TitleText> <Subtitle textformat="02">Issues and challenges</Subtitle> 1 A01 Julia Lavid-López Lavid-López, Julia Julia Lavid-López Complutense University of Madrid 20 annotation 20 corpus 20 discourse 20 English 20 Spanish 01 This chapter summarises and discusses recent work on the development of a bilingual (English-Spanish) corpus consisting of original comparable and parallel texts from a variety of genres and annotated with complex linguistic features such as modality and evidentiality, metadiscourse markers, and thematization, as carried out within the framework of the MULTINOT project. 
The annotation of these complex features in bilingual parallel texts poses important challenges for the researcher at the different stages of the corpus development, from the preprocessing phases to the manual annotation phase, but, at the same time, it allows the investigation of complex linguistic research questions which could not be addressed on the basis of raw corpora or even with the help of an automatic part-of-speech tagging system. 10 01 JB code scl.90.11mik 183 195 13 Chapter 14 <TitleType>01</TitleType> <TitleText textformat="02">PEST</TitleText> <Subtitle textformat="02">A parallel electronic corpus of state treaties</Subtitle> 1 A01 Mikhail Mikhailov Mikhailov, Mikhail Mikhail Mikhailov University of Tampere, Finland 2 A01 Miia Santalahti Santalahti, Miia Miia Santalahti University of Tampere 3 A01 Julia Souma Souma, Julia Julia Souma University of Tampere 20 balanced corpus 20 compiling parallel corpora 20 language of state treaties 20 legal language 01 This chapter introduces the Parallel Electronic corpus of State Treaties (PEST). The current plan is to compile a parallel corpus, which will include treaties concluded between Russia and Finland, Finland and Sweden, and Sweden and Russia. In addition, there will be a subcorpus of international conventions in all three languages plus English, to be used as reference data. The chapter describes the structure of the subcorpora (number of documents, their chronological distribution and topics featured), and it also addresses the challenges of balancing such a corpus. In the future, this material can be used for studies ranging from lexicon and semantics to grammar, style, discourse, translation studies, and language for special purposes. 
10 01 JB code scl.90.12mol 197 214 18 Chapter 15 <TitleType>01</TitleType> <TitleText textformat="02">Indexation and analysis of a parallel corpus using CQPweb</TitleText> <Subtitle textformat="02">The COVALT PAR_ES Corpus (EN/FR/DE > ES)</Subtitle> 1 A01 Teresa Molés-Cases Molés-Cases, Teresa Teresa Molés-Cases Universitat Politècnica de València, Universitat Jaume I 2 A01 Ulrike Oster Oster, Ulrike Ulrike Oster Universitat Politècnica de València, Universitat Jaume I 20 corpus compilation 20 corpus indexation 20 COVALT corpus 20 CQPweb 01 This contribution presents a section of the Corpus Valencià de Literatura Traduïda (COVALT), created by the research group of the same name (Department of Translation and Communication, Universitat Jaume I, Spain). The COVALT corpus is a four-million word corpus made up of narrative works originally written in English, French, and German and their Catalan translations published in the autonomous community of Valencia between 1990 and 2000. Since the members of the Covalt group are interested in translation research, and more specifically in the investigation of translated Catalan and Spanish, this corpus has recently been extended to include translations into Spanish published in Spain (COVALT PAR_ES corpus). This chapter presents the COVALT PAR_ES corpus, as well as its process of compilation and analysis with CQPweb. 
10 01 JB code scl.90.13san 215 231 17 Chapter 16 <TitleType>01</TitleType> <TitleText textformat="02">P-ACTRES 2.0</TitleText> <Subtitle textformat="02">A parallel corpus for cross-linguistic research</Subtitle> 1 A01 Hugo Sanjurjo-González Sanjurjo-González, Hugo Hugo Sanjurjo-González University of Huddersfield 2 A01 Marlén Izquierdo Izquierdo, Marlén Marlén Izquierdo University of the Basque Country (UPV/EHU) 20 (parallel) corpus compilation 20 ACTRES 20 corpus analysis software 20 web interface 01 This chapter describes an updated version of the ACTRES Parallel Corpus (P-ACTRES 2.0), an English-Spanish bidirectional corpus that contains over 4 million words. The composition of the corpus is recounted, regarding the number of words in each direction, and the types of texts included together with the linguistic variants that users will find in the corpus. Its composition is shaped by research purposes as well as availability issues. The computerization process is also explained, while commenting on the text processing, alignment and tagging. The chapter concludes with a brief demonstration of the usefulness and usability of P-ACTRES 2.0 in cross-linguistic research, be it contrastive linguistics or translation studies either independently or, most importantly, jointly. 10 01 JB code scl.90.14san 233 247 15 Chapter 17 <TitleType>01</TitleType> <TitleText textformat="02">An overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus</TitleText> <TitlePrefix>An </TitlePrefix> <TitleWithoutPrefix textformat="02">overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus</TitleWithoutPrefix> 1 A01 Zuriñe Sanz-Villar Sanz-Villar, Zuriñe Zuriñe Sanz-Villar University of the Basque Country (UPV/EHU) 20 Aleuska corpus 20 Basque corpora 20 Basque MWEs 20 TAligner 01 Since the 1980s, considerable efforts have been made to create different types of Basque corpora. 
However, to systematically analyse the Basque translations of German literary texts, it was necessary to create a corpus from the ground up. Intermediary versions were included in this corpus whenever the Basque target text was not a translation from the German original but came instead from a translation into another language (Spanish in most cases). A tool called TAligner was used to align the bitexts and the tritexts. The aim of this chapter is, firstly, to provide the reader with an overview of the main Basque corpora. Secondly, I will describe the design and compilation process of a parallel and multilingual corpus using TAligner 3.0. Thirdly, I will present how the corpus has been lemmatized and annotated at the level of part-of-speech. Finally, the process of extracting potential Basque multi-word expressions will be shown. 10 01 JB code scl.90.p3 249 298 50 Section header 18 <TitleType>01</TitleType> <TitleText textformat="02">Part III. Parallel corpora</TitleText> <Subtitle textformat="02">Tools and applications</Subtitle> 10 01 JB code scl.90.15gam 251 265 15 Chapter 19 <TitleType>01</TitleType> <TitleText textformat="02">Strategies for building high quality bilingual lexicons from comparable corpora</TitleText> 1 A01 Pablo Gamallo Otero Gamallo Otero, Pablo Pablo Gamallo Otero University of Santiago de Compostela 20 bilingual lexicons 20 cognates 20 comparable corpora 20 distributional similarity 20 extraction of translation candidates 01 This chapter outlines two strategies to automatically build bilingual dictionaries: One is based on the use of a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies have in common the use of translation equivalents extracted from comparable corpora to filter out odd bilingual pairs and validate the correct ones. 
The correctness of the entries validated with comparable corpora is very high, close to that achieved by using parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries with more than 90% precision. This outperforms state-of-the-art systems on bilingual extraction from comparable corpora, whose best scores hardly reach 70 or 80%. 10 01 JB code scl.90.16gon 267 279 13 Chapter 20 <TitleType>01</TitleType> <TitleText textformat="02">Discovering bilingual collocations in parallel corpora</TitleText> <Subtitle textformat="02">A first attempt at using distributional semantics</Subtitle> 1 A01 Marcos Garcia Garcia, Marcos Marcos Garcia University of A Coruña 2 A01 Marcos García-Salido García-Salido, Marcos Marcos García-Salido University of A Coruña 3 A01 Margarita Alonso-Ramos Alonso-Ramos, Margarita Margarita Alonso-Ramos University of A Coruña 20 collocations 20 distributional semantics 20 parallel corpora 20 phraseology 01 This chapter presents a method that exploits parallel corpora to automatically extract bilingual collocation equivalents. First, we use dependency parsing and statistical measures to identify collocation candidates in corpora. Then, we leverage the parallel corpora to extract bilingual word-embeddings. Finally, we use these distributional models as probabilistic dictionaries in order to identify bilingual collocation equivalents. To evaluate our strategy we carry out a set of experiments in Portuguese and Spanish focusing on verb-object collocations, for example, “reach the maturity” (“atingir a maturidade” in Portuguese, “alcanzar la madurez” in Spanish). The results of our experiments show that this method is useful to automatically identify thousands of bilingual collocation equivalents, achieving a precision of 86%. 
10 01 JB code scl.90.17gho 281 298 18 Chapter 21 <TitleType>01</TitleType> <TitleText textformat="02">Normalization of shorthand forms in French text messages using word embedding and machine translation</TitleText> 1 A01 Parijat Ghoshal Ghoshal, Parijat Parijat Ghoshal Neue Zürcher Zeitung, KOF Swiss Economic Institute 2 A01 Xi Rao Rao, Xi Xi Rao Neue Zürcher Zeitung, KOF Swiss Economic Institute 20 abbreviation/shorthand form normalization 20 character-based machine translation 20 deep learning 20 distributional semantics 20 French 20 Multivec 20 neural networks 20 parallel corpus 20 SMS 20 unsupervised learning 20 word embeddings 01 This chapter focuses on the normalization of abbreviations and shorthand forms used in French text messages. These forms are difficult to normalize, as they mostly cannot be resolved by typical spell checkers and dictionary lookups. First, we aligned normalized and non-normalized French text messages and built a parallel corpus. We then applied two popular approaches for text normalization, namely multilingual word embeddings and character-based machine translation. We compare our results and observe the efficacy of our models in normalizing deletions, substitutions, repetitions, swaps, and insertions made to canonical forms. This is the first paper that uses Multivec and the Belgian SMS corpus collected under the SMS4Science Project. The unsupervised machine learning approach makes the system highly flexible and easily adaptable, and provides a domain-independent method of text normalization. 10 01 JB code scl.90.index 299 301 3 Index 22 <TitleType>01</TitleType> <TitleText textformat="02">Index</TitleText> 02 JBENJAMINS John Benjamins Publishing Company 01 John Benjamins Publishing Company Amsterdam/Philadelphia NL 04 20190320 2019 John Benjamins B.V. 
02 WORLD 13 15 9789027202345 01 JB 3 John Benjamins e-Platform 03 jbe-platform.com 09 WORLD 21 20190315 01 00 99.00 EUR R 01 00 83.00 GBP Z 01 gen 00 149.00 USD S 285019007 03 01 01 JB John Benjamins Publishing Company 01 JB code SCL 90 Hb 15 9789027202345 13 2018047820 BB 01 SCL 02 1388-0373 Studies in Corpus Linguistics 90 <TitleType>01</TitleType> <TitleText textformat="02">Parallel Corpora for Contrastive and Translation Studies</TitleText> <Subtitle textformat="02">New resources and applications</Subtitle> 01 scl.90 01 https://benjamins.com 02 https://benjamins.com/catalog/scl.90 1 B01 Irene Doval Doval, Irene Irene Doval University of Santiago de Compostela 2 B01 M. Teresa Sánchez Nieto Sánchez Nieto, M. Teresa M. Teresa Sánchez Nieto University of Valladolid 01 eng 311 ix 301 LAN009000 v.2006 CFX 2 24 JB Subject Scheme LIN.COMP Comparative linguistics 24 JB Subject Scheme LIN.COMPUT Computational & corpus linguistics 24 JB Subject Scheme LIN.CORP Corpus linguistics 24 JB Subject Scheme TRAN.TRANSL Translation Studies 06 01 This volume assesses the state of the art of parallel corpus research as a whole, reporting on advances in both recent developments of parallel corpora – with some particular references to comparable corpora as well– and in ways of exploiting them for a variety of purposes. The first part of the book is devoted to new roles that parallel corpora can and should assume in translation studies and in contrastive linguistics, to the usefulness and usability of parallel corpora, and to advances in parallel corpus alignment, annotation and retrieval. There follows an up-to-date presentation of a number of parallel corpus projects currently being carried out in Europe, some of them multimodal, with certain chapters illustrating case studies developed on the basis of the corpora at hand. In most of these chapters, attention is paid to specific technical issues of corpus building. 
The third part of the book reflects on specific applications and on the creation of bilingual resources from parallel corpora. This volume will be welcomed by scholars, postgraduate and PhD students in the fields of contrastive linguistics, translation studies, lexicography, language teaching and learning, machine translation, and natural language processing. 04 09 01 https://benjamins.com/covers/475/scl.90.png 04 03 01 https://benjamins.com/covers/475_jpg/9789027202345.jpg 04 03 01 https://benjamins.com/covers/475_tif/9789027202345.tif 06 09 01 https://benjamins.com/covers/1200_front/scl.90.hb.png 07 09 01 https://benjamins.com/covers/125/scl.90.png 25 09 01 https://benjamins.com/covers/1200_back/scl.90.hb.png 27 09 01 https://benjamins.com/covers/3d_web/scl.90.hb.png 10 01 JB code scl.90.prelim i iv 4 Prelim pages -1 <TitleType>01</TitleType> <TitleText textformat="02">Prelim pages</TitleText> 10 01 JB code scl.90.toc v viii 4 Table of contents 0 <TitleType>01</TitleType> <TitleText textformat="02">Table of contents</TitleText> 10 01 JB code scl.90.ack ix 1 Acknowledgments 1 <TitleType>01</TitleType> <TitleText textformat="02">Acknowledgments</TitleText> 10 01 JB code scl.90.01dov 1 15 15 Introduction 2 <TitleType>01</TitleType> <TitleText textformat="02">Parallel corpora in focus</TitleText> <Subtitle textformat="02">An account of current achievements and challenges</Subtitle> 1 A01 Irene Doval Doval, Irene Irene Doval 2 A01 M. Teresa Sánchez Nieto Sánchez Nieto, M. Teresa M. Teresa Sánchez Nieto 10 01 JB code scl.90.p1 17 90 74 Section header 3 <TitleType>01</TitleType> <TitleText textformat="02">Part I. 
Parallel corpora</TitleText> <Subtitle textformat="02">Background and processing</Subtitle> 10 01 JB code scl.90.02har 19 38 20 Chapter 4 <TitleType>01</TitleType> <TitleText textformat="02">Comparable parallel corpora</TitleText> <Subtitle textformat="02">A critical review of current practices in corpus-based translation studies</Subtitle> 1 A01 Lidun Hareide Hareide, Lidun Lidun Hareide Møreforsking Molde Norway 20 comparable parallel corpora 20 the Gravitational Pull Hypothesis 20 unique items 01 Are papers presented in corpus-based translation studies truly scientific? These are normally done on only one language pair, often on purpose-made parallel corpora, and can normally not be replicated. Therefore their value is limited in a strictly scientific sense. The use of comparable parallel corpora allows both for the replication of studies, and the testing of complex hypotheses like Halverson’s Gravitational Pull hypothesis. This chapter defines and discusses the concept of comparable parallel corpora, and exemplifies their value by illustrating their use. The chapter also presents hopes for the future, as new groundbreaking technology that will allow the linguist to create her own parallel corpora without the aid of computer scientists is currently being launched at the University of León in Spain. 10 01 JB code scl.90.03mar 39 56 18 Chapter 5 <TitleType>01</TitleType> <TitleText textformat="02">Living with parallel corpora</TitleText> <Subtitle textformat="02">The potentials and limitations of their use in translation research</Subtitle> 1 A01 Josep Marco Marco, Josep Josep Marco University Jaume I 20 comparable corpora 20 COVALT 20 main source of data 20 parallel corpora 20 supplementary source of data 01 Parallel corpora can be used in translation research in at least two ways: as the main source of data or as a supplement to data retrieved from a comparable corpus, enabling data triangulation. 
In the former scenario, they may throw light on contrastive aspects or on translator techniques and methods. In the latter they will tend to be searched to account for differences perceived between the two components of a comparable corpus. Two case studies will be put forward in order to illustrate these two uses of parallel corpora. Both draw on the English-Catalan subcorpus of COVALT (Valencian Corpus of Translated Literature). The first analyses the translation of meal names whereas the second focuses on the -ment adverb + adjective construction. 10 01 JB code scl.90.04rab 57 78 22 Chapter 6 <TitleType>01</TitleType> <TitleText textformat="02">Working with parallel corpora</TitleText> <Subtitle textformat="02">Usefulness and usability</Subtitle> 1 A01 Rosa Rabadán Rabadán, Rosa Rosa Rabadán University of Leon 20 parallel corpora applications 20 parallel corpora reusability 20 parallel corpora uses 01 Although parallel corpora are vital for cross-linguistic and natural language processing (NLP) research, most have been designed for just one particular purpose, which may unnecessarily restrict their usefulness and usability. My argument is that the usefulness of existing parallel corpora increases exponentially when data so obtained are combined with those yielded by comparable and/or monolingual corpora. Usability criteria such as the choice of processing tools and adherence to international standards, among others, also have an impact on corpus usefulness. This chapter proposes courses of action that serve to improve the recycling and reprocessing of available resources. It also presents a corpus-based, post-editing and quality assessment application as an illustration of the multifarious uses parallel corpora may serve. 
10 01 JB code scl.90.05vol 79 90 12 Chapter 7 <TitleType>01</TitleType> <TitleText textformat="02">Innovations in parallel corpus alignment and retrieval</TitleText> 1 A01 Martin Volk Volk, Martin Martin Volk University of Zurich 20 corpus annotation 20 corpus retrieval 20 multiparallel corpora 20 word alignment 01 In this chapter, we give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as Part-of-Speech tagging, lemmatization and dependency parsing, but we also introduce language-specific methods, for example for dealing with split verbs or truncated compounds in German. Our corpus annotation includes the identification of code-switching within sentences as a special case of language identification. We argue for careful sentence and word alignment for parallel corpora. We also explain how word alignment is the basis for a wide range of applications from translation variant ranking to lemma disambiguation. 10 01 JB code scl.90.p2 93 247 155 Section header 8 <TitleType>01</TitleType> <TitleText textformat="02">Part II. Parallel corpora</TitleText> <Subtitle textformat="02">Creation, annotation and access</Subtitle> 10 01 JB code scl.90.06cer 93 101 9 Chapter 9 <TitleType>01</TitleType> <TitleText textformat="02">InterCorp</TitleText> <Subtitle textformat="02">A parallel corpus of 40 languages</Subtitle> 1 A01 Petr Čermák Čermák, Petr Petr Čermák Charles University Prague 20 comparison of languages 20 Czech National Corpus 20 InterCorp 20 parallel corpus 20 Spanish 01 This chapter presents the current version of InterCorp, a parallel corpus created at the Faculty of Arts, Charles University in Prague. The corpus contains texts in Czech aligned with one or more foreign-language versions; in total it covers Czech and 39 other languages.
The chapter analyses its structure and technical parameters, and describes some technical tools used with the corpus (Kontext, a corpus query interface, and InterText, a parallel text alignment editor created specifically for the project). The contribution also discusses Treq (Translation Equivalents Database), a collection of bilingual Czech-foreign language dictionaries built automatically from InterCorp. In the last section of the chapter, the possibilities for methodological and linguistic exploitation of the corpus are discussed. 10 01 JB code scl.90.07dov 103 121 19 Chapter 10 <TitleType>01</TitleType> <TitleText textformat="02">Corpus PaGeS</TitleText> <Subtitle textformat="02">A multifunctional resource for language learning, translation and cross-linguistic research</Subtitle> 1 A01 Irene Doval Doval, Irene Irene Doval University of Santiago de Compostela 2 A01 Santiago Fernández Lanza Lanza, Santiago Fernández Santiago Fernández Lanza University of Santiago de Compostela 3 A01 Tomás Jiménez Juliá Juliá, Tomás Jiménez Tomás Jiménez Juliá University of Santiago de Compostela 4 A01 Elsa Liste Lamas Liste Lamas, Elsa Elsa Liste Lamas University of Santiago de Compostela 5 A01 Barbara Lübke Lübke, Barbara Barbara Lübke University of Santiago de Compostela 20 corpus alignment 20 corpus visualization 20 German 20 parallel corpora 20 Spanish 01 This chapter presents the bilingual parallel corpus PaGeS, compiled by the research group SpatiAlEs from the University of Santiago de Compostela. PaGeS currently amounts to nearly 20 million tokens and consists of texts originally written in German and in Spanish and their corresponding translations into the other language, as well as a small portion of German and Spanish translations from third languages. The present contribution introduces the main characteristics of the PaGeS corpus, focusing on its design and compilation.
It first explains the criteria for the selection of the texts and the details of text pre-processing, automatic alignment and manual review. It then addresses the search and display features, describing the server architecture and indexing process. Finally, the intended development of the PaGeS corpus is briefly discussed. 10 01 JB code scl.90.08fer 123 139 17 Chapter 11 <TitleType>01</TitleType> <TitleText textformat="02">Building EPTIC</TitleText> <Subtitle textformat="02">A many-sided, multi-purpose corpus of EU parliament proceedings</Subtitle> 1 A01 Adriano Ferraresi Ferraresi, Adriano Adriano Ferraresi University of Bologna 2 A01 Silvia Bernardini Bernardini, Silvia Silvia Bernardini University of Bologna 20 corpus annotation 20 intermodal corpora 20 loan words 20 text-to-text alignment 20 text-to-video alignment 01 This chapter describes the steps involved in the construction of EPTIC, an intermodal corpus of European Parliament speeches. Despite its limited size, this corpus has features that justify its labour-intensive building process, in particular its multiple alignments. The text-to-text alignments allow users to compare interpretations and translations of source speeches and their written-up reports, while text-to-video alignments allow them to access the multimedia components from concordance lines. To illustrate the potential of EPTIC, a case study is presented of English loan words in original, translated and interpreted Italian and French. Results suggest that borrowing is more likely to occur in translated Italian than in any of the other corpus components.
10 01 JB code scl.90.09gom 141 158 18 Chapter 12 <TitleType>01</TitleType> <TitleText textformat="02">Enriching parallel corpora with multimedia and lexical semantics</TitleText> <Subtitle textformat="02">From the CLUVI Corpus to WordNet and SemCor</Subtitle> 1 A01 Xavier Gomez Guinovart Gomez Guinovart, Xavier Xavier Gomez Guinovart University of Vigo 20 lexical semantics 20 multimedia 20 parallel corpora 20 SemCor 20 WordNet 01 In this chapter, I present the main characteristics of the CLUVI Corpus, an open collection of sentence-level aligned parallel corpora with over 44 million words in nine specialised domains (fiction, computing, popular science, biblical texts, law, consumer information, economy, tourism, and film subtitling) and different language combinations including Galician, Spanish, English, French, Portuguese, Catalan, Italian, Basque and Latin. Then, I present the methodology developed for extending the film subtitles section of the CLUVI Corpus with multimedia data. Finally, I discuss the resources and methods used to build the SensoGal Corpus, a SemCor-based English-Galician parallel corpus semantically annotated based on WordNet and aligned at the sentence and word levels. 10 01 JB code scl.90.10lop 159 182 24 Chapter 13 <TitleType>01</TitleType> <TitleText textformat="02">Discourse annotation in the MULTINOT corpus</TitleText> <Subtitle textformat="02">Issues and challenges</Subtitle> 1 A01 Julia Lavid-López Lavid-López, Julia Julia Lavid-López Complutense University of Madrid 20 annotation 20 corpus 20 discourse 20 English 20 Spanish 01 This chapter summarises and discusses recent work on the development of a bilingual (English-Spanish) corpus consisting of original comparable and parallel texts from a variety of genres and annotated with complex linguistic features such as modality and evidentiality, metadiscourse markers, and thematization, as carried out within the framework of the MULTINOT project. 
The annotation of these complex features in bilingual parallel texts poses important challenges for the researcher at the different stages of the corpus development, from the preprocessing phases to the manual annotation phase, but, at the same time, it allows the investigation of complex linguistic research questions which could not be addressed on the basis of raw corpora or even with the help of an automatic part-of-speech tagging system. 10 01 JB code scl.90.11mik 183 195 13 Chapter 14 <TitleType>01</TitleType> <TitleText textformat="02">PEST</TitleText> <Subtitle textformat="02">A parallel electronic corpus of state treaties</Subtitle> 1 A01 Mikhail Mikhailov Mikhailov, Mikhail Mikhail Mikhailov University of Tampere, Finland 2 A01 Miia Santalahti Santalahti, Miia Miia Santalahti University of Tampere 3 A01 Julia Souma Souma, Julia Julia Souma University of Tampere 20 balanced corpus 20 compiling parallel corpora 20 language of state treaties 20 legal language 01 This chapter introduces the Parallel Electronic corpus of State Treaties (PEST). The current plan is to compile a parallel corpus, which will include treaties concluded between Russia and Finland, Finland and Sweden, and Sweden and Russia. In addition, there will be a subcorpus of international conventions in all three languages plus English, to be used as reference data. The chapter describes the structure of the subcorpora (number of documents, their chronological distribution and topics featured), and it also addresses the challenges of balancing such a corpus. In the future, this material can be used for studies ranging from lexicon and semantics to grammar, style, discourse, translation studies, and language for special purposes. 
10 01 JB code scl.90.12mol 197 214 18 Chapter 15 <TitleType>01</TitleType> <TitleText textformat="02">Indexation and analysis of a parallel corpus using CQPweb</TitleText> <Subtitle textformat="02">The COVALT PAR_ES Corpus (EN/FR/DE > ES)</Subtitle> 1 A01 Teresa Molés-Cases Molés-Cases, Teresa Teresa Molés-Cases Universitat Politècnica de València, Universitat Jaume I 2 A01 Ulrike Oster Oster, Ulrike Ulrike Oster Universitat Politècnica de València, Universitat Jaume I 20 corpus compilation 20 corpus indexation 20 COVALT corpus 20 CQPweb 01 This contribution presents a section of the Corpus Valencià de Literatura Traduïda (COVALT), created by the research group of the same name (Department of Translation and Communication, Universitat Jaume I, Spain). The COVALT corpus is a four-million word corpus made up of narrative works originally written in English, French, and German and their Catalan translations published in the autonomous community of Valencia between 1990 and 2000. Since the members of the Covalt group are interested in translation research, and more specifically in the investigation of translated Catalan and Spanish, this corpus has recently been extended to include translations into Spanish published in Spain (COVALT PAR_ES corpus). This chapter presents the COVALT PAR_ES corpus, as well as its process of compilation and analysis with CQPweb. 
10 01 JB code scl.90.13san 215 231 17 Chapter 16 <TitleType>01</TitleType> <TitleText textformat="02">P-ACTRES 2.0</TitleText> <Subtitle textformat="02">A parallel corpus for cross-linguistic research</Subtitle> 1 A01 Hugo Sanjurjo-González Sanjurjo-González, Hugo Hugo Sanjurjo-González University of Huddersfield 2 A01 Marlén Izquierdo Izquierdo, Marlén Marlén Izquierdo University of the Basque Country (UPV/EHU) 20 (parallel) corpus compilation 20 ACTRES 20 corpus analysis software 20 web interface 01 This chapter describes an updated version of the ACTRES Parallel Corpus (P-ACTRES 2.0), an English-Spanish bidirectional corpus that contains over 4 million words. The composition of the corpus is described in terms of the number of words in each direction, the types of texts included, and the linguistic variants that users will find in the corpus. Its composition is shaped by research purposes as well as availability issues. The computerization process is also explained, with comments on text processing, alignment and tagging. The chapter concludes with a brief demonstration of the usefulness and usability of P-ACTRES 2.0 in cross-linguistic research, be it contrastive linguistics or translation studies, either independently or, most importantly, jointly. 10 01 JB code scl.90.14san 233 247 15 Chapter 17 <TitleType>01</TitleType> <TitleText textformat="02">An overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus</TitleText> <TitlePrefix>An </TitlePrefix> <TitleWithoutPrefix textformat="02">overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus</TitleWithoutPrefix> 1 A01 Zuriñe Sanz-Villar Sanz-Villar, Zuriñe Zuriñe Sanz-Villar University of the Basque Country (UPV/EHU) 20 Aleuska corpus 20 Basque corpora 20 Basque MWEs 20 TAligner 01 Since the 1980s, considerable efforts have been made to create different types of Basque corpora.
However, to systematically analyse the Basque translations of German literary texts, it was necessary to create a corpus from the ground up. Intermediary versions were included in this corpus whenever the Basque target text was not a translation from the German original but came instead from a translation into another language (Spanish in most cases). A tool called TAligner was used to align the bitexts and the tritexts. The aim of this chapter is, firstly, to provide the reader with an overview of the main Basque corpora. Secondly, I will describe the design and compilation process of a parallel and multilingual corpus using TAligner 3.0. Thirdly, I will present how the corpus has been lemmatized and annotated at the level of part-of-speech. Finally, the process of extracting potential Basque multi-word expressions will be shown. 10 01 JB code scl.90.p3 249 298 50 Section header 18 <TitleType>01</TitleType> <TitleText textformat="02">Part III. Parallel corpora</TitleText> <Subtitle textformat="02">Tools and applications</Subtitle> 10 01 JB code scl.90.15gam 251 265 15 Chapter 19 <TitleType>01</TitleType> <TitleText textformat="02">Strategies for building high quality bilingual lexicons from comparable corpora</TitleText> 1 A01 Pablo Gamallo Otero Gamallo Otero, Pablo Pablo Gamallo Otero University of Santiago de Compostela 20 bilingual lexicons 20 cognates 20 comparable corpora 20 distributional similarity 20 extraction of translation candidates 01 This chapter outlines two strategies to automatically build bilingual dictionaries: One is based on the use of a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies have in common the use of translation equivalents extracted from comparable corpora to filter out odd bilingual pairs and validate the correct ones. 
The correctness of the entries validated with comparable corpora is very high, close to that achieved by using parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries with more than 90% precision. This outperforms state-of-the-art systems on bilingual extraction from comparable corpora, whose best scores hardly reach 70 or 80%. 10 01 JB code scl.90.16gon 267 279 13 Chapter 20 <TitleType>01</TitleType> <TitleText textformat="02">Discovering bilingual collocations in parallel corpora</TitleText> <Subtitle textformat="02">A first attempt at using distributional semantics</Subtitle> 1 A01 Marcos Garcia Garcia, Marcos Marcos Garcia University of A Coruña 2 A01 Marcos García-Salido García-Salido, Marcos Marcos García-Salido University of A Coruña 3 A01 Margarita Alonso-Ramos Alonso-Ramos, Margarita Margarita Alonso-Ramos University of A Coruña 20 collocations 20 distributional semantics 20 parallel corpora 20 phraseology 01 This chapter presents a method that exploits parallel corpora to automatically extract bilingual collocation equivalents. First, we use dependency parsing and statistical measures to identify collocation candidates in corpora. Then, we leverage the parallel corpora to extract bilingual word-embeddings. Finally, we use these distributional models as probabilistic dictionaries in order to identify bilingual collocation equivalents. To evaluate our strategy we carry out a set of experiments in Portuguese and Spanish focusing on verb-object collocations, for example, “reach the maturity” (“atingir a maturidade” in Portuguese, “alcanzar la madurez” in Spanish). The results of our experiments show that this method is useful to automatically identify thousands of bilingual collocation equivalents, achieving a precision of 86%. 
10 01 JB code scl.90.17gho 281 298 18 Chapter 21 <TitleType>01</TitleType> <TitleText textformat="02">Normalization of shorthand forms in French text messages using word embedding and machine translation</TitleText> 1 A01 Parijat Ghoshal Ghoshal, Parijat Parijat Ghoshal Neue Zürcher Zeitung, KOF Swiss Economic Institute 2 A01 Xi Rao Rao, Xi Xi Rao Neue Zürcher Zeitung, KOF Swiss Economic Institute 20 abbreviation/shorthand form normalization 20 character-based machine translation 20 deep learning 20 distributional semantics 20 French 20 Multivec 20 neural networks 20 parallel corpus 20 SMS 20 unsupervised learning 20 word embeddings 01 This chapter focuses on the normalization of abbreviations and shorthand forms used in French text messages. These forms are difficult to normalize, as they mostly cannot be resolved by typical spell checkers and dictionary lookups. First, we aligned normalized and non-normalized French text messages and built a parallel corpus. We then applied two popular approaches for text normalization, namely multilingual word embeddings and character-based machine translation. We compare our results and observe the efficacy of our models in normalizing deletions, substitutions, repetitions, swaps, and insertions made to canonical forms. This is the first paper that uses Multivec and the Belgian SMS corpus collected under the SMS4Science Project. The unsupervised machine learning approach makes the system highly flexible and easily adaptable, and provides a domain-independent method of text normalization. 10 01 JB code scl.90.index 299 301 3 Index 22 <TitleType>01</TitleType> <TitleText textformat="02">Index</TitleText> 02 JBENJAMINS John Benjamins Publishing Company 01 John Benjamins Publishing Company Amsterdam/Philadelphia NL 04 20190320 2019 John Benjamins B.V.
02 WORLD 08 700 gr 01 JB 1 John Benjamins Publishing Company +31 20 6304747 +31 20 6739773 bookorder@benjamins.nl 01 https://benjamins.com 01 WORLD US CA MX 21 20190315 18 01 02 JB 1 00 99.00 EUR R 02 02 JB 1 00 104.94 EUR R 01 JB 10 bebc +44 1202 712 934 +44 1202 712 913 sales@bebc.co.uk 03 GB 21 20190315 02 02 JB 1 00 83.00 GBP Z 01 JB 2 John Benjamins North America +1 800 562-5666 +1 703 661-1501 benjamins@presswarehouse.com 01 https://benjamins.com 01 US CA MX 21 20190315 1 01 gen 02 JB 1 00 149.00 USD