219-7677
10
7500817
John Benjamins Publishing Company
Marketing Department / Karin Plijnaar, Pieter Lamers
onix@benjamins.nl
201903221314
ONIX title feed
eng
01
EUR
943019008
03
01
01
JB
John Benjamins Publishing Company
01
JB code
SCL 90 Eb
15
9789027262844
06
10.1075/scl.90
13
2019001810
DG
002
02
01
SCL
02
1388-0373
Studies in Corpus Linguistics
90
01
Parallel Corpora for Contrastive and Translation Studies
New resources and applications
01
scl.90
01
https://benjamins.com
02
https://benjamins.com/catalog/scl.90
1
B01
Irene Doval
Doval, Irene
Irene
Doval
University of Santiago de Compostela
2
B01
M. Teresa Sánchez Nieto
Sánchez Nieto, M. Teresa
M. Teresa
Sánchez Nieto
University of Valladolid
01
eng
311
ix
301
LAN009000
v.2006
CFX
2
24
JB Subject Scheme
LIN.COMP
Comparative linguistics
24
JB Subject Scheme
LIN.COMPUT
Computational & corpus linguistics
24
JB Subject Scheme
LIN.CORP
Corpus linguistics
24
JB Subject Scheme
TRAN.TRANSL
Translation Studies
06
01
This volume assesses the state of the art of parallel corpus research as a whole, reporting on advances both in recent developments of parallel corpora – with some particular references to comparable corpora as well – and in ways of exploiting them for a variety of purposes. The first part of the book is devoted to new roles that parallel corpora can and should assume in translation studies and in contrastive linguistics, to the usefulness and usability of parallel corpora, and to advances in parallel corpus alignment, annotation and retrieval. There follows an up-to-date presentation of a number of parallel corpus projects currently being carried out in Europe, some of them multimodal, with certain chapters illustrating case studies developed on the basis of the corpora at hand. In most of these chapters, attention is paid to specific technical issues of corpus building. The third part of the book reflects on specific applications and on the creation of bilingual resources from parallel corpora. This volume will be welcomed by scholars, postgraduate and PhD students in the fields of contrastive linguistics, translation studies, lexicography, language teaching and learning, machine translation, and natural language processing.
04
09
01
https://benjamins.com/covers/475/scl.90.png
04
03
01
https://benjamins.com/covers/475_jpg/9789027202345.jpg
04
03
01
https://benjamins.com/covers/475_tif/9789027202345.tif
06
09
01
https://benjamins.com/covers/1200_front/scl.90.hb.png
07
09
01
https://benjamins.com/covers/125/scl.90.png
25
09
01
https://benjamins.com/covers/1200_back/scl.90.hb.png
27
09
01
https://benjamins.com/covers/3d_web/scl.90.hb.png
10
01
JB code
scl.90.ack
Miscellaneous
1
01
Acknowledgments
10
01
JB code
scl.90.01dov
1
15
15
Chapter
2
01
Parallel corpora in focus
An account of current achievements and challenges
1
A01
Irene Doval
Doval, Irene
Irene
Doval
2
A01
M. Teresa Sánchez Nieto
Sánchez Nieto, M. Teresa
M. Teresa
Sánchez Nieto
10
01
JB code
scl.90.p1
17
90
74
Section header
3
01
Part I. Parallel corpora
Background and processing
10
01
JB code
scl.90.02har
19
38
20
Chapter
4
01
Comparable parallel corpora
A critical review of current practices in corpus-based translation studies
1
A01
Lidun Hareide
Hareide, Lidun
Lidun
Hareide
Møreforsking Molde Norway
20
comparable parallel corpora
20
the Gravitational Pull Hypothesis
20
unique items
01
Are papers presented in corpus-based translation studies truly scientific? These are normally done on only one language pair, often on purpose-made parallel corpora, and can normally not be replicated. Therefore their value is limited in a strictly scientific sense. The use of comparable parallel corpora allows both for the replication of studies, and the testing of complex hypotheses like Halverson’s Gravitational Pull hypothesis. This chapter defines and discusses the concept of comparable parallel corpora, and exemplifies their value by illustrating their use. The chapter also presents hopes for the future, as new groundbreaking technology that will allow the linguist to create her own parallel corpora without the aid of computer scientists is currently being launched at the University of León in Spain.
10
01
JB code
scl.90.03mar
39
56
18
Chapter
5
01
Living with parallel corpora
The potentials and limitations of their use in translation research
1
A01
Josep Marco
Marco, Josep
Josep
Marco
University Jaume I
20
comparable corpora
20
COVALT
20
main source of data
20
parallel corpora
20
supplementary source of data
01
Parallel corpora can be used in translation research in at least two ways: as the main source of data or as a supplement to data retrieved from a comparable corpus, enabling data triangulation. In the former scenario, they may throw light on contrastive aspects or on translator techniques and methods. In the latter they will tend to be searched to account for differences perceived between the two components of a comparable corpus. Two case studies will be put forward in order to illustrate these two uses of parallel corpora. Both draw on the English-Catalan subcorpus of COVALT (Valencian Corpus of Translated Literature). The first analyses the translation of meal names whereas the second focuses on the -ment adverb + adjective construction.
10
01
JB code
scl.90.04rab
57
78
22
Chapter
6
01
Working with parallel corpora
Usefulness and usability
1
A01
Rosa Rabadán
Rabadán, Rosa
Rosa
Rabadán
University of León
20
parallel corpora applications
20
parallel corpora reusability
20
parallel corpora uses
01
Although parallel corpora are vital for cross-linguistic and natural language processing (NLP) research, most have been designed for just one particular purpose, which may unnecessarily restrict their usefulness and usability. My argument is that the usefulness of existing parallel corpora increases exponentially when data so obtained are combined with those yielded by comparable and/or monolingual corpora. Usability criteria such as the choice of processing tools and adherence to international standards, among others, also have an impact on corpus usefulness. This chapter proposes courses of action that serve to improve the recycling and reprocessing of available resources. It also presents a corpus-based, post-editing and quality assessment application as an illustration of the multifarious uses parallel corpora may serve.
10
01
JB code
scl.90.05vol
79
90
12
Chapter
7
01
Innovations in parallel corpus alignment and retrieval
1
A01
Martin Volk
Volk, Martin
Martin
Volk
University of Zurich
20
corpus annotation
20
corpus retrieval
20
multiparallel corpora
20
word alignment
01
In this chapter, we give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as Part-of-Speech tagging, lemmatization and dependency parsing, but we also introduce language-specific methods, for example for dealing with split verbs or truncated compounds in German. Our corpus annotation includes the identification of code-switching within sentences as a special case of language identification. We argue for careful sentence and word alignment for parallel corpora. And we explain how word alignment is the basis for a wide range of applications from translation variant ranking to lemma disambiguation.
10
01
JB code
scl.90.p2
93
247
155
Section header
8
01
Part II. Parallel corpora
Creation, annotation and access
10
01
JB code
scl.90.06cer
93
101
9
Chapter
9
01
InterCorp
A parallel corpus of 40 languages
1
A01
Petr Čermák
Čermák, Petr
Petr
Čermák
Charles University Prague
20
comparison of languages
20
Czech National Corpus
20
InterCorp
20
parallel corpus
20
Spanish
01
This chapter presents the current version of InterCorp, a parallel corpus created at the Faculty of Arts, Charles University in Prague. The corpus contains texts in Czech aligned with one or more foreign-language version(s), including Czech and 39 other languages. The chapter analyses its structure and technical parameters, and describes some technical tools used with the corpus (Kontext, a corpus query interface, and InterText, a parallel text alignment editor created specifically for the project). Similarly, the contribution discusses Treq (Translation Equivalents Database), a collection of bilingual Czech-foreign language dictionaries built automatically from InterCorp. In the last section of the chapter, the possibilities for methodological and linguistic exploitation of the corpus are discussed.
10
01
JB code
scl.90.07dov
103
121
19
Chapter
10
01
Corpus PaGeS
A multifunctional resource for language learning, translation and cross-linguistic research
1
A01
Irene Doval
Doval, Irene
Irene
Doval
University of Santiago de Compostela
2
A01
Santiago Fernández Lanza
Fernández Lanza, Santiago
Santiago
Fernández Lanza
University of Santiago de Compostela
3
A01
Tomás Jiménez Juliá
Jiménez Juliá, Tomás
Tomás
Jiménez Juliá
University of Santiago de Compostela
4
A01
Elsa Liste Lamas
Liste Lamas, Elsa
Elsa
Liste Lamas
University of Santiago de Compostela
5
A01
Barbara Lübke
Lübke, Barbara
Barbara
Lübke
University of Santiago de Compostela
20
corpus alignment
20
corpus visualization
20
German
20
parallel corpora
20
Spanish
01
This chapter presents the bilingual parallel corpus PaGeS, compiled by the research group SpatiAlEs from the University of Santiago de Compostela. PaGeS currently amounts to nearly 20 million tokens and consists of texts originally written in German and in Spanish and their corresponding translations into the other language, as well as a small portion of German and Spanish translations from third languages. The present contribution introduces the main characteristics of the PaGeS corpus, focusing on its design and compilation. It first explains the criteria for the selection of the texts and the details of text pre-processing, automatic alignment and manual review. It then addresses the search and display features, describing the server architecture and indexing process. Finally, the intended development of the PaGeS corpus is briefly discussed.
10
01
JB code
scl.90.08fer
123
139
17
Chapter
11
01
Building EPTIC
A many-sided, multi-purpose corpus of EU parliament proceedings
1
A01
Adriano Ferraresi
Ferraresi, Adriano
Adriano
Ferraresi
University of Bologna
2
A01
Silvia Bernardini
Bernardini, Silvia
Silvia
Bernardini
University of Bologna
20
corpus annotation
20
intermodal corpora
20
loan words
20
text-to-text alignment
20
text-to-video alignment
01
This chapter describes the steps involved in the construction of EPTIC, an intermodal corpus of European Parliament speeches. Despite its limited size, this corpus has features that justify its labour-intensive building process, in particular its multiple alignments. The text-to-text alignments allow users to compare interpretations and translations of source speeches and their written-up reports, while text-to-video alignments allow them to access the multimedia components from concordance lines. To illustrate the potential of EPTIC, a case study is presented of English loan words in original, translated and interpreted Italian and French. Results suggest that borrowing is more likely to occur in translated Italian than in any of the other corpus components.
10
01
JB code
scl.90.09gom
141
158
18
Chapter
12
01
Enriching parallel corpora with multimedia and lexical semantics
From the CLUVI Corpus to WordNet and SemCor
1
A01
Xavier Gomez Guinovart
Gomez Guinovart, Xavier
Xavier
Gomez Guinovart
University of Vigo
20
lexical semantics
20
multimedia
20
parallel corpora
20
SemCor
20
WordNet
01
In this chapter, I present the main characteristics of the CLUVI Corpus, an open collection of sentence-level aligned parallel corpora with over 44 million words in nine specialised domains (fiction, computing, popular science, biblical texts, law, consumer information, economy, tourism, and film subtitling) and different language combinations including Galician, Spanish, English, French, Portuguese, Catalan, Italian, Basque and Latin. Then, I present the methodology developed for extending the film subtitles section of the CLUVI Corpus with multimedia data. Finally, I discuss the resources and methods used to build the SensoGal Corpus, a SemCor-based English-Galician parallel corpus semantically annotated based on WordNet and aligned at the sentence and word levels.
10
01
JB code
scl.90.10lop
159
182
24
Chapter
13
01
Discourse annotation in the MULTINOT corpus
Issues and challenges
1
A01
Julia Lavid-López
Lavid-López, Julia
Julia
Lavid-López
Complutense University of Madrid
20
annotation
20
corpus
20
discourse
20
English
20
Spanish
01
This chapter summarises and discusses recent work on the development of a bilingual (English-Spanish) corpus consisting of original comparable and parallel texts from a variety of genres and annotated with complex linguistic features such as modality and evidentiality, metadiscourse markers, and thematization, as carried out within the framework of the MULTINOT project. The annotation of these complex features in bilingual parallel texts poses important challenges for the researcher at the different stages of the corpus development, from the preprocessing phases to the manual annotation phase, but, at the same time, it allows the investigation of complex linguistic research questions which could not be addressed on the basis of raw corpora or even with the help of an automatic part-of-speech tagging system.
10
01
JB code
scl.90.11mik
183
195
13
Chapter
14
01
PEST
A parallel electronic corpus of state treaties
1
A01
Mikhail Mikhailov
Mikhailov, Mikhail
Mikhail
Mikhailov
University of Tampere, Finland
2
A01
Miia Santalahti
Santalahti, Miia
Miia
Santalahti
University of Tampere
3
A01
Julia Souma
Souma, Julia
Julia
Souma
University of Tampere
20
balanced corpus
20
compiling parallel corpora
20
language of state treaties
20
legal language
01
This chapter introduces the Parallel Electronic corpus of State Treaties (PEST). The current plan is to compile a parallel corpus, which will include treaties concluded between Russia and Finland, Finland and Sweden, and Sweden and Russia. In addition, there will be a subcorpus of international conventions in all three languages plus English, to be used as reference data. The chapter describes the structure of the subcorpora (number of documents, their chronological distribution and topics featured), and it also addresses the challenges of balancing such a corpus. In the future, this material can be used for studies ranging from lexicon and semantics to grammar, style, discourse, translation studies, and language for special purposes.
10
01
JB code
scl.90.12mol
197
214
18
Chapter
15
01
Indexation and analysis of a parallel corpus using CQPweb
The COVALT PAR_ES Corpus (EN/FR/DE > ES)
1
A01
Teresa Molés-Cases
Molés-Cases, Teresa
Teresa
Molés-Cases
Universitat Politècnica de València, Universitat Jaume I
2
A01
Ulrike Oster
Oster, Ulrike
Ulrike
Oster
Universitat Politècnica de València, Universitat Jaume I
20
corpus compilation
20
corpus indexation
20
COVALT corpus
20
CQPweb
01
This contribution presents a section of the Corpus Valencià de Literatura Traduïda (COVALT), created by the research group of the same name (Department of Translation and Communication, Universitat Jaume I, Spain). The COVALT corpus is a four-million-word corpus made up of narrative works originally written in English, French, and German and their Catalan translations published in the autonomous community of Valencia between 1990 and 2000. Since the members of the COVALT group are interested in translation research, and more specifically in the investigation of translated Catalan and Spanish, this corpus has recently been extended to include translations into Spanish published in Spain (COVALT PAR_ES corpus). This chapter presents the COVALT PAR_ES corpus, as well as its process of compilation and analysis with CQPweb.
10
01
JB code
scl.90.13san
215
231
17
Chapter
16
01
P-ACTRES 2.0
A parallel corpus for cross-linguistic research
1
A01
Hugo Sanjurjo-González
Sanjurjo-González, Hugo
Hugo
Sanjurjo-González
University of Huddersfield
2
A01
Marlén Izquierdo
Izquierdo, Marlén
Marlén
Izquierdo
University of the Basque Country (UPV/EHU)
20
(parallel) corpus compilation
20
ACTRES
20
corpus analysis software
20
web interface
01
This chapter describes an updated version of the ACTRES Parallel Corpus (P-ACTRES 2.0), an English-Spanish bidirectional corpus that contains over 4 million words. The composition of the corpus is recounted, regarding the number of words in each direction, and the types of texts included together with the linguistic variants that users will find in the corpus. Its composition is shaped by research purposes as well as availability issues. The computerization process is also explained, while commenting on the text processing, alignment and tagging. The chapter concludes with a brief demonstration of the usefulness and usability of P-ACTRES 2.0 in cross-linguistic research, be it contrastive linguistics or translation studies either independently or, most importantly, jointly.
10
01
JB code
scl.90.14san
233
247
15
Chapter
17
01
An overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus
An
overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus
1
A01
Zuriñe Sanz-Villar
Sanz-Villar, Zuriñe
Zuriñe
Sanz-Villar
University of the Basque Country (UPV/EHU)
20
Aleuska corpus
20
Basque corpora
20
Basque MWEs
20
TAligner
01
Since the 1980s, considerable efforts have been made to create different types of Basque corpora. However, to systematically analyse the Basque translations of German literary texts, it was necessary to create a corpus from the ground up. Intermediary versions were included in this corpus whenever the Basque target text was not a translation from the German original but came instead from a translation into another language (Spanish in most cases). A tool called TAligner was used to align the bitexts and the tritexts. The aim of this chapter is, firstly, to provide the reader with an overview of the main Basque corpora. Secondly, I will describe the design and compilation process of a parallel and multilingual corpus using TAligner 3.0. Thirdly, I will present how the corpus has been lemmatized and annotated at the level of part-of-speech. Finally, the process of extracting potential Basque multi-word expressions will be shown.
10
01
JB code
scl.90.p3
249
298
50
Section header
18
01
Part III. Parallel corpora
Tools and applications
10
01
JB code
scl.90.15gam
251
265
15
Chapter
19
01
Strategies for building high quality bilingual lexicons from comparable corpora
1
A01
Pablo Gamallo Otero
Gamallo Otero, Pablo
Pablo
Gamallo Otero
University of Santiago de Compostela
20
bilingual lexicons
20
cognates
20
comparable corpora
20
distributional similarity
20
extraction of translation candidates
01
This chapter outlines two strategies to automatically build bilingual dictionaries: one is based on the use of a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies have in common the use of translation equivalents extracted from comparable corpora to filter out odd bilingual pairs and validate the correct ones. The correctness of the entries validated with comparable corpora is very high, close to that achieved by using parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries with more than 90% precision. This outperforms state-of-the-art systems on bilingual extraction from comparable corpora, whose best scores hardly reach 70 or 80%.
10
01
JB code
scl.90.16gon
267
279
13
Chapter
20
01
Discovering bilingual collocations in parallel corpora
A first attempt at using distributional semantics
1
A01
Marcos Garcia
Garcia, Marcos
Marcos
Garcia
University of A Coruña
2
A01
Marcos García-Salido
García-Salido, Marcos
Marcos
García-Salido
University of A Coruña
3
A01
Margarita Alonso-Ramos
Alonso-Ramos, Margarita
Margarita
Alonso-Ramos
University of A Coruña
20
collocations
20
distributional semantics
20
parallel corpora
20
phraseology
01
This chapter presents a method that exploits parallel corpora to automatically extract bilingual collocation equivalents. First, we use dependency parsing and statistical measures to identify collocation candidates in corpora. Then, we leverage the parallel corpora to extract bilingual word-embeddings. Finally, we use these distributional models as probabilistic dictionaries in order to identify bilingual collocation equivalents. To evaluate our strategy we carry out a set of experiments in Portuguese and Spanish focusing on verb-object collocations, for example, “reach the maturity” (“atingir a maturidade” in Portuguese, “alcanzar la madurez” in Spanish). The results of our experiments show that this method is useful to automatically identify thousands of bilingual collocation equivalents, achieving a precision of 86%.
10
01
JB code
scl.90.17gho
281
298
18
Chapter
21
01
Normalization of shorthand forms in French text messages using word embedding and machine translation
1
A01
Parijat Ghoshal
Ghoshal, Parijat
Parijat
Ghoshal
Neue Zürcher Zeitung, KOF Swiss Economic Institute
2
A01
Xi Rao
Rao, Xi
Xi
Rao
Neue Zürcher Zeitung, KOF Swiss Economic Institute
20
abbreviation/shorthand form normalization
20
character-based machine translation
20
deep learning
20
distributional semantics
20
French
20
Multivec
20
neural networks
20
parallel corpus
20
SMS
20
unsupervised learning
20
word embeddings
01
This chapter focuses on the normalization of abbreviations and shorthand forms used in French text messages. These forms are difficult to normalize, as they mostly cannot be resolved by typical spell checkers and dictionary lookups. Firstly, we aligned normalized and non-normalized French text messages and built a parallel corpus. We applied two popular approaches for text normalization, namely multilingual word embeddings and character-based machine translation. We compare our results and observe the efficacy of our models while normalizing deletions, substitutions, repetitions, swaps, and insertions made to canonical forms. This is the first paper that uses Multivec and the Belgian SMS corpus collected under the SMS4Science Project. The unsupervised machine learning approach makes the system highly flexible and easily adaptable, and provides a domain-independent method of text normalization.
10
01
JB code
scl.90.ind
Miscellaneous
22
01
Index
02
JBENJAMINS
John Benjamins Publishing Company
01
John Benjamins Publishing Company
Amsterdam/Philadelphia
NL
04
20190320
2019
John Benjamins B.V.
02
WORLD
13
15
9789027202345
01
JB
3
John Benjamins e-Platform
03
jbe-platform.com
09
WORLD
21
20190315
01
00
99.00
EUR
R
01
00
83.00
GBP
Z
01
gen
00
149.00
USD
S
285019007
03
01
01
JB
John Benjamins Publishing Company
01
JB code
SCL 90 Hb
15
9789027202345
13
2018047820
BB
01
SCL
02
1388-0373
Studies in Corpus Linguistics
90
01
Parallel Corpora for Contrastive and Translation Studies
New resources and applications
01
scl.90
01
https://benjamins.com
02
https://benjamins.com/catalog/scl.90
1
B01
Irene Doval
Doval, Irene
Irene
Doval
University of Santiago de Compostela
2
B01
M. Teresa Sánchez Nieto
Sánchez Nieto, M. Teresa
M. Teresa
Sánchez Nieto
University of Valladolid
01
eng
311
ix
301
LAN009000
v.2006
CFX
2
24
JB Subject Scheme
LIN.COMP
Comparative linguistics
24
JB Subject Scheme
LIN.COMPUT
Computational & corpus linguistics
24
JB Subject Scheme
LIN.CORP
Corpus linguistics
24
JB Subject Scheme
TRAN.TRANSL
Translation Studies
06
01
This volume assesses the state of the art of parallel corpus research as a whole, reporting on advances both in recent developments of parallel corpora – with some particular references to comparable corpora as well – and in ways of exploiting them for a variety of purposes. The first part of the book is devoted to new roles that parallel corpora can and should assume in translation studies and in contrastive linguistics, to the usefulness and usability of parallel corpora, and to advances in parallel corpus alignment, annotation and retrieval. There follows an up-to-date presentation of a number of parallel corpus projects currently being carried out in Europe, some of them multimodal, with certain chapters illustrating case studies developed on the basis of the corpora at hand. In most of these chapters, attention is paid to specific technical issues of corpus building. The third part of the book reflects on specific applications and on the creation of bilingual resources from parallel corpora. This volume will be welcomed by scholars, postgraduate and PhD students in the fields of contrastive linguistics, translation studies, lexicography, language teaching and learning, machine translation, and natural language processing.
04
09
01
https://benjamins.com/covers/475/scl.90.png
04
03
01
https://benjamins.com/covers/475_jpg/9789027202345.jpg
04
03
01
https://benjamins.com/covers/475_tif/9789027202345.tif
06
09
01
https://benjamins.com/covers/1200_front/scl.90.hb.png
07
09
01
https://benjamins.com/covers/125/scl.90.png
25
09
01
https://benjamins.com/covers/1200_back/scl.90.hb.png
27
09
01
https://benjamins.com/covers/3d_web/scl.90.hb.png
10
01
JB code
scl.90.ack
Miscellaneous
1
01
Acknowledgments
10
01
JB code
scl.90.01dov
1
15
15
Chapter
2
01
Parallel corpora in focus
An account of current achievements and challenges
1
A01
Irene Doval
Doval, Irene
Irene
Doval
2
A01
M. Teresa Sánchez Nieto
Sánchez Nieto, M. Teresa
M. Teresa
Sánchez Nieto
10
01
JB code
scl.90.p1
17
90
74
Section header
3
01
Part I. Parallel corpora
Background and processing
10
01
JB code
scl.90.02har
19
38
20
Chapter
4
01
Comparable parallel corpora
A critical review of current practices in corpus-based translation studies
1
A01
Lidun Hareide
Hareide, Lidun
Lidun
Hareide
Møreforsking Molde Norway
20
comparable parallel corpora
20
the Gravitational Pull Hypothesis
20
unique items
01
Are papers presented in corpus-based translation studies truly scientific? These are normally done on only one language pair, often on purpose-made parallel corpora, and can normally not be replicated. Therefore their value is limited in a strictly scientific sense. The use of comparable parallel corpora allows both for the replication of studies, and the testing of complex hypotheses like Halverson’s Gravitational Pull hypothesis. This chapter defines and discusses the concept of comparable parallel corpora, and exemplifies their value by illustrating their use. The chapter also presents hopes for the future, as new groundbreaking technology that will allow the linguist to create her own parallel corpora without the aid of computer scientists is currently being launched at the University of León in Spain.
10
01
JB code
scl.90.03mar
39
56
18
Chapter
5
01
Living with parallel corpora
The potentials and limitations of their use in translation research
1
A01
Josep Marco
Marco, Josep
Josep
Marco
University Jaume I
20
comparable corpora
20
COVALT
20
main source of data
20
parallel corpora
20
supplementary source of data
01
Parallel corpora can be used in translation research in at least two ways: as the main source of data or as a supplement to data retrieved from a comparable corpus, enabling data triangulation. In the former scenario, they may throw light on contrastive aspects or on translator techniques and methods. In the latter they will tend to be searched to account for differences perceived between the two components of a comparable corpus. Two case studies will be put forward in order to illustrate these two uses of parallel corpora. Both draw on the English-Catalan subcorpus of COVALT (Valencian Corpus of Translated Literature). The first analyses the translation of meal names whereas the second focuses on the -ment adverb + adjective construction.
10
01
JB code
scl.90.04rab
57
78
22
Chapter
6
01
Working with parallel corpora
Usefulness and usability
1
A01
Rosa Rabadán
Rabadán, Rosa
Rosa
Rabadán
University of León
20
parallel corpora applications
20
parallel corpora reusability
20
parallel corpora uses
01
Although parallel corpora are vital for cross-linguistic and natural language processing (NLP) research, most have been designed for just one particular purpose, which may unnecessarily restrict their usefulness and usability. My argument is that the usefulness of existing parallel corpora increases exponentially when data so obtained are combined with those yielded by comparable and/or monolingual corpora. Usability criteria such as the choice of processing tools and adherence to international standards, among others, also have an impact on corpus usefulness. This chapter proposes courses of action that serve to improve the recycling and reprocessing of available resources. It also presents a corpus-based, post-editing and quality assessment application as an illustration of the multifarious uses parallel corpora may serve.
10
01
JB code
scl.90.05vol
79
90
12
Chapter
7
01
Innovations in parallel corpus alignment and retrieval
1
A01
Martin Volk
Volk, Martin
Martin
Volk
University of Zurich
20
corpus annotation
20
corpus retrieval
20
multiparallel corpora
20
word alignment
01
In this chapter, we give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as Part-of-Speech tagging, lemmatization and dependency parsing, but we also introduce language-specific methods, for example for dealing with split verbs or truncated compounds in German. Our corpus annotation includes the identification of code-switching within sentences as a special case of language identification. We argue for careful sentence and word alignment for parallel corpora. And we explain how word alignment is the basis for a wide range of applications from translation variant ranking to lemma disambiguation.
10
01
JB code
scl.90.p2
93
247
155
Section header
8
01
Part II. Parallel corpora
Creation, annotation and access
10
01
JB code
scl.90.06cer
93
101
9
Chapter
9
01
InterCorp
A parallel corpus of 40 languages
1
A01
Petr Čermák
Čermák, Petr
Petr
Čermák
Charles University Prague
20
comparison of languages
20
Czech National Corpus
20
InterCorp
20
parallel corpus
20
Spanish
01
This chapter presents the current version of InterCorp, a parallel corpus created at the Faculty of Arts, Charles University in Prague. The corpus contains texts in Czech aligned with one or more foreign-language version(s), including Czech and 39 other languages. The chapter analyses its structure and technical parameters, and describes some technical tools used with the corpus (Kontext, a corpus query interface, and InterText, a parallel text alignment editor created specifically for the project). Similarly, the contribution discusses Treq (Translation Equivalents Database), a collection of bilingual Czech-foreign language dictionaries built automatically from InterCorp. In the last section of the chapter, the possibilities for methodological and linguistic exploitation of the corpus are discussed.
10
01
JB code
scl.90.07dov
103
121
19
Chapter
10
01
Corpus PaGeS
A multifunctional resource for language learning, translation and cross-linguistic research
1
A01
Irene Doval
Doval, Irene
Irene
Doval
University of Santiago de Compostela
2
A01
Santiago Fernández Lanza
Fernández Lanza, Santiago
Santiago
Fernández Lanza
University of Santiago de Compostela
3
A01
Tomás Jiménez Juliá
Jiménez Juliá, Tomás
Tomás
Jiménez Juliá
University of Santiago de Compostela
4
A01
Elsa Liste Lamas
Liste Lamas, Elsa
Elsa
Liste Lamas
University of Santiago de Compostela
5
A01
Barbara Lübke
Lübke, Barbara
Barbara
Lübke
University of Santiago de Compostela
20
corpus alignment
20
corpus visualization
20
German
20
parallel corpora
20
Spanish
01
This chapter presents the bilingual parallel corpus PaGeS, compiled by the research group SpatiAlEs from the University of Santiago de Compostela. PaGeS currently amounts to nearly 20 million tokens and consists of texts originally written in German and in Spanish and their corresponding translations into the other language, as well as a small portion of German and Spanish translations from third languages. The present contribution introduces the main characteristics of the PaGeS corpus, focusing on its design and compilation. It first explains the criteria for the selection of the texts and the details of text pre-processing, automatic alignment and manual review. It then addresses the search and display features, describing the server architecture and indexing process. Finally, the intended development of the PaGeS corpus is briefly discussed.
10
01
JB code
scl.90.08fer
123
139
17
Chapter
11
01
Building EPTIC
A many-sided, multi-purpose corpus of EU parliament proceedings
1
A01
Adriano Ferraresi
Ferraresi, Adriano
Adriano
Ferraresi
University of Bologna
2
A01
Silvia Bernardini
Bernardini, Silvia
Silvia
Bernardini
University of Bologna
20
corpus annotation
20
intermodal corpora
20
loan words
20
text-to-text alignment
20
text-to-video alignment
01
This chapter describes the steps involved in the construction of EPTIC, an intermodal corpus of European Parliament speeches. Despite its limited size, this corpus has features that justify its labour-intensive building process, in particular its multiple alignments. The text-to-text alignments allow users to compare interpretations and translations of source speeches and their written-up reports, while text-to-video alignments allow them to access the multimedia components from concordance lines. To illustrate the potential of EPTIC, a case study is presented of English loan words in original, translated and interpreted Italian and French. Results suggest that borrowing is more likely to occur in translated Italian than in any of the other corpus components.
10
01
JB code
scl.90.09gom
141
158
18
Chapter
12
01
Enriching parallel corpora with multimedia and lexical semantics
From the CLUVI Corpus to WordNet and SemCor
1
A01
Xavier Gomez Guinovart
Gomez Guinovart, Xavier
Xavier
Gomez Guinovart
University of Vigo
20
lexical semantics
20
multimedia
20
parallel corpora
20
SemCor
20
WordNet
01
In this chapter, I present the main characteristics of the CLUVI Corpus, an open collection of sentence-level aligned parallel corpora with over 44 million words in nine specialised domains (fiction, computing, popular science, biblical texts, law, consumer information, economy, tourism, and film subtitling) and different language combinations including Galician, Spanish, English, French, Portuguese, Catalan, Italian, Basque and Latin. Then, I present the methodology developed for extending the film subtitles section of the CLUVI Corpus with multimedia data. Finally, I discuss the resources and methods used to build the SensoGal Corpus, a SemCor-based English-Galician parallel corpus semantically annotated based on WordNet and aligned at the sentence and word levels.
10
01
JB code
scl.90.10lop
159
182
24
Chapter
13
01
Discourse annotation in the MULTINOT corpus
Issues and challenges
1
A01
Julia Lavid-López
Lavid-López, Julia
Julia
Lavid-López
Complutense University of Madrid
20
annotation
20
corpus
20
discourse
20
English
20
Spanish
01
This chapter summarises and discusses recent work on the development of a bilingual (English-Spanish) corpus consisting of original comparable and parallel texts from a variety of genres and annotated with complex linguistic features such as modality and evidentiality, metadiscourse markers, and thematization, as carried out within the framework of the MULTINOT project. The annotation of these complex features in bilingual parallel texts poses important challenges for the researcher at the different stages of the corpus development, from the preprocessing phases to the manual annotation phase, but, at the same time, it allows the investigation of complex linguistic research questions which could not be addressed on the basis of raw corpora or even with the help of an automatic part-of-speech tagging system.
10
01
JB code
scl.90.11mik
183
195
13
Chapter
14
01
PEST
A parallel electronic corpus of state treaties
1
A01
Mikhail Mikhailov
Mikhailov, Mikhail
Mikhail
Mikhailov
University of Tampere, Finland
2
A01
Miia Santalahti
Santalahti, Miia
Miia
Santalahti
University of Tampere
3
A01
Julia Souma
Souma, Julia
Julia
Souma
University of Tampere
20
balanced corpus
20
compiling parallel corpora
20
language of state treaties
20
legal language
01
This chapter introduces the Parallel Electronic corpus of State Treaties (PEST). The current plan is to compile a parallel corpus, which will include treaties concluded between Russia and Finland, Finland and Sweden, and Sweden and Russia. In addition, there will be a subcorpus of international conventions in all three languages plus English, to be used as reference data. The chapter describes the structure of the subcorpora (number of documents, their chronological distribution and topics featured), and it also addresses the challenges of balancing such a corpus. In the future, this material can be used for studies ranging from lexicon and semantics to grammar, style, discourse, translation studies, and language for special purposes.
10
01
JB code
scl.90.12mol
197
214
18
Chapter
15
01
Indexation and analysis of a parallel corpus using CQPweb
The COVALT PAR_ES Corpus (EN/FR/DE > ES)
1
A01
Teresa Molés-Cases
Molés-Cases, Teresa
Teresa
Molés-Cases
Universitat Politècnica de València, Universitat Jaume I
2
A01
Ulrike Oster
Oster, Ulrike
Ulrike
Oster
Universitat Politècnica de València, Universitat Jaume I
20
corpus compilation
20
corpus indexation
20
COVALT corpus
20
CQPweb
01
This contribution presents a section of the Corpus Valencià de Literatura Traduïda (COVALT), created by the research group of the same name (Department of Translation and Communication, Universitat Jaume I, Spain). The COVALT corpus is a four-million-word corpus made up of narrative works originally written in English, French, and German and their Catalan translations published in the autonomous community of Valencia between 1990 and 2000. Since the members of the Covalt group are interested in translation research, and more specifically in the investigation of translated Catalan and Spanish, this corpus has recently been extended to include translations into Spanish published in Spain (COVALT PAR_ES corpus). This chapter presents the COVALT PAR_ES corpus, as well as its process of compilation and analysis with CQPweb.
10
01
JB code
scl.90.13san
215
231
17
Chapter
16
01
P-ACTRES 2.0
A parallel corpus for cross-linguistic research
1
A01
Hugo Sanjurjo-González
Sanjurjo-González, Hugo
Hugo
Sanjurjo-González
University of Huddersfield
2
A01
Marlén Izquierdo
Izquierdo, Marlén
Marlén
Izquierdo
University of the Basque Country (UPV/EHU)
20
(parallel) corpus compilation
20
ACTRES
20
corpus analysis software
20
web interface
01
This chapter describes an updated version of the ACTRES Parallel Corpus (P-ACTRES 2.0), an English-Spanish bidirectional corpus that contains over 4 million words. The chapter details the composition of the corpus: the number of words in each direction, the types of texts included, and the linguistic variants that users will find in the corpus. Its composition is shaped by research purposes as well as availability issues. The computerization process is also explained, with comments on text processing, alignment and tagging. The chapter concludes with a brief demonstration of the usefulness and usability of P-ACTRES 2.0 in cross-linguistic research, be it contrastive linguistics or translation studies, either independently or, most importantly, jointly.
10
01
JB code
scl.90.14san
233
247
15
Chapter
17
01
An overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus
An
overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus
1
A01
Zuriñe Sanz-Villar
Sanz-Villar, Zuriñe
Zuriñe
Sanz-Villar
University of the Basque Country (UPV/EHU)
20
Aleuska corpus
20
Basque corpora
20
Basque MWEs
20
TAligner
01
Since the 1980s, considerable efforts have been made to create different types of Basque corpora. However, to systematically analyse the Basque translations of German literary texts, it was necessary to create a corpus from the ground up. Intermediary versions were included in this corpus whenever the Basque target text was not a translation from the German original but came instead from a translation into another language (Spanish in most cases). A tool called TAligner was used to align the bitexts and the tritexts. The aim of this chapter is, firstly, to provide the reader with an overview of the main Basque corpora. Secondly, I will describe the design and compilation process of a parallel and multilingual corpus using TAligner 3.0. Thirdly, I will present how the corpus has been lemmatized and annotated at the level of part-of-speech. Finally, the process of extracting potential Basque multi-word expressions will be shown.
10
01
JB code
scl.90.p3
249
298
50
Section header
18
01
Part III. Parallel corpora
Tools and applications
10
01
JB code
scl.90.15gam
251
265
15
Chapter
19
01
Strategies for building high quality bilingual lexicons from comparable corpora
1
A01
Pablo Gamallo Otero
Gamallo Otero, Pablo
Pablo
Gamallo Otero
University of Santiago de Compostela
20
bilingual lexicons
20
cognates
20
comparable corpora
20
distributional similarity
20
extraction of translation candidates
01
This chapter outlines two strategies to automatically build bilingual dictionaries: one is based on the use of a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies have in common the use of translation equivalents extracted from comparable corpora to filter out odd bilingual pairs and validate the correct ones. The correctness of the entries validated with comparable corpora is very high, close to that achieved by using parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries with more than 90% precision. This outperforms state-of-the-art systems on bilingual extraction from comparable corpora, whose best scores hardly reach 70 or 80%.
10
01
JB code
scl.90.16gon
267
279
13
Chapter
20
01
Discovering bilingual collocations in parallel corpora
A first attempt at using distributional semantics
1
A01
Marcos Garcia
Garcia, Marcos
Marcos
Garcia
University of A Coruña
2
A01
Marcos García-Salido
García-Salido, Marcos
Marcos
García-Salido
University of A Coruña
3
A01
Margarita Alonso-Ramos
Alonso-Ramos, Margarita
Margarita
Alonso-Ramos
University of A Coruña
20
collocations
20
distributional semantics
20
parallel corpora
20
phraseology
01
This chapter presents a method that exploits parallel corpora to automatically extract bilingual collocation equivalents. First, we use dependency parsing and statistical measures to identify collocation candidates in corpora. Then, we leverage the parallel corpora to extract bilingual word embeddings. Finally, we use these distributional models as probabilistic dictionaries in order to identify bilingual collocation equivalents. To evaluate our strategy, we carry out a set of experiments in Portuguese and Spanish focusing on verb-object collocations, for example, “reach the maturity” (“atingir a maturidade” in Portuguese, “alcanzar la madurez” in Spanish). The results of our experiments show that this method is useful to automatically identify thousands of bilingual collocation equivalents, achieving a precision of 86%.
10
01
JB code
scl.90.17gho
281
298
18
Chapter
21
01
Normalization of shorthand forms in French text messages using word embedding and machine translation
1
A01
Parijat Ghoshal
Ghoshal, Parijat
Parijat
Ghoshal
Neue Zürcher Zeitung, KOF Swiss Economic Institute
2
A01
Xi Rao
Rao, Xi
Xi
Rao
Neue Zürcher Zeitung, KOF Swiss Economic Institute
20
abbreviation/shorthand form normalization
20
character-based machine translation
20
deep learning
20
distributional semantics
20
French
20
Multivec
20
neural networks
20
parallel corpus
20
SMS
20
unsupervised learning
20
word embeddings
01
This chapter focuses on the normalization of abbreviations and shorthand forms used in French text messages. These forms are difficult to normalize, as they mostly cannot be resolved by typical spell checkers and dictionary lookups. First, we aligned normalized and non-normalized French text messages and built a parallel corpus. We then applied two popular approaches for text normalization, namely multilingual word embeddings and character-based machine translation. We compare our results and observe the efficacy of our models in normalizing deletions, substitutions, repetitions, swaps, and insertions made to canonical forms. This is the first paper that uses Multivec and the Belgian SMS corpus collected under the SMS4Science Project. The unsupervised machine learning approach makes the system highly flexible and easily adaptable, and provides a domain-independent method of text normalization.
10
01
JB code
scl.90.ind
Miscellaneous
22
01
Index
02
JBENJAMINS
John Benjamins Publishing Company
01
John Benjamins Publishing Company
Amsterdam/Philadelphia
NL
04
20190320
2019
John Benjamins B.V.
02
WORLD
08
700
gr
01
JB
1
John Benjamins Publishing Company
+31 20 6304747
+31 20 6739773
bookorder@benjamins.nl
01
https://benjamins.com
01
WORLD
US CA MX
21
20190315
19
01
02
JB
1
00
99.00
EUR
R
02
02
JB
1
00
104.94
EUR
R
01
JB
10
bebc
+44 1202 712 934
+44 1202 712 913
sales@bebc.co.uk
03
GB
21
20190315
02
02
JB
1
00
83.00
GBP
Z
01
JB
2
John Benjamins North America
+1 800 562-5666
+1 703 661-1501
benjamins@presswarehouse.com
01
https://benjamins.com
01
US CA MX
21
20190315
1
01
gen
02
JB
1
00
149.00
USD