Working with parallel corpora
Usefulness and usability
Although parallel corpora are vital for cross-linguistic and natural language processing (NLP) research, most have been designed for a single purpose, which may unnecessarily restrict their usefulness and usability. I argue that the usefulness of existing parallel corpora increases considerably when the data they yield are combined with data from comparable and/or monolingual corpora. Usability criteria, such as the choice of processing tools and adherence to international standards, also affect corpus usefulness. This chapter proposes courses of action to improve the recycling and reprocessing of available resources. It also presents a corpus-based post-editing and quality assessment application as an illustration of the many uses parallel corpora may serve.
Article outline
- 1. Introduction
- 2. Concepts
- 3. Resources
- 4. Uses of parallel corpora
- 5. Needs analysis
- 6. Parallel corpora: Building or using
- 7. Applications
- 8. Useful strategies
- 9. Conclusions
- Acknowledgment
- Notes
- References
