Chapter published in:
Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 57–78
Working with parallel corpora
Usefulness and usability
Rosa Rabadán | University of León
Although parallel corpora are vital for cross-linguistic and natural language processing (NLP) research, most have been designed for a single purpose, which may unnecessarily restrict their usefulness and usability. My argument is that existing parallel corpora become far more useful when the data they yield are combined with data from comparable and/or monolingual corpora. Usability criteria, such as the choice of processing tools and adherence to international standards, also affect corpus usefulness. This chapter proposes courses of action that make it easier to recycle and reprocess available resources. It also presents a corpus-based post-editing and quality assessment application as an illustration of the many uses parallel corpora may serve.
Article outline
- 1. Introduction
- 2. Concepts
- 3. Resources
- 4. Uses of parallel corpora
- 5. Needs analysis
- 6. Parallel corpora: Building or using
- 7. Applications
- 8. Useful strategies
- 9. Conclusions
- Acknowledgment
- Notes
- References
Published online: 20 March 2019
https://doi.org/10.1075/scl.90.04rab