Corpus PaGeS
A multifunctional resource for language learning, translation and cross-linguistic research
This chapter presents the bilingual parallel corpus PaGeS, compiled by the SpatiAlEs research group at the University of Santiago de Compostela. PaGeS currently comprises nearly 20 million tokens and consists of texts originally written in German and in Spanish together with their corresponding translations into the other language, as well as a small portion of German and Spanish translations from third languages. The chapter introduces the main characteristics of the PaGeS corpus, focusing on its design and compilation. It first explains the criteria for text selection and the details of text pre-processing, automatic alignment and manual review. It then addresses the search and display features and describes the server architecture and the indexing process. Finally, the planned development of the PaGeS corpus is briefly discussed.
Article outline
- 1. Introduction
- 2. Components and content
- 3. Text preprocessing, textual mark-up and metadata
- 4. Alignment
- 5. Search and display features
- 6. Server architecture and publishing data
- 7. Summary and outlook
- Acknowledgement
- Notes
- References