Chapter published in:
Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 141–158
Enriching parallel corpora with multimedia and lexical semantics
From the CLUVI Corpus to WordNet and SemCor
Xavier Gómez Guinovart | University of Vigo
In this chapter, I present the main characteristics of the CLUVI Corpus, an open collection of sentence-level aligned parallel corpora with over 44 million words in nine specialised domains (fiction, computing, popular science, biblical texts, law, consumer information, economy, tourism, and film subtitling) and different language combinations including Galician, Spanish, English, French, Portuguese, Catalan, Italian, Basque and Latin. Then, I present the methodology developed for extending the film subtitles section of the CLUVI Corpus with multimedia data. Finally, I discuss the resources and methods used to build the SensoGal Corpus, a SemCor-based English-Galician parallel corpus that is semantically annotated with WordNet senses and aligned at the sentence and word levels.
Keywords: parallel corpora, multimedia, lexical semantics, WordNet, SemCor
Article outline
- 1. Introduction
- 2. The CLUVI Corpus
- 2.1 Corpus description
- 2.2 Tagging the CLUVI Corpus
- 2.3 Extending the CLUVI Corpus with multimedia data
- 3. The SensoGal Corpus
- 4. Conclusion
- Notes
- References
Published online: 20 March 2019
https://doi.org/10.1075/scl.90.09gom