Special issue articles
The challenges and benefits of annotating oral bilingual corpora
The Spanish in Texas Corpus Project
This article describes efforts to collect, process, and automatically annotate a corpus of Spanish as spoken in Texas. It elaborates the protocols for the development of the corpus and the procedures for automatic annotation, illustrating the common pitfalls of language identification in bilingual corpora and potential methods for circumventing them. The benefits of a comparative corpus approach to contact varieties are illustrated by a case study of a putative verbal calque from the Spanish in Texas data. It is demonstrated that the relative frequency of the verb is much higher than in its source Mexican variety and that the verb selects different complements in Texas than it does in other varieties. The article concludes with a discussion of how computational tools might be fruitfully exploited to resolve long-standing debates about language variation in contact settings.
Article outline
- 1. Introduction
- 2. The Spanish in Texas Corpus
- 2.1 Protocols for developing the Corpus
- 2.2 Procedures for annotation
- 3. The benefits of a corpus approach to contact phenomena
- 4. Case Study: Is an innovation contact induced or internally motivated?
- 4.1 Detection of the potentially innovative uses of the verb
- 4.2 Possibilities and limitations of a computational approach to calques
- 5. Discussion and conclusion
- Notes
- References
Cited by
Cited by 1 other publication
Parra, María Luisa & Ellen J. Serafini. 2021. “Bienvenidxs todes”: el lenguaje inclusivo desde una perspectiva crítica para las clases de español. Journal of Spanish Language Teaching 8(2). 143 ff.

This list is based on CrossRef data as of 30 August 2023. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.