The Sociolinguistic Speech Corpus of Chilean Spanish (COSCACH): A socially stratified text, audio and video corpus with multiple speech styles

Sadowsky, Scott

doi:10.1075/ijcl.19103.sad

Article published in:

International Journal of Corpus Linguistics
Vol. 27:1 (2022), pp. 93–125

Scott Sadowsky | Catholic University of Chile | Max Planck Institute for the Science of Human History

This paper presents the Sociolinguistic Speech Corpus of Chilean Spanish (COSCACH) v1.0, a 9.3-million-word corpus containing transcribed, lemmatized and morphologically tagged text, audio recordings and videos from 1,237 L1 speakers of Chilean Spanish, as well as a control sample of 21 non-Chilean L1 Spanish speakers. The COSCACH is the first freely available corpus of spoken Chilean Spanish of substantial size, as well as one of the largest speech corpora of any variety of Spanish. Following a review of other Chilean speech corpora, I describe how the COSCACH was constructed, covering corpus design, speaker recruitment and metadata collection, speech elicitation and recording, transcription, lemmatization and morphological tagging, and corpus compilation. I thereby aim to provide a blueprint for creating modern, large-scale speech corpora suitable for phonetic, sociophonetic and sociolinguistic research, in addition to traditional inquiry into semantics, lexis, grammar, pragmatics and discourse.

Keywords: speech corpora, Chilean Spanish, corpus design and construction, phonetics, sociolinguistics

Article outline

1. Introduction
2. Other Chilean Spanish speech corpora
- 2.1 ESECH and PRESEEA-SA
- 2.2 King-ASR-290
- 2.3 Additional speech corpora
- 2.4 Justification for the COSCACH
3. Corpus design and speaker sampling
- 3.1 Chilean speaker samples
  - 3.1.1 Speaker inclusion variables
    - A. Locality
    - B. Ethnicity
    - C. Lingualism
    - D. Age/Generation
    - E. Sex
    - F. Socioeconomic status
    - G. Year of recording
  - 3.1.2 Derived variables
    - A. Region
    - B. Urbanness
    - C. Locality size
    - D. Distance and travel time from Santiago
- 3.2 Non-Chilean control sample
4. Data collection
- 4.1 Timeframe
- 4.2 Fieldworkers
- 4.3 Speaker recruitment
  - 4.3.1 Recruitment procedures
  - 4.3.2 Informed consent and institutional review board (IRB) approval
  - 4.3.3 Further criteria for exclusion of speakers
- 4.4 Socio-demographic questionnaire
- 4.5 Elicitation tasks
  - 4.5.1 Sustained pronunciation of isolated vowels
  - 4.5.2 Reading of minimal pairs or other word lists
  - 4.5.3 Reading of meaningful sentences
  - 4.5.4 Reading of meaningful texts
  - 4.5.5 Conversational interview
  - 4.5.6 Language attitudes interview
- 4.6 Recording
  - 4.6.1 Audio equipment and configuration
  - 4.6.2 Audio post-processing
  - 4.6.3 Video recording equipment and procedures
5. Transcription, text processing and corpus compilation
- 5.1 Transcription
- 5.2 Anonymization and protection of speakers’ privacy
- 5.3 Text extraction and annotation
- 5.4 Corpus compilation and use
6. Availability and access
7. Conclusions and future directions
Acknowledgements
Notes
References

Published online: 31 January 2022

https://doi.org/10.1075/ijcl.19103.sad
