The Sociolinguistic Speech Corpus of Chilean Spanish (COSCACH)
A socially stratified text, audio and video corpus with multiple speech styles
This paper presents the Sociolinguistic Speech Corpus of Chilean Spanish (COSCACH) v1.0, a 9.3-million-word corpus
containing transcribed, lemmatized and morphologically tagged text, audio recordings and videos from 1,237 L1 speakers of Chilean
Spanish, as well as a control sample of 21 non-Chilean L1 Spanish speakers. The COSCACH is the first freely available corpus of
spoken Chilean Spanish of substantial size, as well as one of the largest speech corpora of any variety of Spanish. Following a
review of other Chilean speech corpora, I describe how the COSCACH was constructed, covering corpus design, speaker recruitment
and metadata collection, speech elicitation and recording, transcription, lemmatization and morphological tagging, and corpus
compilation. I thereby aim to provide a blueprint for creating modern, large-scale speech corpora suitable for phonetic,
sociophonetic and sociolinguistic research, in addition to traditional inquiry into semantics, lexis, grammar, pragmatics and
discourse.
Article outline
- 1. Introduction
- 2. Other Chilean Spanish speech corpora
  - 2.1 ESECH and PRESEEA-SA
  - 2.2 King-ASR-290
  - 2.3 Additional speech corpora
  - 2.4 Justification for the COSCACH
- 3. Corpus design and speaker sampling
  - 3.1 Chilean speaker samples
    - 3.1.1 Speaker inclusion variables
      - A. Locality
      - B. Ethnicity
      - C. Lingualism
      - D. Age/Generation
      - E. Sex
      - F. Socioeconomic status
      - G. Year of recording
    - 3.1.2 Derived variables
      - A. Region
      - B. Urbanness
      - C. Locality size
      - D. Distance and travel time from Santiago
  - 3.2 Non-Chilean control sample
- 4. Data collection
  - 4.1 Timeframe
  - 4.2 Fieldworkers
  - 4.3 Speaker recruitment
    - 4.3.1 Recruitment procedures
    - 4.3.2 Informed consent and institutional review board (IRB) approval
    - 4.3.3 Further criteria for exclusion of speakers
  - 4.4 Socio-demographic questionnaire
  - 4.5 Elicitation tasks
    - 4.5.1 Sustained pronunciation of isolated vowels
    - 4.5.2 Reading of minimal pairs or other word lists
    - 4.5.3 Reading of meaningful sentences
    - 4.5.4 Reading of meaningful texts
    - 4.5.5 Conversational interview
    - 4.5.6 Language attitudes interview
  - 4.6 Recording
    - 4.6.1 Audio equipment and configuration
    - 4.6.2 Audio post-processing
    - 4.6.3 Video recording equipment and procedures
- 5. Transcription, text processing and corpus compilation
  - 5.1 Transcription
  - 5.2 Anonymization and protection of speakers’ privacy
  - 5.3 Text extraction and annotation
  - 5.4 Corpus compilation and use
- 6. Availability and access
- 7. Conclusions and future directions
- Acknowledgements
- Notes
- References