Article published in:
Register Studies: Online-First Articles
Variation-Based Distance and Similarity Modeling
A new way of measuring distances between registers
We present a corpus-based method — Variation-Based Distance and Similarity Modeling (VADIS) — that calculates
distances between registers as a function of the extent to which the probabilistic conditioning of variation differs across
registers. When language users have a choice between different ways of saying similar things (e.g., "cut off the
tops" versus "cut the tops off"), to what extent are these choices regulated differently
in different registers? In this spirit, we re-analyze pre-existing datasets that cover the genitive, dative, and particle
placement alternations in the grammar of English. These datasets cover five broad register categories: spoken informal English,
spoken formal English, written informal English, written formal English, and online/web-based English. Analysis shows that (a) the
registers under analysis are relatively but not entirely homogeneous in terms of the probabilistic grammars conditioning
grammatical choices, and (b) more often than not we see a split between spoken and written registers.
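As a rough illustration of the general idea of a variation-based distance, the sketch below compares registers by how differently the same constraints are weighted in each. It is not the article's seven-step procedure: the register labels, the three constraints, the coefficient values, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from itertools import combinations

# Hypothetical effect sizes (e.g., regression coefficients) for the same three
# constraints on one alternation, estimated separately per register.
# All numbers are invented for illustration; they are not from the article.
coefficients = {
    "spoken_informal": [1.2, -0.8, 0.5],
    "spoken_formal": [1.0, -0.7, 0.6],
    "written_formal": [0.3, -1.9, 1.4],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length coefficient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(coefs):
    """Distance for every unordered pair of registers, keyed by a sorted name pair."""
    return {
        (r1, r2): euclidean(coefs[r1], coefs[r2])
        for r1, r2 in combinations(sorted(coefs), 2)
    }


distances = pairwise_distances(coefficients)

# Print pairs from most similar to least similar.
for (r1, r2), d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{r1} vs {r2}: {d:.2f}")
```

With these made-up values the two spoken registers come out closest to each other, which mirrors the kind of spoken/written split the abstract reports but says nothing about the actual data.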
Article outline
- 1. Introduction
- 2. Methods and data
- 2.1 Datasets and registers under study
- 2.2 Method: Variation-Based Distance and Similarity Modeling (VADIS)
- Step 1
- Step 2
- Step 3
- Step 4
- Step 5
- Step 6
- Step 7
- 2.3 Questions and answers about our methodology
- 3. Results
- 3.1 Quantification via similarity coefficients
- 3.2 Mapping out (dis)similarity relationship between registers
- 4. Discussion and conclusion
- Acknowledgements
- Notes
- References
Available under the Creative Commons Attribution (CC BY) 4.0 license.
For any use beyond this license, please contact the publisher at [email protected].
Published online: 11 October 2024
https://doi.org/10.1075/rs.23011.zha