Statistical features, semantic profiles, word spectra: Quantitative analysis of bibliographic corpora

Pawłowski, Adam; Topolski, Krzysztof; Herden, Elżbieta

doi:10.1075/cilt.356.16paw

Part of

Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 239–256

Adam Pawłowski | University of Wrocław

Krzysztof Topolski | University of Wrocław

Elżbieta Herden | University of Wrocław

This chapter analyzes a bibliographic corpus built from Polish national bibliography data covering the period 1801–2019. We identify the quantitative characteristics of the bibliographic corpus and compare them with those of a reference corpus of general language. The two corpora differ significantly, in particular in the shares of individual parts of speech and in the frequency distribution of lexemes. We also studied the statistical distributions of word spectra: the best fit was obtained with the generalized inverse Gauss-Poisson and Zipf-Mandelbrot distributions, and the fitted parameters of both distributions likewise differ between the bibliographic and reference corpora. Beyond quantitative linguistics, the most promising directions for future research on bibliographic corpora are semantic analysis and text mining.
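To make the notion of a word spectrum concrete, the following Python sketch counts how many word types occur exactly m times in a token list, and evaluates the classic rank-frequency form of the Zipf-Mandelbrot law, p(r) proportional to 1 / (r + b)^a. It is an illustration only and not the authors' fitting procedure: the toy titles, the function names and the parameter values a and b are assumptions made for this example.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency m to V_m, the number of distinct types occurring exactly m times."""
    type_freqs = Counter(tokens)                 # type -> number of occurrences
    spectrum = Counter(type_freqs.values())      # frequency m -> number of types
    return dict(sorted(spectrum.items()))


def zipf_mandelbrot_probs(n_ranks, a, b):
    """Normalized Zipf-Mandelbrot probabilities for ranks 1..n_ranks."""
    weights = [1.0 / (r + b) ** a for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy book titles, invented for illustration only (not data from the chapter).
titles = ["history of poland", "history of art", "poems"]
tokens = [word for title in titles for word in title.split()]

spectrum = frequency_spectrum(tokens)

# Hypothetical parameter values; real values would come from fitting a model to data.
probs = zipf_mandelbrot_probs(n_ranks=5, a=1.0, b=2.0)

print(spectrum)
print([round(p, 3) for p in probs])
```

In practice the observed spectrum would be compared with the spectrum expected under a fitted model, and the quality of the fit assessed with a goodness-of-fit statistic.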

Keywords: quantitative linguistics, corpus linguistics, word spectra, statistical distributions, MARC, bibliography, Polish, book titles

Article outline

1. Large-scale bibliographies as text corpora
2. Data and hypotheses
3. Research method
4. Results: An overview
5. Results: Statistical distributions
6. Conclusions
Notes
Sources
References

Published online: 22 December 2021
