Quantitative analysis of bibliographic corpora
Statistical features, semantic profiles, word spectra
This chapter analyses a bibliographic corpus built from the Polish national bibliography for the period 1801–2019. We identify and compare quantitative characteristics of the bibliographic corpus and of the reference corpus of general language. The two corpora differ significantly, in particular in the shares of individual parts of speech and in the frequency distribution of lexemes. The statistical distributions of word spectra were also studied. The best fit was obtained for the generalized inverse Gauss-Poisson and Zipf-Mandelbrot distributions, and the fitted parameters of both distributions also differ between the bibliographic and reference corpora. Beyond quantitative linguistics, the most promising directions for future research on bibliographic corpora are semantic analysis and text mining.
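As an illustration of the two objects named above, the sketch below computes a word frequency spectrum (the number of lexemes occurring exactly m times) and evaluates a Zipf-Mandelbrot probability function over ranks. It is a minimal sketch, not the chapter's procedure: the function names, the toy token list, and the parameter values a = 1.0 and b = 2.7 are assumptions chosen for illustration, and normalizing over a finite number of ranks is only one common convention.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency m to V_m, the number of distinct types seen exactly m times."""
    type_freqs = Counter(tokens)                 # type -> number of occurrences
    spectrum = Counter(type_freqs.values())      # occurrences m -> number of types V_m
    return dict(sorted(spectrum.items()))


def zipf_mandelbrot_pmf(n_ranks, a, b):
    """Probabilities p(r) proportional to 1 / (r + b)**a for ranks r = 1..n_ranks."""
    weights = [1.0 / (r + b) ** a for r in range(1, n_ranks + 1)]
    total = sum(weights)                         # normalizing constant over the finite ranks
    return [w / total for w in weights]


# Hypothetical toy input; a real study would use lemmatized corpus tokens.
tokens = "historia literatury polskiej historia prasy polskiej".split()
spectrum = frequency_spectrum(tokens)
probs = zipf_mandelbrot_pmf(n_ranks=len(set(tokens)), a=1.0, b=2.7)

print(spectrum)
print([round(p, 3) for p in probs])
```

Fitting the parameters to an observed spectrum and testing goodness of fit are separate steps that this sketch does not attempt.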
Article outline
- 1. Large-scale bibliographies as text corpora
- 2. Data and hypotheses
- 3. Research method
- 4. Results: An overview
- 5. Results: Statistical distributions
- 6. Conclusions
- Notes
- Sources
- References
References
Sources
WCRFT2 morphosyntactic tagger
Baayen, R. Harald
2001 Word frequency distributions. Dordrecht: Kluwer.


CHBB
1999–2019 The Cambridge history of the book in Britain. Volumes 1–7. Cambridge: Cambridge University Press.

Cressie, Noel & Timothy R. C. Read
1984 Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society. Series B (Methodological) 46(3). 440–464. [URL].

Evert, Stefan & Marco Baroni
2007 ZipfR: Word frequency distributions in R. In Sophia Ananiadou (ed.), Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Session, 29–32. Prague: Association for Computational Linguistics. [URL] (9 September 2020.)

Febvre, Lucien & Henri-Jean Martin
1958 L’apparition du livre. Paris: Albin Michel.

Green, Jonathan, Frank McIntyre & Paul Needham
2011 The shape of incunable survival and statistical estimation of lost editions. The Papers of the Bibliographical Society of America 105(2). 141–175.


Grotjahn, Rüdiger & Gabriel Altmann
1993 Modelling the distribution of word length: Some methodological problems. In Reinhard Köhler & Burghard B. Rieger (eds.), Contributions to quantitative linguistics, 141–153. Dordrecht: Kluwer.


Lahti, Leo, Jani Marjanen, Hege Roivainen & Mikko Tolonen
2019 Bibliographic data science and the history of the book (c. 1500–1800). Cataloguing & Classification Quarterly 57(1). 5–23.


Mačutek, Ján & Gejza Wimmer
2013 Evaluating goodness-of-fit of discrete distribution models in quantitative linguistics. Journal of Quantitative Linguistics 20(3). 227–240.


Mandelbrot, Benoît
1962 On the theory of word frequencies and on related Markovian models of discourse. In Roman Jakobson (ed.), Structure of Language and its Mathematical Aspects (Proceedings of Symposia in Applied Mathematics 12), 190–219. Providence, RI: AMS.

Schwetschke, Gustav
1850 Codex nundinarius Germaniae literatae bisecularis. Teil: 1564 – 1765. Halle: G. Schwetschke’s Verlags-Handlung und Buchdruckerei. [URL] (9 September 2020.)

Schwetschke, Gustav
1877 Codex nundinarius Germaniae literatae bisecularis. Teil: Forts. 1766 bis 1846. Halle: G. Schwetschke’s Verlags-Handlung und Buchdruckerei. [URL] (9 September 2020.)

Sichel, Herbert S.
1975 On a distribution law for word frequencies. Journal of the American Statistical Association 70. 542–547.


Sichel, Herbert S.
1982 Asymptotic efficiency of the three methods of estimation for the inverse Gaussian-Poisson distribution. Biometrika 69. 467–472.


Tolonen, Mikko, Jani Marjanen, Hege Roivainen & Leo Lahti
2019a Quantitative approach to book-printing in Sweden and Finland, 1640–1828. Historical Methods: A Journal of Quantitative and Interdisciplinary History 52(1). 57–78.


Tolonen, Mikko, Jani Marjanen, Hege Roivainen & Leo Lahti
2019b Scaling up bibliographic data science. In Costanza Navarretta, Manex Agirrezabal & Bente Maegaard (eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, 450–456. Copenhagen: University of Copenhagen. [URL] (9 September 2020.)