Chapter published in:
Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 239–256
Quantitative analysis of bibliographic corpora
Statistical features, semantic profiles, word spectra
Adam Pawłowski | University of Wrocław
Krzysztof Topolski | University of Wrocław
Elżbieta Herden | University of Wrocław
This chapter analyses a bibliographic corpus built from the Polish national bibliography for the period 1801–2019. We identify the quantitative characteristics of the bibliographic corpus and compare them with those of a reference corpus of general language. The two corpora differ significantly, in particular in the share of individual parts of speech and in the frequency distribution of lexemes. We also studied the statistical distributions of word spectra. The best fit was obtained for the generalized inverse Gauss-Poisson and Zipf-Mandelbrot distributions, and the fitted parameters of both distributions also differ between the bibliographic and reference corpora. Beyond quantitative linguistics, the most promising directions for future research on bibliographic corpora are semantic analysis and text mining.
Keywords: quantitative linguistics, corpus linguistics, word spectra, statistical distributions, MARC, bibliography, Polish, book titles
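To make the notion of a word spectrum concrete, the following minimal Python sketch computes a frequency spectrum (the number of types that occur exactly m times) and the rank probabilities of a Zipf-Mandelbrot law in its standard form, p(r) ∝ (r + b)^(-a). It is an illustration only, not the chapter's fitting procedure: the function names, the toy title words, and the parameter values are all assumptions introduced here.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Return {m: V_m}, where V_m is the number of types occurring exactly m times."""
    type_freqs = Counter(tokens)              # type -> frequency
    spectrum = Counter(type_freqs.values())   # frequency m -> number of types
    return dict(sorted(spectrum.items()))


def zipf_mandelbrot_probs(n_types, a, b):
    """Normalized Zipf-Mandelbrot probabilities for ranks 1..n_types."""
    weights = [1.0 / (r + b) ** a for r in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy input: lexemes from a handful of invented book titles.
tokens = "historia polski historia literatury polskiej zarys historia".split()

spectrum = frequency_spectrum(tokens)
print(spectrum)

# Illustrative parameter values only; they are not estimates from any corpus.
probs = zipf_mandelbrot_probs(n_types=5, a=1.1, b=2.0)
print([round(p, 3) for p in probs])
```

In the spectrum, the sum of m · V_m over all m equals the number of tokens, which gives a quick sanity check on the counts.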
Article outline
- 1. Large-scale bibliographies as text corpora
- 2. Data and hypotheses
- 3. Research method
- 4. Results: An overview
- 5. Results: Statistical distributions
- 6. Conclusions
- Notes
- Sources
- References
Published online: 22 December 2021
https://doi.org/10.1075/cilt.356.16paw