A corpus of contemporary Dutch texts written by primary school children
This short paper introduces BasiScript, a 9-million-word corpus of contemporary Dutch texts written by primary school children. The data were collected over three years, with 17,216 children contributing texts throughout this period. Each word token in the corpus is annotated with its correct orthographic form, its lemma, and its part of speech. The most frequent polysemous words have been annotated for word meaning, and all words in the lexicon derived from the BasiScript corpus have been annotated for corpus and subcorpus frequency, dispersion, length, family size, family frequency, orthographic neighborhood size, and orthographic neighborhood frequency. Images of the texts are available to researchers. The article describes the corpus and uses frequency profiling to compare BasiScript with BasiLex, a Dutch corpus of texts that primary school children are likely to read, completed in 2015.
Keywords: child language corpus, children’s written output, primary school, word properties, children’s written input
Published online: 27 December 2018
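The abstract does not say which statistic underlies the frequency profiling comparison. As a minimal sketch, assuming the log-likelihood measure that is commonly used for comparing word frequencies across two corpora, the comparison of a written-output corpus with a reading-input corpus could look as follows. The function names, the toy word counts, and the corpus sizes are hypothetical and are not taken from BasiScript or BasiLex.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood score for one word observed in two corpora.

    freq_a, freq_b: observed frequency of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in corpus A and corpus B.
    """
    total = freq_a + freq_b
    # Expected frequencies if the word were equally likely in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # the term 0 * ln(0) is treated as 0
            score += observed * math.log(observed / expected)
    return 2.0 * score


def frequency_profile(counts_a, counts_b):
    """Rank every word by how strongly its frequency differs between corpora."""
    size_a = sum(counts_a.values())
    size_b = sum(counts_b.values())
    words = set(counts_a) | set(counts_b)
    scored = [
        (word, log_likelihood(counts_a[word], counts_b[word], size_a, size_b))
        for word in words
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy counts standing in for a written and a reading corpus.
written = Counter({"ik": 50, "en": 30, "toen": 20})
reading = Counter({"ik": 10, "en": 30, "de": 60})
profile = frequency_profile(written, reading)
```

Words at the top of `profile` are those whose relative frequency differs most between the two corpora; the score alone does not say in which corpus the word is overrepresented, so the relative frequencies still need to be compared to determine the direction.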