Edited by Silvia Perpiñán, David Heap, Itziri Moreno-Villamar and Adriana Soto-Corominas
[Romance Languages and Linguistic Theory 11] 2017
pp. 169–188
We developed an automated algorithm to retrieve examples of direct object clitic doubling (DOCLD) in Spanish from electronic texts and the web. We focused on the Rioplatense dialect, where this kind of doubling is rather common. Given an electronic text, our procedure has two steps: first, the text is tagged with an available part-of-speech (PoS) tagger (TreeTagger); then the tagged text is passed to Java-based code that extracts all sentences containing direct object clitics and attempts to match each clitic to a candidate doubled NP in its sentence. The procedure identified 100% of the DOCLD cases in a short story (edited text), but only 50% of those in unedited, raw text. The missed DOCLD cases are mainly caused by misspellings and lack of punctuation in the raw texts. We discuss how to improve accuracy, mainly by reducing the number of false negatives.
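The second step of the procedure can be illustrated with a minimal sketch. It is not the authors' code: the class `CliticDoublingSketch`, its methods, and the matching rule (a proclitic lo/la/los/las followed later in the same sentence by "a", an optional article, and a noun) are hypothetical simplifications. The sketch assumes tagger output with one token per line in the form token, tab, tag, tab, lemma, and it assumes the tag names `PPC`, `ART`, `NC`, `NP` and `FS`, which should be checked against the tagset of the TreeTagger parameter file actually used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CliticDoublingSketch {
    // Third-person accusative clitics; enclitic forms such as "verlo" are not handled.
    private static final Set<String> DO_CLITICS = Set.of("lo", "la", "los", "las");

    /** Parses tagged text ("token<TAB>tag<TAB>lemma" per line) into {form, tag} pairs. */
    public static List<String[]> parse(String tagged) {
        List<String[]> tokens = new ArrayList<>();
        for (String line : tagged.split("\n")) {
            String[] cols = line.split("\t");
            if (cols.length >= 2) {
                tokens.add(new String[] {cols[0], cols[1]});
            }
        }
        return tokens;
    }

    /** Splits at the assumed sentence-final tag "FS" and collects "clitic -> noun" candidates. */
    public static List<String> findCandidates(List<String[]> tokens) {
        List<String> hits = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= tokens.size(); i++) {
            boolean sentenceEnd = i == tokens.size() || tokens.get(i)[1].equals("FS");
            if (sentenceEnd) {
                matchSentence(tokens.subList(start, i), hits);
                start = i + 1;
            }
        }
        return hits;
    }

    // Hypothetical matching rule: clitic ... "a" (+ article) + noun within one sentence.
    private static void matchSentence(List<String[]> s, List<String> hits) {
        for (int c = 0; c < s.size(); c++) {
            String form = s.get(c)[0];
            String tag = s.get(c)[1];
            // Assumed: pronoun tags start with "PP", which keeps the article "la" (ART) out.
            if (!tag.startsWith("PP") || !DO_CLITICS.contains(form.toLowerCase(Locale.ROOT))) {
                continue;
            }
            for (int j = c + 1; j < s.size() - 1; j++) {
                if (!s.get(j)[0].equalsIgnoreCase("a")) {
                    continue;
                }
                int n = j + 1;
                if (s.get(n)[1].equals("ART") && n + 1 < s.size()) {
                    n++; // skip the article to reach the head noun
                }
                if (s.get(n)[1].equals("NC") || s.get(n)[1].equals("NP")) {
                    hits.add(form + " -> " + s.get(n)[0]);
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Hand-written input in the assumed tagger format, not real TreeTagger output.
        String tagged = "Lo\tPPC\tel\nvi\tVLfin\tver\na\tPREP\ta\nJuan\tNP\tJuan\n.\tFS\t.";
        System.out.println(findCandidates(parse(tagged)));
    }
}
```

Because the sketch relies on sentence-final punctuation to delimit sentences and on exact word forms to recognise clitics, it also shows why misspellings and missing punctuation in raw text would produce false negatives.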