Chapter 8. Automatic detection of syntactic patterns from texts with application to Spanish clitic doubling

Estigarribia, Bruno

doi:10.1075/rllt.11.08est

Part of

Romance Languages and Linguistic Theory 11: Selected papers from the 44th Linguistic Symposium on Romance Languages (LSRL), London, Ontario
Edited by Silvia Perpiñán, David Heap, Itziri Moreno-Villamar and Adriana Soto-Corominas
[Romance Languages and Linguistic Theory 11] 2017
pp. 169–188

Bruno Estigarribia | University of North Carolina Chapel Hill

We developed an automated algorithm to retrieve examples of direct object clitic doubling (DOCLD) in Spanish from texts and the web. We focused on the Rioplatense dialect, where this kind of doubling is rather common. Given an electronic text, our procedure has two steps: first, tagging the text with an available part-of-speech (PoS) tagger (TreeTagger); then, inputting the tagged text into Java-based code that extracts all sentences containing direct object clitics and attempts to match each clitic to a candidate doubled NP in its sentence. The procedure identified 100% of the DOCLD cases in a short story (edited text), but only 50% in unedited, raw text. Missed DOCLD cases are mainly caused by misspellings and lack of punctuation in the raw texts. We discuss how to improve accuracy, mainly by reducing the number of false negatives.
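The second step of the procedure can be illustrated with a minimal sketch. It is written in Python rather than Java and is not the chapter's actual code. It assumes the tagger output is one token per line in tab-separated form/tag/lemma order, and that the tag names used here (PPC for clitics, PREP, ART, NC, NP, FS for sentence-final punctuation) match the tagger's Spanish parameter file. The sample input is hand-written for illustration, and the matching rule (a clitic followed later in the sentence by "a" plus an optional article plus a noun) is a deliberately simplified stand-in for candidate NP matching.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    form: str
    tag: str
    lemma: str


# Assumed tag names; adjust them to the tagset actually produced by the tagger.
DO_CLITICS = {"lo", "la", "los", "las"}
CLITIC_TAG = "PPC"
PREP_TAG = "PREP"
DET_TAGS = {"ART"}
NOUN_TAGS = {"NC", "NP"}
SENT_END_TAG = "FS"


def parse_tagged(text: str) -> List[Token]:
    """Read tab-separated form/tag/lemma lines, skipping malformed ones."""
    tokens = []
    for line in text.strip().splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            tokens.append(Token(*parts))
    return tokens


def split_sentences(tokens: List[Token]) -> List[List[Token]]:
    """Close a sentence at every token tagged as sentence-final."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.tag == SENT_END_TAG:
            sentences.append(current)
            current = []
    if current:  # text without final punctuation still yields a sentence
        sentences.append(current)
    return sentences


def find_doubled_np(sentence: List[Token], clitic_index: int) -> Optional[str]:
    """Return the first 'a (+ article) + noun' sequence after the clitic."""
    for i in range(clitic_index + 1, len(sentence)):
        if sentence[i].form.lower() == "a" and sentence[i].tag == PREP_TAG:
            j = i + 1
            if j < len(sentence) and sentence[j].tag in DET_TAGS:
                j += 1
            if j < len(sentence) and sentence[j].tag in NOUN_TAGS:
                return " ".join(t.form for t in sentence[i:j + 1])
    return None


def find_docld(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Pair each direct object clitic with a candidate doubled NP, if any."""
    hits = []
    for sentence in split_sentences(tokens):
        for idx, tok in enumerate(sentence):
            if tok.form.lower() in DO_CLITICS and tok.tag == CLITIC_TAG:
                candidate = find_doubled_np(sentence, idx)
                if candidate is not None:
                    hits.append((tok.form, candidate))
    return hits


# Hand-written toy input imitating tagger output (not real tagger output).
SAMPLE = "\n".join("\t".join(row) for row in [
    ("Lo", "PPC", "lo"), ("vi", "VLfin", "ver"), ("a", "PREP", "a"),
    ("Juan", "NP", "Juan"), (".", "FS", "."),
    ("La", "ART", "el"), ("casa", "NC", "casa"), ("es", "VSfin", "ser"),
    ("grande", "ADJ", "grande"), (".", "FS", "."),
    ("Las", "PPC", "lo"), ("vi", "VLfin", "ver"), ("a", "PREP", "a"),
    ("las", "ART", "el"), ("chicas", "NC", "chica"), (".", "FS", "."),
])

print(find_docld(parse_tagged(SAMPLE)))
# prints [('Lo', 'a Juan'), ('Las', 'a las chicas')]
```

Because the sketch relies on tags and sentence-final punctuation, a misspelled clitic or a missing full stop would change which tokens are grouped and matched, which is consistent with the sources of missed cases reported for raw text.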

Keywords: Web-as-corpus, automatic syntactic analysis, Rioplatense Spanish, corpus linguistics, parsing, pattern identification

Article outline

1. Corpus linguistics and the World Wide Web
2. Our case study: Identifying DOCLD from web texts
3. Precision and recall
4. Limitations of off-the-shelf tools (corpora and parsers)
5. Pattern identification vs. parsing
6. Curated vs. raw text
7. Our strategy
- 7.1 Description of pattern matching algorithm (CLDFinder)
8. Results
- 8.1 Edited, curated text
- 8.2 Raw text from the web
  - 8.2.1 False positives
  - 8.2.2 False negatives
9. Discussion
Notes
References

Published online: 19 October 2017

https://doi.org/10.1075/rllt.11.08est
