Article published in: Linguistics in the Netherlands 2019
Edited by Janine Berns and Elena Tribushinina
[Linguistics in the Netherlands 36] 2019
pp. 147–161
A filter for syntactically incomparable parallel sentences
Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve “free” translations. In this paper, we explore four possible filters: the Damerau-Levenshtein distance between POS tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable across language pairs, while the combination filter achieves the best results.
Keywords: filter, parallel corpus, syntactic comparability, dependency parses
Published online: 05 November 2019
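To make two of the filters named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the POS tags of each sentence are available as lists of strings, uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, and defines the length ratio as shorter over longer. The function names, the thresholds `max_norm_dist` and `min_ratio`, and the simple AND rule that stands in for the logistic regression combination are all hypothetical.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance
    between two tag sequences: insertions, deletions, substitutions and
    transpositions of adjacent tags each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent tags.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Shorter length divided by longer length; 1.0 means equal length."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def is_comparable(
    src_tags: Sequence[str],
    tgt_tags: Sequence[str],
    max_norm_dist: float = 0.5,  # hypothetical threshold
    min_ratio: float = 0.7,      # hypothetical threshold
) -> bool:
    """Keep a sentence pair only if both filters accept it (an assumed
    AND rule, not the logistic regression model from the paper)."""
    longest = max(len(src_tags), len(tgt_tags), 1)
    norm_dist = osa_distance(src_tags, tgt_tags) / longest
    return norm_dist <= max_norm_dist and length_ratio(src_tags, tgt_tags) >= min_ratio
```

Under these assumptions, a pair whose tag sequences differ only by one swap of adjacent tags, such as `["DET", "NOUN", "VERB"]` and `["DET", "VERB", "NOUN"]`, has distance 1 and is kept, whereas a pair with very different lengths is filtered out by the ratio check.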