Article published in: Linguistics in the Netherlands 2019
Edited by Janine Berns and Elena Tribushinina
[Linguistics in the Netherlands 36] 2019
pp. 147–161
Part II: Selected papers presented at the Dutch Annual Linguistics Day of 2019
A filter for syntactically incomparable parallel sentences
Large-scale automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve “free” translations. In this paper we explore four possible filters: the Damerau-Levenshtein distance between POS-tag sequences, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable across language pairs, while the combination filter achieves the best results.
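To make the first two filters concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each sentence has already been reduced to a list of POS tags, and it uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, since the abstract does not say which variant or normalisation the paper adopts. The function names `osa_distance` and `length_ratio` are hypothetical.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Operates on sequences of POS tags, so each tag counts as one symbol.
    Assumption: the paper's exact variant is not given in the abstract.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i tags of a
    for j in range(cols):
        d[0][j] = j  # insert all j tags of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent tags counts as one edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Shorter length divided by longer length; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


# Illustrative tag sequences (made up for this sketch, not corpus data).
source_tags = ["DET", "NOUN", "VERB"]
target_tags = ["DET", "VERB", "NOUN"]
distance = osa_distance(source_tags, target_tags)
ratio = length_ratio(source_tags, target_tags)
```

A filter along these lines would keep a sentence pair only if the distance falls below, or the ratio above, some threshold. How such a threshold is chosen is a separate question and is left open here.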
- 2. Syntactic comparability
- 4.1 Levenshtein distance on POS-tags
- 4.2 Sentence-length ratio
- 4.3 Graph edit distance on dependency trees
- 4.4 Combination filter
- 4.5 Automatically setting a threshold
- 5. Evaluation of the filters
Published online: 05 November 2019