Sentence splitting in Arabic to Spanish translation
Modern Standard Arabic makes extensive use of coordination particles whereas punctuation marks are scarce and
erratic, leading to long clauses. This is generally assumed to hinder Sentence Boundary Detection and to promote sentence
splitting when translating from Arabic into English. Previous literature on translation from Arabic to Spanish is practically
nonexistent. We tested this hypothesis for translation from Arabic to Spanish on a sample of 282,714 graphic words
extracted from a bilingual corpus of 8,681,110 graphic words and found that each Arabic sentence yielded an average of 1.5 Spanish
sentences. Furthermore, our data show the potential impact of directionality, in that sentence splitting when translating from
Arabic into Spanish is 50% more frequent than from English into Arabic. We also determined statistically that five elements
(wa [و], ḥaythu [حيث], kamā [كما],
wa-qad [وقد], and wa-dhalika [وذلك]) are the most salient potential markers for sentence splitting in the resulting Spanish
translations. Our findings should be particularly interesting for Computational Linguistics and translator training.
Article outline
- 1. Introduction
- 2. State of the art
- 2.1 Automatic segmentation and alignment in Arabic
- 2.2 Translation studies
- 2.3 Conclusions
- 3. Methodology
- 3.1 Corpus
- 3.2 Sample
- 3.3 Elements of distortion
- 3.4 Segmentation
- 4. Data
- 5. Discussion
- 5.1 Representativity, sentences, words, and average number of words per sentence
- 5.2 Splitting types
- 6. Conclusions and future work
References