The quest for Croatian idioms as multiword units
Idiomatic expressions are a type of multiword unit (MWU) in which the meaning of the unit does not equal the cumulative
meaning of its parts. They are culturally dependent, so their translation cannot be inferred from the expression itself.
The Croatian language has a very rich idiomatic structure. A few such expressions can be understood in direct
translation, but most differ in meaning from their literal translations. Because idioms are rooted in the tradition of
the language and society from which they hail, they need special treatment in computational linguistics. Using NooJ as
an NLP tool, we describe different types of Croatian idioms, a classification that helps us recognize them in texts.
Idiom recognition deserves special treatment because it is a major task in translation.
Article outline
- 1. Introduction
- 2. Theoretical background
- 2.1 Idioms as a type of MWU in the Croatian language
- 2.2 Importance of idiom detection in translation
- 2.3 Previous work
- 3. Corpus of Croatian idioms
- 4. NooJ – NLP tool of our choice
- 5. Dictionaries and syntactic grammars
- 6. Classification of idioms
- 6.1 Idioms of Type 1
- 6.2 Idioms of Type 2
- 6.3 Idioms of Type 3
- 6.4 Idioms of Type 4
- 6.5 Idioms of Type 5
- 7. Results
- 8. Conclusion
- Acknowledgements
- Notes
- References