A flexible framework for collocation retrieval and translation from parallel and comparable corpora
This paper outlines a methodology and a system for collocation retrieval and translation from parallel and comparable
corpora, developed with translators and language learners in mind. It is based on a phraseology framework, applies
statistical techniques, and employs source tools and online resources. Collocation retrieval and translation have
proved successful for English and Spanish and can be easily adapted to other languages. The evaluation results are
promising and future goals are proposed. Furthermore, conclusions are drawn on the nature of comparable corpora and
how they can be better exploited to suit particular needs of target users.
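As a rough illustration of what statistically scoring collocation candidates can look like, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI). The abstract does not say which association measures the system uses, so PMI is an assumption made for illustration only. The function name `pmi_scores`, the `min_count` threshold, and the toy sentence are likewise hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Illustrative only: PMI is one common association measure and is an
    assumption here, not the measure reported in the paper.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Skip rare pairs: PMI is unreliable for low-frequency events.
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy input (hypothetical); a real run would use a tokenised corpus.
tokens = (
    "she made a decision and he made a decision "
    "but they took a break and we took a break"
).split()

ranked = sorted(pmi_scores(tokens).items(), key=lambda kv: kv[1], reverse=True)
for pair, score in ranked:
    print(pair, round(score, 2))
```

This sketch only looks at adjacent words. Collocations such as verb-noun pairs are often separated by other words, so a realistic candidate selection step would probably rely on part-of-speech patterns or a wider window rather than plain bigrams.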
Article outline
- 1. Introduction
- 2. Phraseology
- 2.1 Typologies of collocations
- 2.2 Transfer rules
- 3. Related work
- 3.1 Collocation retrieval
- 3.2 Parallel corpora
- 3.3 Comparable corpora
- 4. System
- 4.1 Candidate selection module
- 4.2 Candidate filtering module
- 4.3 Dictionary look-up module
- 4.4 Parallel corpora module
- 4.5 Comparable corpora module
- 5. Evaluation
- 5.1 Experimental setup
- 5.2 Experimental results
- 5.3 Discussion and future work
- Acknowledgements
- Notes
- References