A corpus-based study of the automatic extraction and validation of V-N Italian oral academic collocations

Peppoloni, Diana

doi:10.1075/li.00022.pep

Article published in:

Lingvisticæ Investigationes
Vol. 41:2 (2018), pp. 240–268

Diana Peppoloni | University for Foreigners of Perugia

This study describes the outcomes of a POS-based method for the automatic extraction of Italian verb-noun (V-N) oral academic collocations from an annotated corpus. A frequency-based statistical measure is applied to extract the collocations automatically from the POS-tagged corpus. The results reveal that frequency alone is not sufficient to measure the degree of association between the two elements of a word pair. To identify the collocations genuinely attested in Italian, the data were further evaluated by 50 native speakers of Italian. The results indicate that these combinations are tightly linked to their context of use. Native speakers should therefore be exposed to these phrasal contexts so that they can activate their mechanisms of explicit reflection and assess the degree of collocativity of the combinations.

Keywords: collocations, academic lexicon, computational methods, applied linguistics, corpus linguistics
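
As a rough illustration of the kind of POS-based, frequency-driven extraction the abstract describes, the following minimal sketch counts verb-noun lemma pairs in a tiny hand-made corpus. It is not the author's implementation: the toy sentences, the Universal POS tag names (`VERB`, `NOUN`), the three-token window, the verb-before-noun order, and the frequency threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (lemma, POS) pairs.
# The tag names follow the Universal POS convention, which is an assumption;
# the tagset actually used for the ASIC corpus is not stated here.
corpus = [
    [("fare", "VERB"), ("un", "DET"), ("esempio", "NOUN")],
    [("porre", "VERB"), ("una", "DET"), ("domanda", "NOUN")],
    [("fare", "VERB"), ("un", "DET"), ("altro", "ADJ"), ("esempio", "NOUN")],
    [("prendere", "VERB"), ("appunto", "NOUN")],
]


def extract_vn_pairs(sentences, window=3):
    """Count (verb lemma, noun lemma) pairs where the noun follows the verb
    within `window` tokens. The window size and the verb-before-noun order
    are illustrative choices, not taken from the article."""
    counts = Counter()
    for sentence in sentences:
        for i, (lemma, pos) in enumerate(sentence):
            if pos != "VERB":
                continue
            # Look ahead at most `window` tokens for the nearest noun.
            for next_lemma, next_pos in sentence[i + 1 : i + 1 + window]:
                if next_pos == "NOUN":
                    counts[(lemma, next_lemma)] += 1
                    break  # pair the verb with the nearest noun only
    return counts


def filter_by_frequency(counts, min_freq=2):
    """Keep pairs occurring at least `min_freq` times, most frequent first."""
    return [(pair, n) for pair, n in counts.most_common() if n >= min_freq]


vn_counts = extract_vn_pairs(corpus)
candidates = filter_by_frequency(vn_counts, min_freq=2)
```

A raw count of this kind only ranks candidate pairs. It says nothing about how strongly the verb and the noun are associated, which is consistent with the abstract's observation that frequency alone is insufficient and that a further validation step with native speakers was needed.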

Article outline

Introduction
1. Towards a definition of “collocation”
- 1.1 Collocations in applied linguistics
2. Data and methodology
- 2.1 Collecting data for structuring the ASIC corpus
- 2.2 Extracting and filtering collocations from the ASIC corpus
- 2.3 Validation of the extracted academic Italian collocation list
  - 2.3.1 Results of the crowd sourcing experiment
  - 2.3.2 Double validation of the data
Discussion and conclusions
Acknowledgements
Notes
References

Published online: 4 February 2019
