Detecting contrast patterns in newspaper articles by combining discourse analysis and text mining

Senja Pollak, Roel Coesemans, Walter Daelemans and Nada Lavrač


Text mining aims at constructing classification models and finding interesting patterns in large text collections. This paper investigates the utility of applying these techniques to media analysis, more specifically to support discourse analysis of news reports about the 2007 Kenyan elections and post-election crisis in local (Kenyan) and Western (British and US) newspapers. It illustrates how text mining methods can assist discourse analysis by finding contrast patterns which provide evidence for ideological differences between local and international press coverage. Our experiments indicate that the most significant differences pertain to the interpretive frame of the news events: whereas the newspapers from the UK and the US focus on ethnicity in their coverage, the Kenyan press concentrates on sociopolitical aspects.
