A semi-supervised algorithm for detecting extremism propaganda diffusion on social media
Extremist online networks reportedly tend to use Twitter and other Social Networking Sites (SNS) in order to issue
propaganda and recruitment statements. Traditional machine learning models may encounter problems when used in such a context, due
to the peculiarities of microblogging sites and the manner in which these networks interact (both among themselves and with
other networks). Moreover, state-of-the-art approaches have focused on non-transparent techniques that cannot be audited, so
although they are the top-performing techniques, it is impossible to check whether the models are actually fair. In this
paper, we present a semi-supervised methodology that uses our Discriminatory Expressions algorithm for feature selection to detect
expressions that are biased towards extremist content (Francisco and Castro 2020). With the help of human experts, the relevant
expressions are filtered and used to retrieve further
extremist content in order to iteratively provide a set of relevant and accurate expressions. These discriminatory expressions
have been shown to produce less complex models that are easier to comprehend, and thus to improve model transparency. In the
following, we present close to 70 expressions that were discovered by using this method alongside the validation test of the
algorithm in several different contexts.
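The iterative loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scoring function is a simple smoothed ratio of document frequencies standing in for the Discriminatory Expressions algorithm, the names `tokenize`, `score_expressions`, `discover_expressions` and `expert_accepts` are hypothetical, the human expert is simulated by a callback, and the toy documents use neutral placeholder tokens.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real tweets need more care)."""
    return set(text.lower().split())


def score_expressions(docs, labels):
    """Smoothed ratio of document frequencies: target class vs. the rest.

    Placeholder scoring, NOT the Discriminatory Expressions algorithm itself.
    """
    pos, neg = Counter(), Counter()
    for text, is_target in zip(docs, labels):
        (pos if is_target else neg).update(tokenize(text))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {
        tok: ((count + 1) / (n_pos + 2)) / ((neg[tok] + 1) / (n_neg + 2))
        for tok, count in pos.items()
    }


def discover_expressions(docs, labels, pool, expert_accepts, rounds=3, top_k=3):
    """Alternate between scoring, expert filtering and retrieval."""
    docs, labels, pool = list(docs), list(labels), list(pool)
    accepted, reviewed = set(), set()
    for _ in range(rounds):
        scores = score_expressions(docs, labels)
        # Rank expressions the expert has not seen yet; ties break alphabetically.
        ranked = sorted(
            (tok for tok in scores if tok not in reviewed),
            key=lambda tok: (-scores[tok], tok),
        )
        candidates = ranked[:top_k]
        if not candidates:
            break
        reviewed.update(candidates)
        accepted.update(tok for tok in candidates if expert_accepts(tok))
        # Retrieve unlabeled documents that contain an accepted expression and
        # treat them as target-class examples (a pseudo-labelling assumption).
        retrieved = [doc for doc in pool if tokenize(doc) & accepted]
        pool = [doc for doc in pool if not (tokenize(doc) & accepted)]
        docs += retrieved
        labels += [True] * len(retrieved)
    return accepted


# Toy data with placeholder tokens; True marks the target class.
seed_docs = ["join alpha now", "alpha beta call", "weather is nice", "nice football match"]
seed_labels = [True, True, False, False]
unlabeled = ["beta gamma rally", "gamma rally tonight", "football tonight"]
relevant = {"alpha", "beta", "gamma", "rally"}  # simulated expert judgement

found = discover_expressions(seed_docs, seed_labels, unlabeled, lambda tok: tok in relevant)
print(sorted(found))  # ['alpha', 'beta', 'gamma', 'rally']
```

In this toy run, "gamma" and "rally" never occur in the seed documents; they surface only after documents retrieved through "beta" and "gamma" are added to the labelled set, which is the effect the iterative retrieval step is meant to have.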
Article outline
- Introduction
- Theoretical background
- What is a model and how do we train it?
- How can we check that the model learned correctly?
- Can models be interpreted by humans?
- How do models deal with Natural Language documents?
- Is it possible to reduce the dimensionality of the vector representation?
- What are the reference filtering methods?
- CHI2 (chi-square)
- Information gain (IG)
- Mutual information (MI)
- Odds ratio (OR)
- Expected cross entropy (ECE)
- ANOVA F-value
- Galavotti-Sebastiani-Simi coefficient (GSS)
- Are filters going to help us comprehend models?
- How can we be sure that this is the way to go?
- Discriminatory expressions (DE)
- Definition (Expression)
- Definition (Discriminatory Expression)
- Methodology
- Experiments
- Performance and comprehensibility tests
- Application-related tests
- Results and discussion
- Application-specific results
- Limitations
- Conclusions
- Future work
- Notes
- References
References (37)
Alharbi, Ahmed S. M., and Elise de Doncker. 2019. ‘Twitter Sentiment Analysis with a Deep Neural Network: An Enhanced Approach Using User Behavioral Information’. Cognitive Systems Research 54: 50–61.
Al-Salemi, Bassam, Shahrul Azman Mohd Noah, and Mohd Juzaiddin Ab Aziz. 2016. ‘RFBoost: An Improved Multi-Label Boosting Algorithm and Its Application to Text Categorisation’. Knowledge-Based Systems 103 (July): 104–17.
Alvari, Hamidreza, Soumajyoti Sarkar, and Paulo Shakarian. 2019. ‘Detection of Violent Extremists in Social Media’. ArXiv:1902.01577 [Cs], February. [URL].
Ashktorab, Zahra, Christopher Brown, Manojit Nandi, and Aron Culotta. 2014. ‘Tweedr: Mining Twitter to Inform Disaster Response’. In ISCRAM.
Benigni, Matthew C., Kenneth Joseph, and Kathleen M. Carley. 2017. ‘Online Extremism and the Communities That Sustain It: Detecting the ISIS Supporting Community on Twitter’. PLOS ONE 12 (12): e0181405.
Caropreso, Maria Fernanda, Stan Matwin, and Fabrizio Sebastiani. 2001. ‘A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization’, 151.
Cowan, Nelson. 2001. ‘The Magical Number 4 in Short-Term Memory: A Reconsideration of Mental Storage Capacity’. The Behavioral and Brain Sciences 24 (1): 87–114; discussion 114–185.
Deng, Xuelian, Yuqing Li, Jian Weng, and Jilian Zhang. 2019. ‘Feature Selection for Text Classification: A Review’. Multimedia Tools and Applications 78 (3): 3797–3816.
Ding, Jianli, and Liyang Fu. 2018. ‘A Hybrid Feature Selection Algorithm Based on Information Gain and Sequential Forward Floating Search’. Journal of Intelligent Computing 9 (3): 93.
FAT/ML. n.d. ‘Principles for Accountable Algorithms and a Social Impact Statement for Algorithms’. Accessed 8 January 2019. [URL]
Forman, George. 2003. ‘An Extensive Empirical Study of Feature Selection Metrics for Text Classification’. Journal of Machine Learning Research – JMLR 3 (March).
Francisco, Manuel, and Juan Luis Castro. 2020. ‘Discriminatory Expressions to Produce Interpretable Models in Microblogging Context’. ArXiv:2012.02104 [Cs], November. [URL]
Galavotti, Luigi, Fabrizio Sebastiani, and Maria Simi. 2000. ‘Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization’. In Research and Advanced Technology for Digital Libraries, edited by José Borbinha and Thomas Baker, 59–68. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.
Go, Alec, Richa Bhayani, and Lei Huang. 2009. ‘Twitter Sentiment Classification Using Distant Supervision’. Processing 150 (January).
Harris, Zellig S. 1954. ‘Distributional Structure’. Word 10 (2–3): 146–62.
Kotzias, Dimitrios, Misha Denil, Nando de Freitas, and Padhraic Smyth. 2015. ‘From Group to Individual Labels Using Deep Features’. In KDD ’15.
Kubat, Miroslav. 2017. An Introduction to Machine Learning. Cham: Springer International Publishing.
Largeron, Christine, Christophe Moulin, and Mathias Géry. 2011. ‘Entropy Based Feature Selection for Text Categorization’. In ACM Symposium on Applied Computing, edited by William C. Chu, W. Eric Wong, Mathew J. Palakal, and Chih-Cheng Hung, 924–28. TaiChung, Taiwan: ACM.
Miller, George A. 1956. ‘The Magical Number Seven, plus or Minus Two: Some Limits on Our Capacity for Processing Information’. Psychological Review 63 (2): 81–97.
Misangyi, Vilmos F., Jeffery A. LePine, James Algina, and Francis Goeddeke Jr. 2016. ‘The Adequacy of Repeated-Measures Regression for Multilevel Research: Comparisons With Repeated-Measures ANOVA, Multivariate Repeated-Measures ANOVA, and Multilevel Modeling Across Various Multilevel Research Designs’. Organizational Research Methods, June.
O’Dair, M., and A. Fry. 2019. ‘Beyond the Black Box in Music Streaming: The Impact of Recommendation Systems upon Artists’. Popular Communication.
Periñán-Pascual, Carlos, and Francisco Arcas-Túnez. 2019. ‘Detecting Environmentally-Related Problems on Twitter’. Biosystems Engineering, Intelligent Systems for Environmental Applications, 177 (January): 31–48.
Phillips, Avery. 2018. ‘The Moral Dilemma of Algorithmic Censorship’. Becoming Human: Artificial Intelligence Magazine. 27 August 2018. [URL]
Rudin, Cynthia. 2018. ‘Please Stop Explaining Black Box Models for High Stakes Decisions’. ArXiv:1811.10154 [Cs, Stat], November. [URL]
Rutkowski, Leszek, Ryszard Tadeusiewicz, Lofti A. Zadeh, and Jacek M. Zurada. 2008. Artificial Intelligence and Soft Computing – ICAISC 2008: 9th International Conference Zakopane, Poland, June 22–26, 2008, Proceedings. Springer Science & Business Media.
Senthil, Kumar B., and Varma E. Bhavitha. 2016. ‘A Different Type of Feature Selection Methods for Text Categorization on Imbalanced Data’ 5 (9): 7.
Sparck-Jones, Karen. 1972. ‘A Statistical Interpretation of Term Specificity and Its Application in Retrieval’. Journal of Documentation 28 (1): 11–21.
Twitter Inc. 2019. ‘Q1 2019 Earning Report’. [URL]
‘Twitter Usage Statistics – Internet Live Stats’. 2013. [URL]
Villena-Román, Julio, Sara Lana-Serrano, Eugenio Martínez-Cámara, and José Carlos González-Cristóbal. 2013. ‘TASS – Workshop on Sentiment Analysis at SEPLN’. Procesamiento del Lenguaje Natural 50 (0): 37–44.
Wang, Hao, Dogan Can, Abe Kazemzadeh, François Bar, and Shrikanth Narayanan. 2012. ‘A System for Real-Time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle’. In Proceedings of the ACL 2012 System Demonstrations, 115–20. ACL ’12. Stroudsburg, Penn.: Association for Computational Linguistics. [URL]
Wu, Guohua, Liuyang Wang, Nailiang Zhao, and Hairong Lin. 2015. ‘Improved Expected Cross Entropy Method for Text Feature Selection’. In 2015 International Conference on Computer Science and Mechanical Automation (CSMA), 49–54.
Xu, Yan, Gareth Jones, Jintao Li, Bin Wang, and Chunming Sun. 2007. ‘A Study on Mutual Information-Based Feature Selection for Text Categorization’. Journal of Computational Information Systems 3 (March).
Xue, Bing, Mengjie Zhang, and Will Browne. 2013. ‘Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach’. IEEE Transactions on Cybernetics 43 (December): 1656–71.
Zhao, Z., M. Gao, J. Yu, Y. Song, X. Wang, and M. Zhang. 2018. ‘Impact of the Important Users on Social Recommendation System’. Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST 252: 425–34.
Zheng, Hai-Tao, Zhe Wang, Wei Wang, Arun Kumar Sangaiah, Xi Xiao, and Congzhi Zhao. 2018. ‘Learning-Based Topic Detection Using Multiple Features’. Concurrency and Computation-Practice & Experience 30 (15): e4444.
Zheng, Zhaohui, Xiaoyun Wu, and Rohini Srihari. 2004. ‘Feature Selection for Text Categorization on Imbalanced Data’. ACM SIGKDD Explorations Newsletter 6 (1): 80–89.
Cited by (1)
Wang, Mengdi, Xiaobing Peng & Liang Zhuang. 2023. Publicity governance in contingency management during the COVID-19 pandemic in China: A “Government-Society” perspective. PLOS ONE 18:11, pp. e0293210 ff.
This list is based on CrossRef data as of 5 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers; any errors therein should be reported to them.