A semi-supervised algorithm for detecting extremism propaganda diffusion on social media
Extremist online networks reportedly use Twitter and other social networking sites (SNS) to issue propaganda and
recruitment statements. Traditional machine learning models may struggle in this context because of the peculiarities of
microblogging sites and the way these networks interact, both among themselves and with other networks. Moreover,
state-of-the-art approaches have focused on non-transparent techniques that cannot be audited; although they perform best,
it is impossible to check whether the resulting models are actually fair. In this paper, we present a semi-supervised
methodology that uses our Discriminatory Expressions algorithm for feature selection to detect expressions that are biased
towards extremist content (Francisco and Castro 2020). With the help of human experts, the relevant expressions are filtered
and used to retrieve further extremist content, iteratively yielding a set of relevant and accurate expressions. These
discriminatory expressions have been shown to produce less complex models that are easier to comprehend, and thus improve
model transparency. We present close to 70 expressions discovered with this method, alongside validation tests of the
algorithm in several different contexts.
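The iterative loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the Discriminatory Expressions algorithm of Francisco and Castro (2020): expressions are taken to be word n-grams, the bias of an expression is approximated by a simple precision threshold, and the names `ngrams`, `biased_expressions`, `expand`, and `expert_accepts` are hypothetical. The human expert is modelled as a callback that accepts or rejects each candidate expression.

```python
from collections import Counter
from typing import Callable


def ngrams(text: str, max_n: int = 2) -> set[str]:
    """Return the set of word n-grams (n = 1..max_n) of a lowercased text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def biased_expressions(positive: list[str], negative: list[str],
                       min_support: int = 2,
                       min_precision: float = 0.9) -> list[str]:
    """Stand-in scoring (an assumption): keep expressions that occur in at
    least `min_support` positive documents and almost never in negative ones."""
    pos_counts = Counter(e for doc in positive for e in ngrams(doc))
    neg_counts = Counter(e for doc in negative for e in ngrams(doc))
    selected = []
    for expr, pos in pos_counts.items():
        precision = pos / (pos + neg_counts[expr])
        if pos >= min_support and precision >= min_precision:
            selected.append(expr)
    return sorted(selected)


def expand(seed_positive: list[str], negative: list[str], pool: list[str],
           expert_accepts: Callable[[str], bool], rounds: int = 3) -> set[str]:
    """Alternate between expression selection, expert filtering and retrieval."""
    positive = list(seed_positive)
    remaining = list(pool)
    accepted: set[str] = set()
    for _ in range(rounds):
        candidates = biased_expressions(positive, negative)
        # Only expressions validated by the expert are kept.
        new = {e for e in candidates if e not in accepted and expert_accepts(e)}
        if not new:
            break  # nothing new was accepted, so the set has stabilised
        accepted |= new
        # Retrieve unlabelled documents that contain an accepted expression.
        # Treating them as positive is the semi-supervised assumption here.
        retrieved = [d for d in remaining if ngrams(d) & accepted]
        remaining = [d for d in remaining if not (ngrams(d) & accepted)]
        positive.extend(retrieved)
    return accepted
```

In practice the `expert_accepts` callback would be replaced by a manual review step, and the precision threshold by whichever feature selection score is being evaluated.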
Article outline
- Introduction
- Theoretical background
- What is a model and how do we train it?
- How can we check that the model learned correctly?
- Can models be interpreted by humans?
- How do models deal with Natural Language documents?
- Is it possible to reduce the dimensionality of the vector representation?
- What are the reference filtering methods?
- CHI2 (chi-square)
- Information gain (IG)
- Mutual information (MI)
- Odds ratio (OR)
- Expected cross entropy (ECE)
- ANOVA F-value
- Galavotti-Sebastiani-Simi coefficient (GSS)
- Are filters going to help us comprehend models?
- How can we be sure that this is the way to go?
- Discriminatory expressions (DE)
- Definition (Expression)
- Definition (Discriminatory Expression)
- Methodology
- Experiments
- Performance and comprehensibility tests
- Application-related tests
- Results and discussion
- Application-specific results
- Limitations
- Conclusions
- Future work
- Notes
- References
References (37)
Alharbi, Ahmed S. M., and Elise de Doncker
2019 ‘Twitter Sentiment Analysis with a Deep Neural Network: An Enhanced Approach Using User Behavioral Information’. Cognitive Systems Research 54: 50–61.
Al-Salemi, Bassam, Shahrul Azman Mohd Noah, and Mohd Juzaiddin Ab Aziz
2016 ‘RFBoost: An Improved Multi-Label Boosting Algorithm and Its Application to Text Categorisation’. Knowledge-Based Systems 103 (July): 104–17.
Alvari, Hamidreza, Soumajyoti Sarkar, and Paulo Shakarian
2019 ‘Detection of Violent Extremists in Social Media’. ArXiv:1902.01577 [Cs], February. [URL].
Ashktorab, Zahra, Christopher Brown, Manojit Nandi, and Aron Culotta
2014 ‘Tweedr: Mining Twitter to Inform Disaster Response’. In ISCRAM.
Benigni, Matthew C., Kenneth Joseph, and Kathleen M. Carley
2017 ‘Online Extremism and the Communities That Sustain It: Detecting the ISIS Supporting Community on Twitter’. PLOS ONE 12 (12): e0181405.
Caropreso, Maria Fernanda, Stan Matwin, and Fabrizio Sebastiani
2001 ‘A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization’, 151.
Cowan, Nelson
2001 ‘The Magical Number 4 in Short-Term Memory: A Reconsideration of Mental Storage Capacity’. The Behavioral and Brain Sciences 24 (1): 87–114; discussion 114–185.
Deng, Xuelian, Yuqing Li, Jian Weng, and Jilian Zhang
2019 ‘Feature Selection for Text Classification: A Review’. Multimedia Tools and Applications 78 (3): 3797–3816.
Ding, Jianli, and Liyang Fu
2018 ‘A Hybrid Feature Selection Algorithm Based on Information Gain and Sequential Forward Floating Search’. Journal of Intelligent Computing 9 (3): 93.
FAT/ML
n.d. ‘Principles for Accountable Algorithms and a Social Impact Statement for Algorithms’. Accessed 8 January 2019. [URL]
Forman, George
2003 ‘An Extensive Empirical Study of Feature Selection Metrics for Text Classification’. Journal of Machine Learning Research – JMLR 3 (March).
Francisco, Manuel, and Juan Luis Castro
2020 ‘Discriminatory Expressions to Produce Interpretable Models in Microblogging Context’. ArXiv:2012.02104 [Cs], November. [URL]
Galavotti, Luigi, Fabrizio Sebastiani, and Maria Simi
2000 ‘Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization’. In Research and Advanced Technology for Digital Libraries, edited by José Borbinha and Thomas Baker, 59–68. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.
Go, Alec, Richa Bhayani, and Lei Huang
2009 ‘Twitter Sentiment Classification Using Distant Supervision’. Processing 150 (January).
Harris, Zellig S.
1954 ‘Distributional Structure’. Word 10 (2–3): 146–62.
Kotzias, Dimitrios, Misha Denil, Nando de Freitas, and Padhraic Smyth
2015 ‘From Group to Individual Labels Using Deep Features’. In KDD ’15.
Kubat, Miroslav
2017 An Introduction to Machine Learning. Cham: Springer International Publishing.
Largeron, Christine, Christophe Moulin, and Mathias Géry
2011 ‘Entropy Based Feature Selection for Text Categorization’. In ACM Symposium on Applied Computing, edited by William C. Chu, W. Eric Wong, Mathew J. Palakal, and Chih-Cheng Hung, 924–28. TaiChung, Taiwan: ACM.
Miller, George A.
1956 ‘The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information’. Psychological Review 63 (2): 81–97.
Misangyi, Vilmos F., Jeffery A. LePine, James Algina, and Francis Goeddeke Jr.
2016 ‘The Adequacy of Repeated-Measures Regression for Multilevel Research: Comparisons With Repeated-Measures ANOVA, Multivariate Repeated-Measures ANOVA, and Multilevel Modeling Across Various Multilevel Research Designs’. Organizational Research Methods, June.
O’Dair, M., and A. Fry
2019 ‘Beyond the Black Box in Music Streaming: The Impact of Recommendation Systems upon Artists’. Popular Communication.
Periñán-Pascual, Carlos, and Francisco Arcas-Túnez
2019 ‘Detecting Environmentally-Related Problems on Twitter’. Biosystems Engineering, Intelligent Systems for Environmental Applications, 177 (January): 31–48.
Phillips, Avery
2018 ‘The Moral Dilemma of Algorithmic Censorship’. Becoming Human: Artificial Intelligence Magazine. 27 August 2018. [URL]
Rudin, Cynthia
2018 ‘Please Stop Explaining Black Box Models for High Stakes Decisions’. ArXiv:1811.10154 [Cs, Stat], November. [URL]
Rutkowski, Leszek, Ryszard Tadeusiewicz, Lotfi A. Zadeh, and Jacek M. Zurada
2008 Artificial Intelligence and Soft Computing – ICAISC 2008: 9th International Conference Zakopane, Poland, June 22–26, 2008, Proceedings. Springer Science & Business Media.
Senthil, Kumar B. and Varma E. Bhavitha
2016 ‘A Different Type of Feature Selection Methods for Text Categorization on Imbalanced Data’ 5 (9): 7.
Sparck-Jones, Karen
1972 ‘A Statistical Interpretation of Term Specificity and Its Application in Retrieval’. Journal of Documentation 28 (1): 11–21.
Twitter Inc.
2019 ‘Q1 2019 Earning Report’. [URL]
‘Twitter Usage Statistics – Internet Live Stats’
2013 [URL]
Villena-Román, Julio, Sara Lana-Serrano, Eugenio Martínez-Cámara, and José Carlos González-Cristóbal
2013 ‘TASS – Workshop on Sentiment Analysis at SEPLN’. Procesamiento del Lenguaje Natural 50 (0): 37–44.
Wang, Hao, Dogan Can, Abe Kazemzadeh, François Bar, and Shrikanth Narayanan
2012 ‘A System for Real-Time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle’. In Proceedings of the ACL 2012 System Demonstrations, 115–20. ACL ’12. Stroudsburg, Penn.: Association for Computational Linguistics. [URL]
Wu, Guohua, Liuyang Wang, Nailiang Zhao, and Hairong Lin
2015 ‘Improved Expected Cross Entropy Method for Text Feature Selection’. In 2015 International Conference on Computer Science and Mechanical Automation (CSMA), 49–54.
Xu, Yan, Gareth Jones, Jintao Li, Bin Wang, and Chunming Sun
2007 ‘A Study on Mutual Information-Based Feature Selection for Text Categorization’. Journal of Computational Information Systems 3 (March).
Xue, Bing, Mengjie Zhang, and Will Browne
2013 ‘Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach’. IEEE Transactions on Cybernetics 43 (December): 1656–71.
Zhao, Z., M. Gao, J. Yu, Y. Song, X. Wang, and M. Zhang
2018 ‘Impact of the Important Users on Social Recommendation System’. Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST 252: 425–34.
Zheng, Hai-Tao, Zhe Wang, Wei Wang, Arun Kumar Sangaiah, Xi Xiao, and Congzhi Zhao
2018 ‘Learning-Based Topic Detection Using Multiple Features’. Concurrency and Computation-Practice & Experience 30 (15): e4444.
Zheng, Zhaohui, Xiaoyun Wu, and Rohini Srihari
2004 ‘Feature Selection for Text Categorization on Imbalanced Data’. ACM SIGKDD Explorations Newsletter 6 (1): 80–89.
Cited by (1)
Wang, Mengdi, Xiaobing Peng, and Liang Zhuang
2023 ‘Publicity Governance in Contingency Management during the COVID-19 Pandemic in China: A “Government-Society” Perspective’. PLOS ONE 18 (11): e0293210 ff.
This list is based on CrossRef data as of 5 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers.
Any errors therein should be reported to them.