Using word n-grams to identify authors and idiolects: A corpus approach to a forensic linguistic problem

Wright, David

doi:10.1075/ijcl.22.2.03wri

Article published in:

International Journal of Corpus Linguistics
Vol. 22:2 (2017), pp. 212–241


David Wright | Nottingham Trent University

Forensic authorship attribution is concerned with identifying the writers of anonymous criminal documents. Over the last twenty years, computer scientists have developed a wide range of statistical procedures using a number of different linguistic features to measure similarity between texts. However, much of this work is not of practical use to forensic linguists who need to explain in reports or in court why a particular method of identifying potential authors works. This paper sets out to address this problem using a corpus linguistic approach and the 176-author 2.5 million-word Enron Email Corpus. Drawing on literature positing the idiolectal nature of collocations, phrases and word sequences, this paper tests the accuracy of word n-grams in identifying the authors of anonymised email samples. Moving beyond the statistical analysis, the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
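To make the idea of attribution by word n-grams concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the procedure used in the paper: the abstract does not name a similarity measure, so the Jaccard coefficient is assumed here, and the function names (`word_ngrams`, `jaccard`, `attribute`), the whitespace tokenisation and the toy sentences are all hypothetical.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in a lower-cased text.

    Tokenisation is a plain whitespace split, which is an assumption
    made for brevity; real email data would need more careful handling.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient: shared items divided by all distinct items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def attribute(sample, candidates, n=3):
    """Score each candidate author against the sample and return the best.

    `candidates` maps an author label to that author's known writing.
    Returns (best_author, scores), where scores maps author -> similarity.
    """
    sample_grams = word_ngrams(sample, n)
    scores = {
        author: jaccard(sample_grams, word_ngrams(text, n))
        for author, text in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Invented toy data, not taken from the Enron Email Corpus.
candidates = {
    "author_a": "please let me know if you have any questions",
    "author_b": "give me a call if there are any problems",
}
sample = "let me know if you need anything"

best, scores = attribute(sample, candidates, n=3)
print(best, scores)
```

Representing n-grams as a set records only whether a sequence occurs, not how often, which suits a presence-based overlap measure of this kind; a frequency-based comparison would need counts instead.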

Keywords: forensic linguistics, idiolect, authorship attribution, entrenchment, Enron

Article outline

1. The linguistic individual, corpora and forensic linguistics
2. Word strings as features in authorship analysis
- 2.1 Word strings, routine and the individual
- 2.2 Empirical evidence for idiolectal word strings
- 2.3 ‘Word n-grams’ in this study
3. Methodology
- 3.1 The Enron Email Corpus
- 3.2 The authorship attribution experiment
4. Attribution results
- 4.1 Effect of sample size
- 4.2 Performance of different n-gram lengths
- 4.3 Performance across authors
5. Identifying idiolectal word n-grams
6. Conclusions and implications
Acknowledgements
References

Published online: 16 October 2017


