Science, stylometry and the real identity of Q
The New York Times has accused me of writing Q’s posts. Its main argument is technological: using software that identifies a writer’s unique fingerprint to determine authorship, two different teams have concluded that I wrote the original Q posts. I won’t get into the physical evidence that I cannot be Q; that is a book-length discussion which you can read for yourself here. Instead I want to talk about stylometry, the method which has apparently pointed the finger so damningly at me as the author.
“Instead of relying on expert opinion, the computer scientists used a mathematical approach known as stylometry. Practitioners say they have replaced the art of the older studies with a new form of science, yielding results that are measurable, consistent and replicable. Sophisticated software broke down the Q texts into patterns of three-character sequences and tracked the recurrence of each possible combination. Their technique does not highlight memorable, idiosyncratic word choices the way that earlier forensic linguists often did. But the advocates of stylometry note that they can quantify their software’s error rate. The Swiss team said its accuracy rate was about 93 percent. The French team said its software correctly identified Mr. Watkins’s writing in 99 percent of tests and Mr. Furber’s in 98 percent.”
Really? That sounds very impressive. As it happens, I know quite a bit about stylometry. I used to be a digital forensics consultant, and my expert testimony in court has resulted in multi-million-dollar awards to my clients in civil cases. Let’s give it a try for this case. We’ll need Q’s posts stripped of all other users’ comments as well as the cruft that comes from 4chan and 8chan, so just the text of his posts. You can download that for yourself here. It’s in reverse chronological order, from January 5, 2018 back to October 28, 2017, and is just over 20 000 words, a nice size for our purposes.
To compare, we’ll need a sample of my writing. I’ve taken the first third of my book, Q: Inside the Greatest Intelligence Drop in History, before I start talking about Q and with all quotes by other people removed. You can download this from here.
The tool I’ll be using is called stylo, a package you can install for R, the programming language for statistical computing.
stylo itself is hosted on GitHub: https://github.com/computationalstylistics/stylo
Using stylo is not particularly difficult if you’re familiar with the R environment: install it, load it as a library and fire it up:
### stylo version: 0.7.4 ###
If you plan to cite this software (please do!), use the following reference:
Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
a package for computational text analysis. R Journal 8(1): 107-121.
To get full BibTeX entry, type: citation("stylo")
Here on my workstation running Ubuntu 20.04, that pops up the GUI so we can fill in parameters for comparison.
Under the Features tab, I’ve selected how the text is to be analysed: whole words taken one at a time (word 1-grams), with pronouns deleted, using the 100 most frequent words and no culling.
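For readers who want to see what that feature selection amounts to, here is a minimal Python sketch. It is not stylo's code: the function names are hypothetical, the pronoun list is an abbreviated stand-in (stylo ships its own), and the tokenizer is deliberately crude. It picks the most frequent non-pronoun words across a set of texts and then expresses each text as a list of relative frequencies for those words.

```python
import re
from collections import Counter

# Abbreviated, illustrative pronoun list (an assumption, not stylo's list).
PRONOUNS = {
    "i", "me", "my", "mine", "you", "your", "yours", "he", "him", "his",
    "she", "her", "hers", "it", "its", "we", "us", "our", "ours",
    "they", "them", "their", "theirs",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def most_frequent_words(texts, n=100):
    """Return the n most frequent non-pronoun words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in PRONOUNS)
    return [word for word, _ in counts.most_common(n)]

def mfw_profile(text, vocabulary):
    """Relative frequency of each vocabulary word within one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in vocabulary]
```

Each text then becomes one row of numbers, and those rows are what the statistical step compares.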
For the Statistics analysis, I’m running a Principal Component Analysis based on a correlation matrix, with Cosine Delta as the distance measure, as it’s the one recommended in the stylo HOWTO document, which you can find here: https://computationalstylistics.github.io/stylo_nutshell/
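Cosine Delta is commonly described as the cosine distance between z-scored word frequencies. The sketch below follows that description in plain Python. It is an illustration only, not stylo's implementation: the function names are hypothetical, and it assumes each text has already been turned into a list of relative word frequencies over the same vocabulary.

```python
import math

def zscores(profiles):
    """Standardise each feature (column) across the texts (rows).

    Requires at least two texts; uses the sample standard deviation.
    """
    n = len(profiles)
    columns = list(zip(*profiles))
    means = [sum(col) / n for col in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in profiles
    ]

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 0.0

def cosine_delta(profiles):
    """Pairwise distance matrix: cosine distance between z-scored rows."""
    z = zscores(profiles)
    return [[cosine_distance(a, b) for b in z] for a in z]
```

A distance near 0 means two texts use the chosen words in very similar proportions; the maximum possible value is 2.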
For those of you following along, place book.txt and qtext.txt in a directory named corpus inside your R working directory and press OK.
That’s interesting. According to this method, the text of my book and the text of Q’s drops were not written by the same person. In fact, they are as far apart as an English translation of Cicero is from The House At Pooh Corner. But wait, the New York Times article says that one of the teams analysed the text using not words but sequences of three characters. I’m not sure how that’s supposed to identify writers, since we normally use full words to express our ideas, but whatever. Let’s try that method on the two texts:
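As an aside for anyone who has not met character 3-grams, a minimal Python sketch of the idea follows. It is an assumption-laden illustration, not the code used by stylo or by either research team, and the function names are hypothetical. It slides a three-character window over the text and records how often each sequence occurs.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character sequences of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_profile(text, n=3):
    """Relative frequency of each character n-gram in the text."""
    grams = char_ngrams(text.lower(), n)
    counts = Counter(grams)
    total = len(grams) or 1  # avoid dividing by zero on very short input
    return {gram: count / total for gram, count in counts.items()}
```

For example, `char_ngrams("the cat")` yields `the`, `he `, `e c`, ` ca` and `cat`, so spaces and word boundaries count as features too. Now for stylo's verdict with character 3-grams as the features.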
Nope. Catch-22 meets The Gruffalo. What about a correlation matrix between the text of the book, the Q drops and a bunch of articles I wrote for Brainstorm magazine when I was a technology journalist there?
Nope. That doesn’t work either. The journalism pieces are tightly clustered in a particular style and would make my sub-editor very happy, since it was she who had the final say. But neither my day-to-day work nor the text of a book I wrote on the subject of Q is anywhere near the text of the original Q drops on 4chan.
The two teams whose work the New York Times relied on are either lying or have cherry-picked results to make me look guilty. I call upon all parties to share their results, their methodologies and the data they used to reach their conclusions. If they’re not willing, then they’re either lying or incompetent. They’re certainly not practising science or forensics.
I was permanently banned from Twitter in December 2020 before being reinstated in January 2023. I can’t even remember the reason. I had 30 000+ followers and about 30m views per month, roughly the same as a major US news network. One of the side effects of being banned is that you cannot access anything you posted before, even to back it up. But strangely enough, the New York Times and their tame forensics consultants could access my Twitter account. No restrictions. They must have, if they’ve been feeding my long discussion threads into stylometry tools. Isn’t that interesting? Big Tech has no problem locking users out of their own accounts for wrongthink but will happily give access to any newspaper that comes along.
The second interesting addendum is that when the original Q was first posting on 4chan, the New York Times felt the need to have a dig at him and mention the Q group in NASA by name:
Why did they bother if it was just some LARP?