In 2013, a debut crime novel called The Cuckoo’s Calling hit the shelves. Written by a supposed ex-military man named Robert Galbraith, it earned good reviews but modest sales. That is, until a stray tweet sparked a rumor: could Galbraith be a pseudonym for a much more famous author? Journalists and linguists scrambled, and the answer, confirmed in a matter of days, was a resounding yes. Robert Galbraith was none other than J.K. Rowling.
How did they prove it so quickly and definitively? The secret wasn’t magic; it was stylometry, a fascinating field where linguistics meets statistics.
What is a Linguistic Fingerprint?
Stylometry is, in essence, the science of linguistic fingerprinting. It’s based on the idea that every writer has a unique and largely unconscious style—a “writer’s voice” that can be measured and quantified. Just as your fingerprints have unique ridges and whorls, your writing has identifiable patterns in word choice, sentence structure, and even punctuation.
You might think you could disguise your style by writing about a different topic or consciously using bigger words. But stylometry focuses on the tiny, habitual choices we make without thinking. It’s not about what we say, but how we say it. These subconscious habits are hard to suppress consistently over a long text, which makes them a useful signature for attribution.
The Building Blocks of Style
So, what exactly do these algorithms measure? It’s not about looking for a single “tell”. Instead, stylometry analyzes hundreds of features simultaneously. Some of the most important include:
- Function Words: This is the cornerstone of classic stylometry. Function words are the grammatical glue of a language—words like the, a, of, in, on, by, and, but. We use them constantly and without conscious thought. An author might prefer “on” in situations where another might use “upon”, or favor “while” over “whilst”. These preferences are remarkably consistent for a given author.
- Word Frequencies: How often does an author use common words? Or, conversely, how rich and varied is their vocabulary? One common measure of richness is the type-token ratio: the number of distinct words divided by the total number of words.
- Sentence Length: What’s the average number of words per sentence? Does the author use a lot of variation, mixing short, punchy sentences with long, flowing ones?
- N-grams: This sounds complex, but it’s just about short sequences of adjacent items, usually words or characters. Computers can analyze the frequency of two-word pairs (bigrams) like “of the”, or three-word triplets (trigrams) like “in the middle”. Certain authors will have favorite phrases and constructions that show up as statistical spikes.
- Punctuation: Is the writer a fan of the semicolon? Do they use em-dashes for emphasis? How frequently do commas appear? These tiny choices add up to a larger stylistic profile.
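To make the list above concrete, here is a minimal Python sketch that measures a few of these features for a single text. The function name `style_features`, the short `FUNCTION_WORDS` list, and the sample sentence are all invented for illustration; real studies use much longer word lists and far more careful tokenization.

```python
import re
from collections import Counter

# A tiny illustrative list; real analyses use hundreds of function words.
FUNCTION_WORDS = ["the", "a", "of", "in", "on", "by", "and", "but", "upon", "while", "whilst"]

def style_features(text):
    """Return a few simple stylistic measurements for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Naive sentence split: ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    counts = Counter(words)
    # Word bigrams; for simplicity they may span sentence boundaries.
    bigrams = Counter(zip(words, words[1:]))
    n = len(words)
    return {
        "function_word_rates": {w: counts[w] / n for w in FUNCTION_WORDS},
        "type_token_ratio": len(counts) / n,      # distinct words / total words
        "mean_sentence_length": n / len(sentences),
        "top_bigrams": bigrams.most_common(3),
        "semicolons_per_100_words": 100 * text.count(";") / n,
        "commas_per_100_words": 100 * text.count(",") / n,
    }

# Made-up sample text, just to exercise the function.
sample = "The cat sat on the mat. The dog, of course, sat on the cat; the cat left."
features = style_features(sample)
for name, value in features.items():
    print(name, value)
```

On a text this short the numbers mean very little; these measurements only become stable over thousands of words.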
The Original Detective Story: The Federalist Papers
The most famous historical case in stylometry concerns documents from the very founding of the United States. Between 1787 and 1788, a series of 85 essays known as The Federalist Papers was published under the single pseudonym “Publius” to advocate for the ratification of the U.S. Constitution.
Authorship was clear for most of the papers: they were written by Alexander Hamilton, James Madison, and John Jay. However, a dozen essays were disputed, with both Hamilton and Madison claiming to be the author. The mystery remained unsolved for nearly 200 years.
Then, in the 1960s, statisticians Frederick Mosteller and David Wallace took on the case. They didn’t focus on the political arguments, which could be easily imitated. Instead, they focused on the frequency of function words. They discovered, for example, that Madison used the word “whilst” frequently, while Hamilton never did. Hamilton preferred “while”. Hamilton used “upon” far more often than Madison, who preferred “on”. By analyzing the rates of these and other humble words across the known and disputed papers, they concluded with near-certainty that all 12 disputed essays were written by James Madison. It was a landmark victory for statistical authorship attribution.
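The marker-word idea can be sketched in a few lines of Python. This is only a simplified illustration, not Mosteller and Wallace’s actual statistical method: it counts how often each marker word appears per 1,000 words and picks the candidate whose rates are closest. The helper names and the three snippets are invented for the example and are not quotations from the Federalist Papers.

```python
import re

MARKERS = ["upon", "on", "while", "whilst"]

def rates_per_thousand(text, markers=MARKERS):
    """Occurrences of each marker word per 1,000 words of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {m: 1000 * words.count(m) / len(words) for m in markers}

def closest(unknown_text, candidates):
    """Name of the candidate whose marker-word rates differ least in total."""
    target = rates_per_thousand(unknown_text)

    def distance(name):
        rates = rates_per_thousand(candidates[name])
        return sum(abs(target[m] - rates[m]) for m in MARKERS)

    return min(candidates, key=distance)

# Invented toy snippets (hypothetical writers, not real Federalist text).
candidates = {
    "writer_a": "Upon reflection the plan depends upon the states while the union waits.",
    "writer_b": "On reflection the plan depends on the states whilst the union waits.",
}
unknown = "The power rests on the people whilst the states look on."

print(closest(unknown, candidates))  # prints "writer_b"
```

With real essays you would want many thousands of words per author before trusting rates like these.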
From Shakespeare to Rowling: Modern Mysteries
Stylometry has since been applied to countless literary puzzles. It’s used to explore collaborations, such as identifying which parts of Shakespeare’s later plays, like Henry VIII, were likely co-authored by John Fletcher. It helped unmask journalist Joe Klein as the author of the 1996 novel Primary Colors, initially published by “Anonymous”.
But the J.K. Rowling case brought the technique into the 21st-century spotlight. When suspicions about Robert Galbraith arose, computer scientist Patrick Juola ran a computer analysis. He compared The Cuckoo’s Calling to Rowling’s first adult novel, The Casual Vacancy, as well as novels by established female crime writers like P.D. James and Val McDermid.
The analysis looked at the 100 most common words, character n-grams, and word lengths. The result was, in Juola’s words, a “slam dunk”. The stylistic distance between Galbraith and Rowling was minuscule, while the distance to other authors was vast. Her linguistic fingerprint was all over the book, and the mystery was solved.
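Character n-grams are easy to experiment with yourself. The sketch below is not a reconstruction of Juola’s analysis; it simply builds a profile of overlapping character sequences for each text and compares profiles with cosine similarity, one common choice among several. The function names, the default length of 4, and the sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=4):
    """Count overlapping character sequences of length n."""
    text = " ".join(text.lower().split())  # lowercase and normalize whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """1.0 for profiles with identical proportions, 0.0 for no shared n-grams."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented sample sentences, far too short for a real analysis.
text_a = "She walked slowly down the corridor, wondering what he knew."
text_b = "He walked quickly down the corridor, wondering what she knew."
text_c = "Quarterly revenue figures exceeded analyst expectations again."

profile_a, profile_b, profile_c = (char_ngrams(t) for t in (text_a, text_b, text_c))
print(cosine_similarity(profile_a, profile_b))
print(cosine_similarity(profile_a, profile_c))
```

A higher score means the two texts share more of the same character sequences in similar proportions.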
How Does the Math Work? (A Simple Glimpse)
While the underlying statistics can be complex, the concept is quite intuitive. Imagine a graph where each author is a dot. The position of the dot is determined by their unique combination of stylistic features. Authors with similar styles will cluster together, while those with different styles will be far apart.
When analyzing an anonymous text, the computer calculates its stylistic coordinates and places a new dot on the graph. The “author” is simply the known writer whose dot is closest to the new one. Methods like the “Delta procedure”, developed by John Burrows, excel at this by focusing on the most frequent words. Each word’s frequency is first standardized (converted to a z-score, so that extremely common words like “the” don’t drown out the rest), and Delta is then the average absolute difference between the anonymous text’s scores and a candidate author’s scores. The candidate with the smallest Delta is the best match.
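Here is a minimal Python sketch of a Delta-style comparison, under some simplifying assumptions: each candidate is represented by a single text, the word list is the `n_words` most frequent words across the known texts, and words whose frequency does not vary between candidates are skipped. The function names and the toy sentences are invented for illustration, and at least two known texts are needed to compute a standard deviation.

```python
import re
import statistics
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def burrows_delta(known_texts, unknown_text, n_words=4):
    """Return {candidate: Delta}; a lower Delta means a closer style."""
    # Vocabulary: the most frequent words across all known texts.
    corpus_counts = Counter()
    for text in known_texts.values():
        corpus_counts.update(tokenize(text))
    vocab = [word for word, _ in corpus_counts.most_common(n_words)]

    def rel_freqs(text):
        tokens = tokenize(text)
        counts = Counter(tokens)
        return [counts[w] / len(tokens) for w in vocab]

    known = {name: rel_freqs(text) for name, text in known_texts.items()}
    columns = list(zip(*known.values()))  # one column of frequencies per word
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]

    def z_scores(freqs):
        # Words with zero spread cannot tell candidates apart, so skip them.
        return [(f - m) / s for f, m, s in zip(freqs, means, stdevs) if s > 0]

    z_unknown = z_scores(rel_freqs(unknown_text))
    return {
        name: statistics.mean([abs(u - k) for u, k in zip(z_unknown, z_scores(freqs))])
        for name, freqs in known.items()
    }

# Invented toy texts; real use needs thousands of words per candidate.
known = {
    "author_a": "the cat sat upon the mat and the dog sat upon the rug",
    "author_b": "a cat sat on a mat while a dog sat on a rug",
    "author_c": "cats and dogs and birds and fish all sat by the door",
}
unknown = "the bird sat upon the fence and a fish sat upon the rock"

scores = burrows_delta(known, unknown)
print(min(scores, key=scores.get))  # prints "author_a"
```

In practice the word list runs to a hundred words or more, which is what lets many small preferences add up to a clear signal.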
More Than Just Unmasking Authors
The power of stylometry extends far beyond literary whodunits. It has become a vital tool in many fields:
- Forensic Linguistics: Law enforcement can use stylometry to link threatening emails, ransom notes, or fake confessions to a suspect by comparing the text to the suspect’s known writings.
- Historical Research: It can help determine the chronology of a philosopher’s writings (like Plato or Aristotle) by tracking how their style evolved over time.
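- Plagiarism Detection: Tools used by universities, like Turnitin, work mainly by matching student work against a massive database of existing texts. Stylometry offers a complementary check: a sudden shift in style within or between a student’s submissions can suggest that someone else wrote part of the work.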
Stylometry reveals a fundamental truth about our relationship with language. Our voice is more than just the words we choose; it’s a deep, intricate pattern woven into the very fabric of our expression. While we may don a mask or a pseudonym, our linguistic fingerprint remains, a unique and indelible signature that, with a little bit of math, can tell our story.