Imagine discovering a hidden diary from a bygone era, filled with cryptic, nonsensical letters. Or consider the modern-day mystery of an anonymous author whose novel becomes a bestseller, sparking a worldwide guessing game about their true identity. These two scenarios, though centuries apart, are linked by a single, powerful idea: every system, whether a secret code or a human writer, has an unconscious, repeating rhythm. Finding that rhythm is the key to unlocking its secrets.
This is the story of the Kasiski examination, a 19th-century cryptographic breakthrough that cracked one of history’s most formidable ciphers. More than that, it’s the story of how the very same logic is used today to reveal the ghost in the machine: the author behind the anonymous text.
The “Unbreakable” Code That Wasn’t
For centuries, the king of ciphers was the polyalphabetic cipher. Unlike a simple substitution cipher (like a Caesar cipher, where ‘A’ always becomes ‘D’), a polyalphabetic cipher uses multiple substitution alphabets. The most famous of these is the Vigenère cipher, invented in the 16th century but not widely used until the 19th. For over 200 years, it was lauded as le chiffre indéchiffrable, the indecipherable cipher.
Its strength came from a simple keyword. Let’s say our keyword is LEMON. To encrypt the message “ATTACK AT DAWN”, you’d write the keyword repeatedly above it:
Keyword:   LEMONLEMONLE
Plaintext: ATTACKATDAWN
The first ‘A’ in “ATTACK” is encrypted using the ‘L’ alphabet, the first ‘T’ is encrypted using the ‘E’ alphabet, the second ‘T’ using the ‘M’ alphabet, and so on. This means the two ‘T’s in “ATTACK” become two different letters in the ciphertext. This complexity defeated the most common code-breaking tool of the era: frequency analysis. In English, ‘E’ is the most common letter, but in a Vigenère-encrypted text its encrypted form is spread across several different ciphertext letters, flattening the statistical fingerprint that frequency analysis depends on.
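The encryption step described above can be sketched in a few lines of Python. This is a minimal illustration, not a historical implementation: it assumes the message has already been reduced to uppercase A-Z with spaces removed, and the function name `vigenere_encrypt` is our own.

```python
from itertools import cycle

def vigenere_encrypt(plaintext: str, keyword: str) -> str:
    """Encrypt uppercase A-Z text; each keyword letter selects one Caesar shift."""
    out = []
    # cycle() repeats the keyword for as long as the plaintext lasts.
    for p, k in zip(plaintext, cycle(keyword)):
        shift = ord(k) - ord("A")
        out.append(chr((ord(p) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)

ciphertext = vigenere_encrypt("ATTACKATDAWN", "LEMON")
print(ciphertext)  # LXFOPVEFRNHR
```

Note how the two ‘T’s of “ATTACK” come out as ‘X’ and ‘F’: same plaintext letter, different keyword letter, different result.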
A Pattern in the Chaos: Kasiski’s Discovery
Enter Friedrich Kasiski, a Prussian infantry officer and cryptographer. In 1863, he published a book that detailed a devastatingly effective attack on the Vigenère cipher. He wasn’t the first to break it (Charles Babbage did so earlier but never published his work), but Kasiski was the one who revealed the method to the world.
His insight was deceptively simple: look for repeated sequences of letters in the ciphertext.
Why would repetitions occur in such a complex cipher? Kasiski realized that it happens whenever a repeated sequence in the original plaintext lines up with the same stretch of the repeating keyword.
Consider this example:
- Plaintext: THEENEMYWILLATTACKATTHEEASTWALL
- Keyword: CODE (length 4)
When we align them, something interesting happens with the word “THE”:
Keyword:   CODECODECODECODECODECODECODECOD
Plaintext: THEENEMYWILLATTACKATTHEEASTWALL
The first “THE” is encrypted using the keyword letters “COD”. The second “THE” starts 20 letters later, and because 20 is a multiple of the keyword length, it is encrypted using the exact same “COD” sequence from the keyword. As a result, both occurrences become the identical three-letter sequence “VVH” in the ciphertext. A codebreaker scanning the garbled text would see a repeating pattern, a crack in the cipher’s armor.
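To see the alignment effect concretely, here is a small Python sketch using the sentence THE ENEMY WILL ATTACK AT THE EAST WALL with the keyword CODE. In this wording the two occurrences of “THE” start 20 letters apart, a multiple of the keyword length 4. The helper names `vigenere_encrypt` and `find_all` are illustrative choices, and the code assumes uppercase A-Z input with spaces removed.

```python
from itertools import cycle

def vigenere_encrypt(plaintext, keyword):
    """Shift each plaintext letter by the matching (repeating) keyword letter."""
    return "".join(
        chr((ord(p) + ord(k) - 2 * ord("A")) % 26 + ord("A"))
        for p, k in zip(plaintext, cycle(keyword))
    )

def find_all(text, chunk):
    """Return every start index at which chunk occurs in text."""
    return [i for i in range(len(text) - len(chunk) + 1) if text.startswith(chunk, i)]

plaintext = "THEENEMYWILLATTACKATTHEEASTWALL"
ciphertext = vigenere_encrypt(plaintext, "CODE")

the_positions = find_all(plaintext, "THE")
print(the_positions)                                   # [0, 20]
print([ciphertext[i:i + 3] for i in the_positions])    # ['VVH', 'VVH']
```

Had the second “THE” started at a position that is not a multiple of 4, it would have met a different slice of the keyword and the two ciphertext chunks would differ.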
How the Kasiski Method Works
Kasiski turned this observation into a methodical process:
- Find repetitions: Scan the ciphertext for repeated strings of three or more characters.
- Measure the distance: For each repeated string, count the number of characters between the start of its first appearance and the start of its second.
- Find the common factor: When a repetition comes from repeated plaintext, the distance between the identical segments must be a multiple of the keyword’s length. By collecting the distances for several different repeated strings, you can look for the factors they share. The factor that divides the most distances is very likely the length of the secret keyword (a few repetitions arise by coincidence, so not every distance will fit).
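The three steps above can be sketched roughly as follows. This is one simple way to organize the bookkeeping, not Kasiski’s own procedure: the function names, the `max_len` cutoff, and the list `sample_distances` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from functools import reduce
from math import gcd

def repeated_distances(ciphertext, min_len=3):
    """Distances between successive occurrences of every repeated chunk."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - min_len + 1):
        positions[ciphertext[i:i + min_len]].append(i)
    distances = []
    for starts in positions.values():
        distances.extend(later - earlier for earlier, later in zip(starts, starts[1:]))
    return distances

def candidate_key_lengths(distances, max_len=20):
    """Count how many distances each candidate length divides, most votes first."""
    votes = Counter()
    for d in distances:
        for n in range(2, max_len + 1):
            if d % n == 0:
                votes[n] += 1
    return votes.most_common()

# Ciphertext of THEENEMYWILLATTACKATTHEEASTWALL under the keyword CODE.
print(repeated_distances("VVHIPSPCYWOPCHWEEYDXVVHICGWACZO"))  # [20, 20]

# Hypothetical distances, as might be gathered from a longer message.
sample_distances = [20, 36, 8]
print(candidate_key_lengths(sample_distances)[:2])  # [(2, 3), (4, 3)]
print(reduce(gcd, sample_distances))                # 4
```

The factor 2 ties with 4 because every multiple of 4 is also even, so small factors need to be read with some judgment. The plain greatest common divisor settles on 4 here, but a single coincidental repetition would spoil it, which is why counting votes per factor tends to be more forgiving on real ciphertext.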
Once you know the keyword is, say, 4 letters long, the “unbreakable” cipher collapses. You can split the ciphertext into four separate columns, each of which is just a simple Caesar cipher. From there, old-fashioned frequency analysis makes quick work of the rest.
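A rough sketch of that final stage, under a deliberately crude assumption: in each column, the most frequent ciphertext letter is taken to stand for plaintext ‘E’. That guess only becomes reliable with a reasonably long message, and a more careful attack would compare each column’s whole letter distribution against English. The function names here are our own.

```python
from collections import Counter

def split_columns(ciphertext, key_len):
    """Column i holds every letter that was encrypted with keyword letter i."""
    return [ciphertext[i::key_len] for i in range(key_len)]

def guess_shift(column):
    """Crude heuristic: assume the most frequent letter in the column is an encrypted 'E'."""
    most_common_letter, _ = Counter(column).most_common(1)[0]
    return (ord(most_common_letter) - ord("E")) % 26

def recover_key(ciphertext, key_len):
    """Guess one keyword letter per column and join them into a candidate keyword."""
    return "".join(
        chr(guess_shift(column) + ord("A"))
        for column in split_columns(ciphertext, key_len)
    )

print(split_columns("ABCDEFGH", 4))  # ['AE', 'BF', 'CG', 'DH']
```

Each column is a Caesar cipher because all of its letters were shifted by the same keyword letter, so a shift of 3 in a column corresponds to the keyword letter ‘D’.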
The Linguistic Leap: From Keywords to Authorial Voice
So, what does a 19th-century military cipher have to do with linguistics and identifying anonymous authors? Everything. The underlying principle is exactly the same: unconscious, repeated patterns can betray a hidden system.
In the Vigenère cipher, the hidden system is the keyword. In writing, the hidden system is an author’s unique, ingrained stylistic habits: their “authorial fingerprint.”
This is the field of stylometry, the statistical analysis of literary style. Just as you have a unique fingerprint or signature, you have a unique “stylo” composed of countless unconscious choices you make when you write.
Stylometry: Finding the Ghost in the Machine
An author’s fingerprint isn’t about using big, fancy words. In fact, it’s often the opposite. The most reliable markers are the small, functional words and patterns we use without a second thought:
- Function Words: How often do you use “of”, “in”, “on”, “but”, or “and”? The ratios are surprisingly consistent per author.
- N-grams: These are sequences of N items. A 2-gram (or bigram) is a two-word pair, like “of the” or “as a”. A 3-gram (trigram) is a three-word set, like “as a matter.” We all have our pet phrases and habitual word pairings.
- Sentence Structure: Do you prefer short, punchy sentences or long, flowing ones? Do you often start sentences with conjunctions?
- Punctuation Quirks: Your use of commas, semicolons, or the em dash can be as telling as your word choice.
No single trait can identify an author. But when a computer analyzes dozens or hundreds of these features across a large body of text, it can build a remarkably accurate statistical model of their style.
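As a rough illustration of how a few of these features might be turned into numbers, here is a minimal Python sketch. The word list `FUNCTION_WORDS`, the tokenizing regular expression, and the function names are all assumptions chosen for brevity; real stylometric tools use far larger feature sets and much more text.

```python
import re
from collections import Counter

# A tiny illustrative subset; practical word lists are much longer.
FUNCTION_WORDS = ["of", "in", "on", "but", "and", "the", "a", "to"]

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def function_word_profile(text):
    """Share of all tokens taken up by each function word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {word: counts[word] / total for word in FUNCTION_WORDS}

def word_ngrams(text, n=2):
    """Count every run of n consecutive words."""
    tokens = tokenize(text)
    return Counter(zip(*(tokens[i:] for i in range(n))))

def mean_sentence_length(text):
    """Average number of words per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

sample = "The cat sat on the mat. The dog sat on the log, but the cat did not move."
print(function_word_profile(sample)["the"])   # 5 of 18 tokens
print(word_ngrams(sample)[("on", "the")])     # 2
print(mean_sentence_length(sample))           # 9.0
```

On a sample this short the numbers mean little; the point is only that each habit in the list above can be reduced to a count or a ratio that a program can compare across texts.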
Unmasking Authors in the Digital Age
The most famous modern application of these principles, a digital Kasiski examination, was the unmasking of Robert Galbraith, the author of the 2013 crime novel The Cuckoo’s Calling.
When speculation arose, researchers Patrick Juola and Peter Millican ran stylometric analyses. They converted the book into a set of statistical data, focusing on features like word length frequency and, most importantly, the frequency of the 100 most common words. They compared Galbraith’s “fingerprint” to those of other suspected authors and, of course, to J.K. Rowling.
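The comparison step can be pictured with a toy sketch like the one below. To be clear, this is not the procedure Juola or Millican actually ran: it simply builds a vocabulary from the most common words in the candidate texts and picks the candidate whose relative word frequencies sit closest to the disputed text. The distance measure (mean absolute difference), the function names, and the tiny sample texts are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def relative_freqs(text, vocabulary):
    """Share of the text's tokens taken up by each vocabulary word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]

def profile_distance(text_a, text_b, vocabulary):
    """Mean absolute difference between two relative-frequency profiles."""
    freqs_a = relative_freqs(text_a, vocabulary)
    freqs_b = relative_freqs(text_b, vocabulary)
    return sum(abs(x - y) for x, y in zip(freqs_a, freqs_b)) / (len(vocabulary) or 1)

def closest_candidate(disputed, candidates, top_n=100):
    """Name of the candidate whose profile is nearest to the disputed text."""
    pooled = Counter()
    for text in candidates.values():
        pooled.update(tokenize(text))
    vocabulary = [word for word, _ in pooled.most_common(top_n)]
    return min(
        candidates,
        key=lambda name: profile_distance(disputed, candidates[name], vocabulary),
    )

# Hypothetical toy texts; a real comparison needs whole books per candidate.
candidates = {
    "author_a": "the cat and the dog and the bird",
    "author_b": "a cat but a dog but a bird",
}
disputed = "the fox and the hen and the owl"
print(closest_candidate(disputed, candidates))  # author_a
```

The disputed text shares no nouns with either candidate, yet it lands next to author_a purely on its use of “the” and “and”, which is the sense in which small function words carry the signal.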
The results pointed clearly in one direction. Of the candidates tested, the patterns in The Cuckoo’s Calling matched Rowling’s other work most closely. The repeated, unconscious “keyword” of her writing style gave her away. Similar techniques helped identify Joe Klein as the author of Primary Colors and even played a role in pinpointing Ted Kaczynski as the Unabomber through analysis of his manifesto.
The Enduring Legacy of a Pattern
Friedrich Kasiski’s goal was to break military codes by spotting patterns in chaos. He could never have imagined that 150 years later, computational linguists would be using his core logic to solve literary puzzles.
The Kasiski method reveals a profound truth about communication: we all operate on a hidden keyword. For the cryptographer, it was a word like LEMON or CODE. For each of us, it’s the sum of our linguistic habits, a personal rhythm that embeds itself in everything we write. In an age of digital text and powerful algorithms, true anonymity has become the real chiffre indéchiffrable, the truly unbreakable code.