In a recent post here, I floated the idea that the unusual homophone distribution of the Zodiac Killer’s (solved) Z408 cipher may have arisen not conceptually (i.e. from a hitherto-unknown book on cryptography), but instead empirically (i.e. emerging from the properties of a specific text).
It’s certainly possible that he might have used his own (private) text to model his homophone distribution, in which case we probably have almost no chance of reconstructing it. However, I think it likely that he instead used the first few characters of an already existing public text (such as Moby Dick, the Book of Genesis, the Declaration of Independence, or whatever) to do this.
It’s a reasonable enough suggestion, I think: and moreover one that we can try to test to a reasonable degree.
A homophonic cipher key allocates a number of cipher shapes to individual plaintext letters, usually (but not always) in broad proportion to their frequency. So in a typical homophonic cipher key you would expect to see far more shapes for E (the most common letter in English) than for, say, Z or Q.
Though this is essentially the case for what we see in the Z408 cipher (particularly for the more frequent letters, ETAOINS), the numbers of homophones chosen for the less frequent letters seem somewhat idiosyncratic and arbitrary:
7 shapes – E
4 shapes – T A O I N S
3 shapes – L R
2 shapes – D F H
1 shape – B C G K M P U V W X Y
Did not appear: J Q Z
People have long searched for a primer or textbook on cryptography where the description of the alphabetic frequency distribution matches this, or even where the alphabetic frequency ordering (e.g. ETAOINSHRDLU etc) matches the order here, but in vain.
Designing a filter
The basic idea for the filter is easy enough:
* read in characters from the start of a passage (we’re only interested in alphabetic letters, folded to upper case, i.e. A-Z; everything else is skipped)
* if the instance count of that character is higher than the top of the desired range, then the test fails
* if the instance counts for all the characters are within the desired range at the same time, then the test passes
* else keep reading in more characters until the test terminates
As a side note: of all the Z408 homophones, only X appears exactly once in the Z408 ciphertext itself: but while it is conceivable that the Zodiac Killer might have allocated extra homophones for X, it does seem fairly unlikely.
The desired ranges for each of the characters would look like this (though feel free to adapt this if you disagree with the homophone counts listed above):
[7,7] – E
[4,4] – T A O I N S
[3,3] – L R
[2,2] – D F H
[0,1] – B C G K M P U V W Y J Q Z
[0,3] – X (to err on the side of safety)
Note that the single-shape letters have a slightly broader [0,1] range because we have no way of knowing whether or not they would have actually appeared in the original text.
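The steps and ranges above can be sketched in a few lines of Python. This is only a minimal sketch, not a definitive implementation: the names Z408_RANGES and match_length are my own, and it assumes that everything other than A-Z (spaces, digits, punctuation, accented letters) is simply skipped.

```python
# Desired (low, high) instance counts per letter, copied from the ranges above.
# The names Z408_RANGES and match_length are illustrative only.
Z408_RANGES = {"E": (7, 7), "X": (0, 3)}
Z408_RANGES.update({ch: (4, 4) for ch in "TAOINS"})
Z408_RANGES.update({ch: (3, 3) for ch in "LR"})
Z408_RANGES.update({ch: (2, 2) for ch in "DFH"})
Z408_RANGES.update({ch: (0, 1) for ch in "BCGKMPUVWYJQZ"})


def match_length(text, ranges=Z408_RANGES):
    """Return how many letters were read when the test passes, else None."""
    counts = dict.fromkeys(ranges, 0)
    # Number of letters that have not yet reached their minimum count.
    below_min = sum(1 for low, _ in ranges.values() if low > 0)
    letters_read = 0
    for raw in text:
        ch = raw.upper()
        if ch not in ranges:
            continue  # assumption: skip anything that is not A-Z
        letters_read += 1
        counts[ch] += 1
        low, high = ranges[ch]
        if counts[ch] > high:
            return None  # this letter overshot its range: the test fails
        if counts[ch] == low:
            below_min -= 1
        if below_min == 0:
            return letters_read  # every count is inside its range: pass
    return None  # ran out of text before all the minimums were reached
```

With these particular ranges the loop can never run for long: the lower bounds add up to 43 letters and the upper bounds to 59, so any text either passes or fails within its first 60 letters. As a purely artificial check, match_length("E"*7 + "TAOINS"*4 + "LR"*3 + "DFH"*2) returns 43, while match_length("E"*8) returns None.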
Here are two test texts that should both pass:
Which texts to try?
Since any text published before August 1969 could potentially be a match, it would make sense to look at all manner of texts, and possibly even at the first few lines of individual chapters of books (though I’d be a little surprised if that turned out to be the source). All the same, the filter is easy enough to write and to test (and should execute in a matter of microseconds), so the difficulty here lies mostly in getting hold of enough texts to try, rather than in the compute time as such.
Oddly, I don’t really have a solid feel for how often the filter will find a match: my gut instinct is that roughly one in a million English text comparisons will pass, but that’s just a guesstimate based on each letter having its own little bell-curve distribution, all of which have to match at the same time.
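One rough way to put a number on that gut instinct would be a small simulation: draw letters at random according to typical English letter frequencies and count how often a random stream passes. The sketch below does this under two loud assumptions: the frequency table holds only approximate, commonly quoted percentages, and each letter is drawn independently of the previous ones, which real text certainly does not do. The names ENGLISH_FREQ, RANGES, passes and estimate_pass_rate are my own, and the result should be treated as an order-of-magnitude check at best.

```python
import random

# Assumption: approximate relative frequencies (percent) of English letters.
ENGLISH_FREQ = {
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7, "S": 6.3,
    "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8, "U": 2.8, "M": 2.4,
    "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0, "P": 1.9, "B": 1.5, "V": 1.0,
    "K": 0.8, "J": 0.15, "X": 0.15, "Q": 0.1, "Z": 0.07,
}

# (low, high) ranges from the filter description.
RANGES = {"E": (7, 7), "X": (0, 3)}
for group, bounds in (("TAOINS", (4, 4)), ("LR", (3, 3)),
                      ("DFH", (2, 2)), ("BCGKMPUVWYJQZ", (0, 1))):
    for ch in group:
        RANGES[ch] = bounds


def passes(letters):
    """Apply the filter to a sequence of upper-case letters A-Z."""
    counts = dict.fromkeys(RANGES, 0)
    for ch in letters:
        counts[ch] += 1
        if counts[ch] > RANGES[ch][1]:
            return False  # overshot the top of this letter's range
        # No count is above its maximum here, so only minimums need checking.
        if all(counts[c] >= low for c, (low, _) in RANGES.items()):
            return True
    return False


def estimate_pass_rate(trials, seed=0):
    """Fraction of random letter streams that pass (independent draws)."""
    rng = random.Random(seed)
    alphabet = list(ENGLISH_FREQ)
    weights = list(ENGLISH_FREQ.values())
    hits = 0
    for _ in range(trials):
        # 60 letters always suffice: the upper bounds sum to 59.
        stream = rng.choices(alphabet, weights=weights, k=60)
        hits += passes(stream)
    return hits / trials
```

If the true rate really is somewhere near one in a million, then a call such as estimate_pass_rate(10_000_000) would be needed to see more than a handful of hits, so a small run returning zero says very little either way.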
So what do you think will match? “Catcher in the Rye” or “Moby Dick”? Place your bets! 😉