A few days ago, German cryptoblogger Klaus Schmeh mentioned a recent paper by Tom Juzek on the unsolved Z340 Zodiac Killer cipher. This first appeared in March/April 2018, but I was not aware of it before Klaus flagged it.

Juzek’s MSD metric

The metric Juzek uses to drive much of his argumentation is what he calls ‘MSD’ (“Mean Squared Distance”), which is simply the sum of the squares of the instance counts of each distinct bigram (or trigram), divided by the total number of bigram (or trigram) instances in the text.

As an example, the 13-letter text “AAAAAAAAAABCD” (ten As followed by BCD) is made up of twelve bigram instances: AA, AA, AA, AA, AA, AA, AA, AA, AA, AB, BC, and CD. Hence it contains 9 x AA, 1 x AB, 1 x BC, and 1 x CD, and so would have a bigram MSD of (9*9 + 1*1 + 1*1 + 1*1) / 12 = (84 / 12) = 7.00.

The same text contains eleven trigram instances: AAA, AAA, AAA, AAA, AAA, AAA, AAA, AAA, AAB, ABC, and BCD. Hence it contains 8 x AAA, 1 x AAB, 1 x ABC, and 1 x BCD, and so would have a trigram MSD of (8*8 + 1*1 + 1*1 + 1*1) / 11 = (67 / 11) = 6.09.
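The calculation described above can be sketched in a few lines of Python. This is only a minimal sketch of the metric as summarised in this post (squared counts of each distinct ngram, divided by the number of ngram instances); the function names `ngrams` and `msd` are my own labels, not anything taken from Juzek's paper.

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of overlapping n-grams in text (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def msd(text, n):
    """Sum of squared n-gram counts divided by the number of n-gram instances."""
    grams = ngrams(text, n)
    if not grams:
        raise ValueError("text is shorter than n")
    counts = Counter(grams)
    return sum(c * c for c in counts.values()) / len(grams)

# Toy check: "ABABAB" has bigrams AB, BA, AB, BA, AB,
# so the result is (3*3 + 2*2) / 5 = 2.6
print(msd("ABABAB", 2))
```

Note that a text of length L has L - n + 1 overlapping ngrams of order n, which is the denominator used here.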

However, Juzek quickly flags that this raw metric is not really good enough on its own:

The problem with the msd is that there are difficulties with comparing msd’s across data sets. This is because the length of a text influences the msd, as well as the length of a text’s character set. A 400 character cipher using 10 characters will see a different ngram distribution to a 100 character cipher using 40 characters.

Hence Juzek instead generates a “delta MSD”, which he defines as the difference between the ngram MSD of each ciphertext read horizontally (i.e. the generally presumed ‘correct’ symbol ordering) and the ngram MSD of its vertical transposition (i.e. every 17th character). This is to try to ‘normalize’ the raw MSD against a kind of statistically flattened version of the same.
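As a rough illustration of the delta MSD idea, here is a minimal sketch. It assumes that the ciphertext is a flat sequence of symbols laid out row by row in a grid of fixed width (17 for the Z340), and that the "vertical" reading goes down each column in turn, left to right. The names `read_by_columns` and `delta_msd` are hypothetical, and Juzek's own implementation may differ in its details.

```python
from collections import Counter

def msd(symbols, n):
    """Sum of squared n-gram counts divided by the number of n-gram instances."""
    grams = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    if not grams:
        raise ValueError("sequence is shorter than n")
    counts = Counter(grams)
    return sum(c * c for c in counts.values()) / len(grams)

def read_by_columns(symbols, width):
    """Reorder a row-major grid so it is read column by column (assumed layout)."""
    return [symbols[i]
            for col in range(width)
            for i in range(col, len(symbols), width)]

def delta_msd(symbols, n, width=17):
    """Horizontal MSD minus the MSD of the column-wise reading."""
    return msd(symbols, n) - msd(read_by_columns(symbols, width), n)

# Toy 3x3 grid ABC / ABC / ABC: horizontal bigram MSD is 22/8 = 2.75,
# column-wise reading AAABBBCCC gives 14/8 = 1.75, so the delta is 1.0
print(delta_msd(list("ABCABCABC"), 2, width=3))
```

Working on a list of symbols rather than a string means the same sketch applies to transcriptions where each cipher symbol is written as a multi-character token.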

Juzek then applies these two final metrics (bigram delta MSD and trigram delta MSD) to a number of real and fake ciphers, before concluding that the Z340 is quite unlike the Z408, and that it in fact presents more like fake ciphers than real ciphers.

What’s Wrong With This Picture?

Clearly, Juzek’s motivation for squaring ngram instance counts at all is to try to somehow ‘reward’ ngrams that are repeated in a given text being tested. Unfortunately, I think this is no more than a rather clunky and misleading way of looking at entropy / negentropy, which has a long-established and rigorous calculation procedure (and an enormous theoretical literature ranging across Computer Science and indeed Physics).
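To make the comparison with entropy concrete, here is a minimal sketch that computes the ngram Shannon entropy alongside the MSD as summarised in this post. One way to see the link: if N is the number of ngram instances and p_i the relative frequency of each distinct ngram, then MSD = N * sum(p_i^2), so the order-2 (collision) Rényi entropy is just log2(N / MSD). The function names are my own, and this is an illustration rather than a claim about how Juzek or anyone else computed their figures.

```python
import math
from collections import Counter

def ngram_counts(text, n):
    """Count overlapping n-grams; raises if the text is too short."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    if not counts:
        raise ValueError("text is shorter than n")
    return counts

def shannon_entropy(text, n):
    """Shannon entropy (in bits) of the n-gram distribution."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def msd(text, n):
    """Sum of squared n-gram counts divided by the number of instances."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return sum(c * c for c in counts.values()) / total

def collision_entropy(text, n):
    """Renyi entropy of order 2 (in bits): -log2(sum of p_i squared)."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return -math.log2(sum((c / total) ** 2 for c in counts.values()))

text = "ABABAB"
n_instances = len(text) - 2 + 1
# The two printed values agree up to floating-point rounding.
print(collision_entropy(text, 2), math.log2(n_instances / msd(text, 2)))
```

Because the raw MSD carries that factor of N, it grows with text length even when the underlying ngram distribution stays the same, which is the scaling problem Juzek himself flags.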

As a result, I think he may well have reinvented a perfectly round wheel in a somewhat square format: sorry, but I don’t think this is going to roll very far or very fast.

If the same calculations were repeated with ngram entropies of different orders, I think we might have something more interesting to work with here: but that’s already been done to death in the Zodiac Killer research world.

Moreover, the long-standing suggestion (which I think has a fair amount of evidential support) that the Z340 may well have been constructed in two distinct halves (Z170A and Z170B) would also mess with just about all of his arguments and conclusions. I’d much rather have seen that tested than Vigenere (it’s not a Vig, not even close).

Forward Context vs Backward Context?

As I was reading through Juzek’s paper, I was struck by a quite different question. If we are looking at an encrypted homophonic English ciphertext (a fairly reasonable assumption here), is there a notable difference between the left-context entropy (i.e. the information content of the text using the preceding letter as a context for predicting the next letter) and the right-context entropy (i.e. using the following letter as a context for predicting the previous letter)?

That is, might encrypted homophonic English ciphertexts have a distinctly asymmetrical statistical “fingerprint” that would give us confidence that this is indeed what we are looking at in the Z340? Perhaps this has already been calculated: if so, it’s not work that I’m aware of, so please leave a comment here to help broaden my mind. 🙂
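For anyone who wants to try this, here is a minimal sketch of one possible way to estimate the two quantities from adjacent symbol pairs, using H(next | previous) = H(pair) - H(previous) and H(previous | next) = H(pair) - H(next). The function names are hypothetical and this first-order estimator is an assumption on my part, not an established Zodiac research method.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a Counter of observations."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def context_entropies(symbols):
    """Return (left-context, right-context) conditional entropies in bits."""
    pairs = list(zip(symbols, symbols[1:]))
    if not pairs:
        raise ValueError("need at least two symbols")
    joint = entropy(Counter(pairs))
    prev_marginal = entropy(Counter(a for a, _ in pairs))
    next_marginal = entropy(Counter(b for _, b in pairs))
    left_context = joint - prev_marginal    # H(next | previous)
    right_context = joint - next_marginal   # H(previous | next)
    return left_context, right_context

left, right = context_entropies("AAAB")
print(left, right)
```

One caveat about this particular estimator: the two marginal distributions are taken over the same sequence with only the last or the first symbol dropped, so on a text of a few hundred symbols the two values are expected to be close. A real asymmetry, if there is one, would probably need a different statistic (for example, one that models the homophone cycling explicitly).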

5 thoughts on “Tom Juzek on the Zodiac Killer Ciphers…”

  1. Davidsch on May 7, 2018 at 11:54 am said:

    (..to death? strange choice of words my dear Watson)

    I did not read the paper, but of course the idea behind the delta is to find a deviation. I use it myself often. However, when the patterns have a high repeat rate, as in this example, it is better to use an edit distance method (Jaro-Winkler etc.) and play with that.

    All that talk about text entropies, was interesting about two decades ago, but currently, in my opinion, is really an old method to identify text, either short or long contents. It will only give an indication, but since most European languages have the same origin, in ciphered texts, it does not really tell you something specific enough.

  2. xplor on May 9, 2018 at 5:45 pm said:

    Will familial DNA testing hold the key to catching the Zodiac Killer ?

  3. @ xplor: I asked (several months ago) if it were possible to get at least an RNA study from the deceased man (the Somerton Man, who died on the beach below the town of Somerton). I was hoping for a possible identification of that dead man. Yes, the RNA sampling provided some info which basically sets off a whole lot of skeptical answers based on Somerton Man having been a relative of Thomas Jefferson’s huge historical archives.

    When it comes to trying to identify the “Zodiac Killer” you may get some answers from the archives of the California State Police (who also hold information from various persons who claim that the “Zodiac” was a member of their family).

    bd

  4. I’m wondering if it is at all possible that the z340 cipher may have some parts where, in the plain text, the order of the words is written correctly, left to right, but the spelling of the words is reversed. I know it might sound silly, as I know very little about ciphers, but it’s interesting that on line 10 of the cipher, DIEBY is written “EIDYB”.
    His later Halloween card made reference to “By Knife, By Gun, By Rope, By Fire”
    Just curious about what anyone might think.

  5. Hi, a bit late to the discussion.

    First, thanks for providing a platform for discussion.

    The results described in the blog post are a product of something that is more than just tinkering, but something that is also less than science. That should be kept in mind; it’s a hobby.

    That said, yes, you’re right, the d-msds are directly linked to information theoretic entropy. Both are an expression of order. Just, the d-msds don’t scale as well (but: the blog entry controls for the length of the ciphers, so that is somewhat mitigated).

    Also, you are also right that entropy has been applied to both z340 and z408.

    However, tmk, the focus has been on entropy of all characters, e.g. here [0]. And for this, the entropy values for z340 and z408 come out virtually the same.

    And even if one looks at the entropy of bigrams, then the entropy values come out very similar.

    This is what the blog post brings to the table: It looks at a measure of order based on ngrams and, critically, it establishes a meaningless baseline. The last bit is important, because even entropy is influenced by the length of a cipher and by the length of a cipher’s character set.

    I redid this with the entropy of ngrams compared to a meaningless baseline – and the results come out pretty much the same as in the blog post.

    So I would say that what the blog post puts forward is an accurate reflection of the underlying reality. That is, z408 shows a lot of structure / order, z340 does not.

    As to your point wrt directionality: Whether you go left to right or right to left will have no impact on a string’s entropy.

    Best,
    -sp
