In the wake of Dave Oranchak’s epic crack of the Zodiac Killer’s Z340 cipher, which other unsolved ciphers might get cracked in 2021?

For me, the way the Z340 was solved highlighted a number of issues:

  • It seems very likely to me that other long-standing cipher mysteries will also require collaboration between entirely different kinds of researcher
  • Hence I suspect that many are beyond the FBI’s in-house capabilities, and it will need to find a new way to approach these if it wants them cracked
  • The whole Big Data thing is starting to open some long-closed doors

With these in mind, here’s my list of what might get cracked next:

Scorpion Ciphers

The Scorpion ciphers were sent to America’s Most Wanted host John Walsh from 1991 onwards: we have copies of S1 and S5, but the rest are in the hands of the FBI. As you’d expect, I’ve blogged about these many times, e.g. here, here, here, and here. I also created a related set of seven cipher challenges, of which only one has been solved (by Louie Helm) so far.

To be honest, I fail to understand why the FBI hasn’t yet released the other Scorpion Ciphers. These are the grist the Oranchak code-cracking mill is looking for: homophonic ciphers, underlying patterns, Big Data, etc.

Nick’s rating for a 2021 crack: 8/10 if the FBI releases the rest, else 2/10

Beale Ciphers

Even if I don’t happen to believe a measly word of the Beale Papers, I still think that the Beale Ciphers themselves are probably genuine. These use homophonic ciphers (albeit where the unbroken B1 and B3 ciphers use a system that is slightly different from the one used in the broken B2 cipher).

Because we already have the hugely improbable Gillogly / Hammer strings to work with (which would seem to be the ‘tell’ analogous to the Z340’s 19-repeat behaviour), we almost certainly don’t need to find a different book.

Given that Virginia is Dave Oranchak’s stamping ground, I wouldn’t be surprised if the redoubtable Mr O has already had a long, hard look at the Beale Ciphers. So… we’ll see what 2021 has to bring.

Nick’s rating for a 2021 crack: 2/10

Paul Rubin’s Cryptograms

A curious cryptogram was found taped to the chest of Paul Emanuel Rubin, an 18-year-old chemistry student found dead from cyanide poisoning near Philadelphia Airport in January 1953. As usual, I’ve blogged about this a fair few times, e.g. here, here, here and here.

There’s a good scan of the cryptogram on my Cipher Foundation page here; there’s a very detailed account in Craig Bauer’s “Unsolved!”; and the 142-page FBI file on Paul Rubin is here.

The ‘trick’ behind the cryptogram appears to be to use a different cipher key for each line. Specifically, the first few lines appear to be a kind of “Trithemian Typewriter” cipher, where every other letter (or some such pattern) is enciphered using a substitution cipher, and where the letters in between are filled in to make these look like words. This is, I believe, the reason we can see words like “Dulles” and “Conant” peeking through the mess of “astereantol” and “magleagna” gibberish.

Right now, I’m wondering whether we might be able to iterate through thousands of possible Trithemian schemes to crack each individual line (e.g. lines 4 and 5 appear to share the same cipher key number).

The cipher keys appear to use security by obscurity (& terseness), so I suspect that these may well be defeatable. Definitely one to consider.

Nick’s rating for a 2021 crack: 4/10

Who was The Zodiac Killer?

Even if the Z340 plaintext failed to cast any light on his identity (as I certainly expected), surely a DNA attack must now be on the cards?

I’d have thought that the relatively recent (2018) success in identifying Joseph James De Angelo as the Golden State Killer must surely mean that the Zodiac Killer’s DNA is next in line in the forensic queue.

To my eyes, the murder of Paul Stine seems to have been the least premeditated of all the Zodiac Killer’s attacks, so I would have expected the crime-scene artifacts to have been a treasure trove of DNA evidence. But there are plenty of other claims for Zodiac DNA, so what do I know?

Anyway, I have no real doubt that there are 5 or 6 documentaries currently in production for 2021 release that are all racing to use DNA to GEDmatch the bejasus out of the Zodiac Killer. I guess we shall see what they find…

Nick’s rating for a 2021 breakthrough: 7/10 with DNA, else 0/10

Who Was The Somerton Man?

2021 may finally see the exhumation Derek Abbott has been pushing for for so long; plus the start of a worldwide DNA scavenger hunt to identify the unidentified corpse found on Somerton Beach on 1st December 1948.

But after all that, will the mysterious man turn out to be Robin McMahon Thomson’s missing father; or a shape-shifting Russian spy; or a Melbourne crim whom everybody suddenly wanted to forget they ever met?

All the same, even if we do get a name and a DOB etc, will that be enough to end all the shoddy melodrama around the case? Errrm… probably not. 🙁

For what it’s worth, I would have thought that Robin’s father’s surname was almost certainly (Nick shudders at the obviousness) McMahon. I also wouldn’t like to bet against a Dr McMahon in Sydney (e.g. the surgeon Edward Gerard McMahon, though I expect there are others), but feel free to enlighten me why you think McMahon was actually a family name etc etc.

Nick’s rating for a 2021 breakthrough: 8/10 with an exhumation, else 1/10

From 1991 onwards, John Walsh – the well-known host of “America’s Most Wanted” – received a series of ominous and threatening letters signed “SCORPION”. Some of these contained a series of Zodiac-style homophonic ciphers: to date, only two of these have been released. Unsurprisingly, these are known as the “Scorpion Ciphers”: but none has yet been cracked.

However, there’s reason to believe (as I pointed out in 2017) that there are some specific regularities with the S1 (‘Scorpion #1’) ciphertext and even more so with the S5 ciphertext that we might be able to exploit. Even though both ciphers at first sight resemble the Zodiac Killer’s ciphers, the Scorpion’s seem to cycle between specific sets of homophonic symbols. In the case of the S5 ciphertext, the cycling seems particularly rigid, in that it cycles between 16 alphabets.

Cryptologically, the problem was that ‘traditional’ homophonic solvers (such as Jarlve’s excellent AZDecrypt) have no way to include extra cipher system constraints, regardless of how stringent they may be. For example, it would be a reasonable hypothesis that S5 uses only a single alphabet for each of its 16 columns: but this is not something that any current solver could use.

After a fair bit of thought, I came up with the idea of putting out a set of challenge ciphers using a completely rigid cycling homophonic cipher, to try to spark interest in solving this class of ciphertext. And so, also back in 2017, I posted up a page containing seven constrained homophonic challenge ciphers. Despite a shockingly high bounty of £10 being on offer for the best solve by the end of 2017, nobody managed to grab my cash.

http://ciphermysteries.com/wp-content/uploads/sites/6/2017/06/smiley-10-pound-note.jpg

Had I made even the longest challenge cipher far too difficult? Would nobody ever solve these? Despite my doubts, I remained reasonably confident some clever person would find a way in, sooner or later…

2020: Enter Louie Helm…

And so it was with great delight that I received an email this morning from Louie Helm, asking me to check his solution. As a recap, my challenge cipher #1 (neatly arranged into its five columns) was:

121,213,310,406,516,
108,200,323,416,513,
112,208,308,409,515,
102,216,309,425,509,
114,215,309,417,507,
102,201,323,401,517,
111,200,306,408,500,
113,203,313,407,512,
103,223,313,403,511,
119,213,316,416,511,
102,204,324,418,517,
120,203,324,407,516,
105,209,312,401,504,
117,208,310,408,500,
113,203,301,425,513,
115,201,313,408,515,
115,214,308,406,501,
122,204,322,408,509,
114,209,305,412,504,
117,213,316,402,509,
100,200,310,423,513,
100,214,320,419,509,
114,209,309,419,520,
101,200,320,416,518,
120,211,313,403,509,
103,207,313,421,513,
107,209,305,407,523,
115,224,313,416,508,
102,203,306,416,514,
107,200,310,401,509,
103,212,324,

Louie’s claimed plaintext was:

THEOBJECTOFMYPROP
OSEDWORKONCYPHERI
SNOTEXACTLYWHATYO
USUPPOSEBUTMYTIME
ISNOWSOENTIRELYOC
CUPIEDTHATIHAVEBE
ENOBLIGEDTOGIVEIT
UPATLEASTFORTHENE
XTTWOORTHREEYEARS

…which was completely correct! Fantastic work, well done!!! 🙂
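For anyone who wants to check a claimed plaintext against the stated cipher system themselves, here is a minimal sketch. It only tests consistency: every token must always stand for the same letter, and within each of the five columns every letter must always use the same token. The function name `is_consistent` is made up for illustration, and this is not how I (or Louie) actually checked the solution.

```python
# Challenge cipher #1, copied from the rows above (reading order).
CIPHER_1 = """
121,213,310,406,516, 108,200,323,416,513,
112,208,308,409,515, 102,216,309,425,509,
114,215,309,417,507, 102,201,323,401,517,
111,200,306,408,500, 113,203,313,407,512,
103,223,313,403,511, 119,213,316,416,511,
102,204,324,418,517, 120,203,324,407,516,
105,209,312,401,504, 117,208,310,408,500,
113,203,301,425,513, 115,201,313,408,515,
115,214,308,406,501, 122,204,322,408,509,
114,209,305,412,504, 117,213,316,402,509,
100,200,310,423,513, 100,214,320,419,509,
114,209,309,419,520, 101,200,320,416,518,
120,211,313,403,509, 103,207,313,421,513,
107,209,305,407,523, 115,224,313,416,508,
102,203,306,416,514, 107,200,310,401,509,
103,212,324,
"""

# Louie's claimed plaintext, copied from the nine lines above.
PLAINTEXT = (
    "THEOBJECTOFMYPROP" "OSEDWORKONCYPHERI" "SNOTEXACTLYWHATYO"
    "USUPPOSEBUTMYTIME" "ISNOWSOENTIRELYOC" "CUPIEDTHATIHAVEBE"
    "ENOBLIGEDTOGIVEIT" "UPATLEASTFORTHENE" "XTTWOORTHREEYEARS"
)

# int() tolerates the surrounding spaces and newlines left by the split.
TOKENS = [int(t) for t in CIPHER_1.split(",") if t.strip()]


def is_consistent(tokens, plaintext, columns=5):
    """True if the pairing respects a strictly cycling homophonic key."""
    if len(tokens) != len(plaintext):
        return False
    token_to_letter = {}   # token -> the one letter it may stand for
    letter_to_token = {}   # (column, letter) -> the one token allowed
    for i, (tok, ch) in enumerate(zip(tokens, plaintext)):
        col = i % columns
        if token_to_letter.setdefault(tok, ch) != ch:
            return False
        if letter_to_token.setdefault((col, ch), tok) != tok:
            return False
    return True


print(len(TOKENS), is_consistent(TOKENS, PLAINTEXT))
```

A consistent pairing is necessary but not sufficient: plenty of nonsense strings would also pass, so the readability of the English is still what makes the solution convincing.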

(Extra crypto brownie points on offer for anyone who recognizes the – admittedly somewhat obscure – source of the quotation.)

Crypto-fans may well recognize Louie’s name as having been (along with Jarlve) one half of the recent pair of solvers of Klaus Schmeh’s 1000-bigram (and then even harder 750-bigram) challenge.

So: one last time – well done, Louie, you rock. 😉

So… How Did Louie Do It?

This is, of course, the interesting question: and Louie answered it in a forum post on zodiackillersite.com on 02 Jan 2020, which I can do no better than simply quote in full:

I solved it with AZdecrypt v1.17 using my newest 8-gram model released a few days ago on Christmas. The only modification I made was adding 15 lines of code to restrict the solver to only use one homophone for each of the five columns. The solve for cipher #1 succeeded, but the way I modified the solver to do it is very inelegant since it can quickly lock high-scoring letters into place and then deprives the hill climber of further opportunities to test them in other arrangements. A more well-tuned version of this general solution would merely penalize (but not forbid) repeated letters in each column. This would allow the solver to evolve through a less jagged solution landscape and then still eventually arrive at a 1-homophone/column solution in the end. I predict this sort of modification would likely solve cipher #2 and beyond.

Although I used the column constraint (and it appears to have been necessary for a quick solve), the larger story here is probably the recent improvements to AZdecrypt and the release of 8-gram models in 2019.

For instance, if you simply ignore Nick’s constraint and use Jarl’s state of the art solver + his best n-gram file from 2018, you would have needed an 8 word crib to quickly decrypt challenge cipher #1:

AZdecrypt v1.14 + 7-gram jarl reddit: THEOBJECTOFMYPROPOSEDWORKONCYPHER

But in 2019, Jarl moved away from his IoC-based solver to a more capable entropy-based one. This alone drops the required crib needed to solve the cipher down to 2-3 words:

AZdecrypt v1.17 +
6-gram jarl reddit: THEOBJECT — PR — CYPHER
7-gram jarl reddit: THEOBJECT — PR

And doing the same exercise with the n-gram models I’ve released during 2019 shows they progressed from needing an 8 word crib — down to just a 5 letter crib:

AZdecrypt v1.17 +
6-gram v2 (May 2019): (no cribs sufficient)
6-gram v3 (Jun 2019): THEOBJECTOFMYPROPOSEDWORKONCYPHER — ~90% correct solve
6-gram v4 (Oct 2019): THEOBJECTOFMYPROPOSEDWORKONCYPHER — 100% correct
6-gram v5 (Dec 2019): THEOBJ — PR — CYPHER

7-gram v2 (May 2019): THEOBJECT — PR — WORK
7-gram v3 (Jun 2019): THEOBJECTOF — WORK
7-gram v4 (Oct 2019): THEOBJECT — WORK
7-gram v5 (Dec 2019): THEOBJ — WORK

8-gram v2 (May 2019): THEOBJECT — WORK — CYPHER
8-gram v3 (Jun 2019): THEOBJ — WORK — YPH
8-gram v4 (Oct 2019): THEOBJ
8-gram v5 (Dec 2019): HEOBJ

Note: These are just the cribs I got to work in under a minute using a completely unmodified version of AZdecrypt. It’s quite possible that any of us using either of the last two versions of AZdecrypt + beijinghouse 8-gram files could have solved this since Oct 2019 simply by letting it run long enough.

So the real story here seems to be that to crack my first challenge cipher, the three things that were necessary were not only Louie Helm’s tweaks to AZDecrypt to exploit the column constraints, but also Jarlve’s huge improvements to AZDecrypt’s homophonic solver during 2019, along with Louie’s now very extensive 8-gram files.
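To make Louie’s suggested refinement (penalize, but don’t forbid, repeated letters in each column) a little more concrete, here is a rough sketch of what such a soft penalty might look like. This is emphatically not AZdecrypt’s code, nor Louie’s 15-line modification: the names `column_penalty` and `penalized_score`, the `weight` parameter, and the idea of simply subtracting the penalty from an n-gram score are all my assumptions. The only thing taken from the challenge itself is that a token’s hundreds digit gives its column.

```python
from collections import Counter


def column_penalty(key, weight=1.0):
    """Penalty for a candidate key (a dict mapping token -> letter).

    For each (column, letter) pair, every token beyond the first that is
    assigned that letter costs `weight`. A key with exactly one homophone
    per letter per column therefore scores zero.
    """
    # Assumption: tokens are numbered as in the challenge CSVs, so that
    # token // 100 identifies the column alphabet (1..5).
    counts = Counter((tok // 100, letter) for tok, letter in key.items())
    extra = sum(n - 1 for n in counts.values())
    return weight * extra


def penalized_score(ngram_score, key, weight=1.0):
    """Combine a (hypothetical) n-gram score with the soft column penalty."""
    return ngram_score - column_penalty(key, weight)


# Two tokens in column 1 both set to E: allowed, but it costs one penalty unit.
print(column_penalty({100: "E", 101: "E", 200: "E"}))
```

In a hill climber, the weight could presumably start small and grow over time, so that the search roams freely early on and is only gradually squeezed towards a one-homophone-per-column key.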

I think this is a great result for Louie (and for Jarl too!), and I have nothing but admiration and applause for the pair of them. Rock and roll, guys!

The (Inevitable) Crypto Punchline…

Of course, challenge cipher #1 was (numerically) the easiest one, in that it was the longest of the set. So the big test will be to see how far through the list of challenge ciphers Louie Helm’s approach will (admittedly with a bit of tidying up) now be able to reach.

Interestingly, Louie immediately noted that my challenge cipher #2 presents some obvious-looking crypto weaknesses:

  • the 1st and 10th lines (of five symbols each) are identical
  • the 4th and 19th lines are also identical, and share three consecutive symbols with line #22.

He then speculates that it might be worth attacking these patterns in cipher #2 using common words or phrases such as “THERE” / “THOSE” / “I HAVE” / “IN THE” / “IS THE” / “IT WAS”.
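Spotting this kind of weakness is easy to automate. The sketch below lists rows of a five-column ciphertext that are identical: under a strictly cycling key, identical rows must encipher the same five plaintext letters, which is what makes five-letter cribs worth trying. The rows in the example are made up for illustration (they are not challenge cipher #2), and `identical_rows` is a hypothetical helper name.

```python
from collections import defaultdict


def identical_rows(rows):
    """Return groups of 1-based row numbers whose token rows are identical."""
    seen = defaultdict(list)
    for number, row in enumerate(rows, start=1):
        seen[tuple(row)].append(number)
    return [numbers for numbers in seen.values() if len(numbers) > 1]


# Made-up toy rows: rows 1 and 3 are the same five tokens.
toy_rows = [
    [101, 202, 303, 404, 505],
    [110, 211, 312, 413, 514],
    [101, 202, 303, 404, 505],
]
print(identical_rows(toy_rows))
```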

Personally, I haven’t looked at the plaintexts since I enciphered them two and a half years ago (and I have no intention of doing so until such time as a proposed decryption arrives here), so I’m not going to be much help. 😉

The only information I’d add is that I took each of the seven short texts from completely different places: so knowing the source of “The object of my proposed work on cypher…” shouldn’t directly assist with the others.

But what I want to now say is: good luck, everyone! The game is afoot!

The story of the ‘Scorpion’ letters to John Walsh, host of “America’s Most Wanted” and (more recently) “The Hunt with John Walsh”, is now reasonably well known. From 1991, Walsh received a string of threatening letters from someone signing themselves “SCORPION”, and also containing cryptograms. Since 2007, two of these cryptograms (“S1” and “S5”) have been released by the FBI: however, none has yet been solved.

The Scorpion also wrote:

I now realize with many hundreds of hours of mindracking experimentation with my complex ciphers that my first one that I sent you was comparatively simple to my second, third, fourth, and now temporarily final cryptograph system. I have been encoding useful information for your use and have done it fairly, since all of my ciphers can be decoded simply, once the limited patterns and systems are discovered.

I’ve blogged before about how the S5 cryptogram (arranged as 15 rows of 12 symbols each) only ever has repeats where the distance between repeated symbols is a multiple of 16, suggesting that it may well be composed of 16 strictly cycling cipher alphabets. I similarly suggested that S1 appeared to have repeats largely centred around multiples of 5, though this distance was far less solid.
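The repeat-distance test is simple to run on any transcription. Here is a minimal sketch, assuming the ciphertext has been transcribed into a flat list of symbols in reading order (the S5 transcription itself isn’t reproduced here, so the example uses a made-up cycling sequence). The function names are my own, and note that the greatest common divisor of the distances can come out as a multiple of the true cycle length if the text is short.

```python
from collections import defaultdict
from functools import reduce
from math import gcd


def repeat_distances(symbols):
    """Distances between consecutive occurrences of each repeated symbol."""
    positions = defaultdict(list)
    for i, sym in enumerate(symbols):
        positions[sym].append(i)
    distances = []
    for pos in positions.values():
        distances.extend(b - a for a, b in zip(pos, pos[1:]))
    return distances


def candidate_period(symbols):
    """GCD of all repeat distances, or None if nothing repeats."""
    distances = repeat_distances(symbols)
    return reduce(gcd, distances) if distances else None


# Made-up example: each letter is tagged with one of 16 cycling alphabets.
toy = [(i % 16, ch) for i, ch in enumerate("THEQUICKBROWNFOX" * 4)]
print(candidate_period(toy))
```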

Here’s what S1 (the first Scorpion cryptogram) looks like:

To make some kind of organizational sense of this, I tried to follow the basic pattern laid down by the S5 ciphertext, by:
* assigning symbols to five cycling columns
* mostly resetting these at the leftmost column of ten
* assuming that the encipherer’s first cipher system usage wasn’t as disciplined as his later (far more complex) efforts.

Here, you should be able to see all the same symbols as S1 (and in the same order), but assigned to five columns, where the shapes in each column are (mostly) thematically grouped. The only exception to this rule is the mirrored ‘L’ shape, which appears both in column #2 and in column #4. My strong suspicion is that this was an enciphering slip, where a simple geometric shape appeared in two different columns’ cipher alphabets by mistake.

Is this solvable? If I’m even roughly correct about the grouping, then S1 was, like S5, almost exactly the same category of cipher for which I put forward a sequence of challenge ciphers in 2017 (and all of which remain uncracked). There, the first challenge cipher was 153 symbols long, laid out in five perfectly cycling groups. This was more than twice as long as S1, and with the added benefit that I even told you exactly what kind of cipher it is. The second challenge ciphertext was slightly shorter (118 symbols): and so forth.

Can We Crack S1?

On the one hand, the multiplicity of the Scorpion ciphertexts is very high, meaning that pure homophone solvers stand almost no chance.

On the other hand, I’m pretty sure that these aren’t pure homophonic ciphers, insofar as each group of symbols almost certainly will have at most one A shape, at most one B shape etc. We might also try searching ‘down’ from setups that assume that repeated symbols in each group are not randomly chosen, but are most likely frequently used letters, e.g. ETAOINSH. With a long enough ciphertext to work with, this would be the preferred ‘classical’ way to attack the cipher: but, alas, we only have short ciphertexts to work with here. 🙁

However, my understanding is that there has been a handful of historical examples where particular ciphertexts of this general type (i.e. based around a cycle of interleaved cipher alphabets) have been cracked by determined cryptanalysts. So I’m not yet convinced it’s impossible.

All the same, has a specifically optimized machine algorithm for cracking these ever been put forward?

I posted up seven homophonic challenge ciphers a few days ago, and now – though it may sound a little counter-intuitive – I’d like to try to help you solve them (bear in mind I don’t know if they can be solved, but the whole point of the challenge is to find out).

Of the seven ciphers, #1 is the longest (and hence probably the easiest). Reformatted for ten columns rather than five (it uses five cycling alphabets ABCDE, i.e. “ABCDE ABCDE” over ten columns):

121,213,310,406,516, 108,200,323,416,513,
112,208,308,409,515, 102,216,309,425,509,
114,215,309,417,507, 102,201,323,401,517,
111,200,306,408,500, 113,203,313,407,512,
103,223,313,403,511, 119,213,316,416,511,
102,204,324,418,517, 120,203,324,407,516,
105,209,312,401,504, 117,208,310,408,500,
113,203,301,425,513, 115,201,313,408,515,
115,214,308,406,501, 122,204,322,408,509,
114,209,305,412,504, 117,213,316,402,509,
100,200,310,423,513, 100,214,320,419,509,
114,209,309,419,520, 101,200,320,416,518,
120,211,313,403,509, 103,207,313,421,513,
107,209,305,407,523, 115,224,313,416,508,
102,203,306,416,514, 107,200,310,401,509,
103,212,324,

Repeated Quadgram

Commenter Jarlve (whose interesting work on the Zodiac Killer ciphers some here may already know) noted that there is a repeated quadgram here, i.e. the sequence 408 500 113 203 appears twice.

This is entirely true, and also a very sensible starting point: I’ve highlighted this quadgram in the following diagram, along with all other repeated A-alphabet tokens (i.e. 100..125), and also any tokens they touch more than once (i.e. in the B and E alphabets):

Another thing that’s interesting here is that the 102 token (that appears four times and is coloured purple in the above) appears with four different letters before it as well as four different letters after it. In classical cryptology, that’s normally taken as a strong indicator that this is a vowel: and with the high instance count (4 out of 31, i.e. 12.9%), you might reasonably predict that this is E, A, O, or perhaps I (in order of decreasing likelihood).

[Note that I haven’t looked to check what letter this actually is: having created the challenge ciphers, I’ve just left them to one side, and don’t intend to look again at them.]

Similarly, the 114 token (that appears three times and is coloured green) is always preceded by 509, and is followed by 209 on two of the three instances. (Note that the token two after it is 309 in two of the three instances as well.) Again, in classical cryptology, these kinds of structured contacts are normally taken as strong indicators that this token enciphers a consonant: and with the high instance count (3 out of 31, i.e. 9.7%), you might reasonably predict that this enciphers T or possibly N, S, or H.
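Both the repeated quadgram and these contact counts can be reproduced mechanically from the ciphertext. The following is a small sketch that does exactly that for challenge cipher #1; the helper names `repeated_ngrams` and `contacts` are mine, and positions are counted from zero in reading order.

```python
from collections import defaultdict

# Challenge cipher #1, copied from the ten-column layout above.
CIPHER_1 = """
121,213,310,406,516, 108,200,323,416,513,
112,208,308,409,515, 102,216,309,425,509,
114,215,309,417,507, 102,201,323,401,517,
111,200,306,408,500, 113,203,313,407,512,
103,223,313,403,511, 119,213,316,416,511,
102,204,324,418,517, 120,203,324,407,516,
105,209,312,401,504, 117,208,310,408,500,
113,203,301,425,513, 115,201,313,408,515,
115,214,308,406,501, 122,204,322,408,509,
114,209,305,412,504, 117,213,316,402,509,
100,200,310,423,513, 100,214,320,419,509,
114,209,309,419,520, 101,200,320,416,518,
120,211,313,403,509, 103,207,313,421,513,
107,209,305,407,523, 115,224,313,416,508,
102,203,306,416,514, 107,200,310,401,509,
103,212,324,
"""
TOKENS = [int(t) for t in CIPHER_1.split(",") if t.strip()]


def repeated_ngrams(tokens, n=4):
    """Map each n-gram that occurs more than once to its start positions."""
    seen = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        seen[tuple(tokens[i:i + n])].append(i)
    return {gram: pos for gram, pos in seen.items() if len(pos) > 1}


def contacts(tokens, target):
    """Tokens found immediately before and immediately after `target`."""
    before = [tokens[i - 1] for i, t in enumerate(tokens)
              if t == target and i > 0]
    after = [tokens[i + 1] for i, t in enumerate(tokens)
             if t == target and i + 1 < len(tokens)]
    return before, after


print(repeated_ngrams(TOKENS, 4))
print(contacts(TOKENS, 102))
print(contacts(TOKENS, 114))
```

Counting the distinct values in each `contacts` list gives the “how many different neighbours” figure used in the vowel / consonant reasoning above.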

With these two examples in mind, it strikes me that for any given plaintext language (English in the case of these challenge ciphers) you could easily build up probability tables for repetitions of the two tokens before and the two tokens after any given token: and then use those as a basis to predict (for a given ciphertext length) which plaintext letter that token is most likely to encipher.

Though this may not sound like very much, because you can do this for all five of the alphabets independently, the results kind of rake across the ciphertext, yielding a grid of probabilistic clues that some clever person might well use as a basis for working towards the plaintext in ways that wouldn’t be possible with randomly-chosen homophonic ciphers. Just sayin’. 😉

And The Point Is…

It’s entirely true that for homophonic ciphers where each individual cipher is chosen at random, the difficulty of solving a reasonably short cipher with five homophones per letter would be very high. But knowing (as here) that each column is strictly limited to a given sub-alphabet, my point is that many of the tips and tricks of classical cryptology are also available to us, albeit in slightly different forms from normal.

Yet while it’s encouraging for solvers that there is a repeated quadgram here, I don’t currently believe that cipher #1 will be (quite) solvable with pencil and paper, as if it were a Sudoku extra-extra-hard puzzle (though as always, I’d be more than delighted to be proved wrong).

However, my hunch remains that strictly cycling homophonic ciphers may well prove to be surprisingly solvable using deviousness and computer assistance, and I look forward very much to seeing how they fare. 🙂

While thinking about the Scorpion S1 unsolved cipher in the last few days, it struck me that it seemed to be a special kind of homophonic cipher, one where the homophones are used in rigid groups.

That is: whereas the Zodiac Killer’s Z408 cipher cycled (mostly but not always) between sets of homophones by their appearance, it appears that the Scorpion S5 cipher maker instead rigidly cycled between 16 sets of homophones by column. What’s interesting about both cases is that the use pattern gives solvers extra information beyond that which they would have for a homophonic cipher where each homophone instance was chosen completely at random.

Perhaps there’s already a special name for this: but (for now) what I’m calling them is “constrained homophonic ciphers”, insofar as they are homophonic ciphers but where an additional use pattern constrains the specific way that the homophones are chosen.

The question I immediately wanted to know the answer to was this: can we solve these? And what better way to find this out than by issuing a challenge!

Seven Challenge Ciphers

The seven challenge ciphers are downloadable as a single zip file here, or as seven individual CSV files here:
* #1
* #2
* #3
* #4
* #5
* #6
* #7

How The Ciphers Were Made

Unlike normal challenge ciphers, what I’m giving you here (in line with Kerckhoffs’s Principle) is complete disclosure of the cipher system and even the plaintext language.

The cipher system used here is a homophonic cipher with exactly five possible homophones for each plaintext letter BUT where the homophones are strictly selected according to the column number in which they appear in the ciphertext. Each separate CSV uses its own individual key.

The plaintext language is English: they are straightforward sentences taken from a variety of books, and without any sadistic linguistic tricks (i.e. no “SEPIA AARDVARK” or similar to confuse the issue).

The enciphered files are simple CSV (comma-separated values) text files, arranged in rows of five letters at a time, but encoded as decimal numbers. For example, the first (and the longest) challenge cipher (“test1.csv”) begins as follows:

121,213,310,406,516,
108,200,323,416,513,
112,208,308,409,515,

Here, “121,213,310,406,516,” enciphers plaintext letters #1..#5, “108,200,323,416,513,” enciphers plaintext letters #6..#10, and so forth. The first column is numbered in the range 100..125 (i.e. these belong to the 1st homophonic alphabet), the second column 200..225 (i.e. these belong to the 2nd homophonic alphabet), and so forth.
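If you’d rather not parse the files by hand, here is a minimal reading sketch, assuming only the layout described above (rows of up to five decimal numbers, each row ending in a comma). The function names are hypothetical, and the sample text is just the three rows quoted above rather than a whole file.

```python
def parse_challenge_csv(text):
    """Turn the CSV text into a list of rows of integer tokens."""
    rows = []
    for line in text.splitlines():
        # The trailing comma leaves an empty last cell, which is dropped.
        row = [int(cell) for cell in line.split(",") if cell.strip()]
        if row:
            rows.append(row)
    return rows


def split_token(token):
    """Split a token into (alphabet number 1..5, index within that alphabet)."""
    return token // 100, token % 100


sample = """121,213,310,406,516,
108,200,323,416,513,
112,208,308,409,515,
"""
rows = parse_challenge_csv(sample)
print(rows[0], [split_token(t) for t in rows[0]])
```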

The start of the message and the end of the message are exactly as you would expect: there is no padding at either end, no embedded key information, just pure ciphertext.

The Rules

Treating this as a massively parallel book search using cloud databases (a) will be treated as cheating, and (b) will spoil it for other people, so please don’t do that. This challenge is purely about finding the limits of cryptanalysis, not about grandstanding with Big Data.

Hence you’ll need to also tell me (broadly) what you did in order to rise to the challenge, so that I can be sure you haven’t solved it through secondary or underhand means.

The Prize

If nobody solves any of the challenge ciphers by the end of 2017, my wallet stays shut.

However, the person (or indeed group) who has the most success decrypting any of these seven challenge ciphers by 31st December 2017 will be the “2017 Cipher Mysteries Cipher Champion”, and will also receive a shockingly generous £10 prize (sent anywhere in the world where PayPal can send money) to spend as they wish.

In the case of multiple entrants solving the same difficulty cipher independently, I’ll award the prize to the first to contact me. In all cases, please leave a comment below.

In all situations, my decision is final, absolute, arbitrary and there is no opportunity for appeal. Just so you know.

PS: any individual (or indeed covert agency) wishing to donate more money to increase the prize fund (i.e. to make a little more cryptanalytic sport of this), please feel free to email me.

Hints and Tips

I suspect that the multiplicity (i.e. the number of different symbols used divided by the length of the ciphertext) will prove to be too high and the ciphertext lengths too short for conventional homophonic decryption programmes, so I expect prospective solvers won’t be able to look to these for any great help.

Similarly, I don’t believe that numerical brute force and/or parallel processing will be sufficient here: all the same, these challenges (if solvable) will probably prove to be things that anyone anywhere can tackle (e.g. through hill-climbing and cleverly exploiting the constraints), not just the NSA, GCHQ or similar with their supercomputers.

For what it’s worth, my best guess right now is that #1 (the longest of the seven ciphertexts) will prove to be solvable… though only just. Even so, I’d be delighted to be proved wrong for any of the others.

Incidentally, I chose the length of the very shortest challenge cipher to broadly match the length of the Scorpion S1 cipher: so even in the (perhaps unlikely) case where all seven of my challenge ciphers get solved, there’ll still be an eighth challenge to direct your clever efforts at. 😉

I’ve blogged a few times about trying to crack the Scorpion Ciphers (a series of apparently homophonic ciphers sent to American crime TV host John Walsh). Most of my effort has been spent on the Scorpion S5 cipher, which (despite having 12 columns) appears to be rigidly cycling between 16 cipher alphabets.

However, it struck me a few days ago that this might also give us a way in to the Scorpion S1 cipher. This is because all the repeats there seem to be at a column distance of 0, 1, 4, 5, and 6, with the overwhelming majority of repeats at column distances 0 and 5. (The only exception is the “backwards L” glyph, which appears in two pairs, one pair at column distance 0 apart, and the other at column distance 5 apart.)

The Slippy S1 Five-Alphabet Hypothesis

Putting the 16-alphabet-cycle from S5 together with the mostly-0-or-5-column-distance observation from S1 yields my “Slippy S1 Five-Alphabet Hypothesis”: that Scorpion S1 was constructed from a cycle of 5 cipher alphabets, where the encipherer always reset to alphabet #1 at the beginning of a line, and usually (but not always) stepped to the next alphabet along with each new column.

So whereas a rigid 5-alphabet cycle (i.e. with no slips) would have a fixed alphabet “ownership” of 1234512345 for each ten-glyph line, I suspect that we can make a “slippy” guess for S1’s cycle ownership, to try to reconstruct where the encipherer slipped from one cycle into the next. My best current set of guesses for S1 is therefore:

1234512235
1234512344
1234412345
1234512345
1234112345
1234551245
2234512345

(Note that I suspect that the “backwards L” shape appears on two alphabets, i.e. once in alphabet #2 and once in alphabet #4, but that this is the only exception to the rule.)

What this means is that each of the five alphabets has only 26 glyphs in it (one for each letter of the alphabet): and so we can tell that if two shapes are numbered as being in the same alphabet, they are very probably two different letters.

Can We Solve This?

53 of S1’s 10 x 7 = 70 glyphs are unique, yielding a high multiplicity of 75.7%. By way of comparison, it would seem that normal (unstructured) homophonic ciphers are only solvable when their multiplicity is around the 20%-25% mark.
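As a worked version of that arithmetic: multiplicity is just the number of distinct symbols divided by the ciphertext length. The sketch below uses a stand-in list with 53 distinct values out of 70 (not a real transcription of S1), and the function name is my own.

```python
def multiplicity(symbols):
    """Number of distinct symbols divided by the total number of symbols."""
    return len(set(symbols)) / len(symbols)


# Stand-in for S1: 70 symbols of which 53 are distinct (values are arbitrary).
stand_in = list(range(53)) + [0] * 17
print(f"{multiplicity(stand_in):.1%}")
```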

However, the question here is whether being able to group the letters into five unique alphabets (even probabilistically) reduces the number of combinations enough to make this genuinely solvable. As normal, pencil-and-paper solvers can make some pretty good guesses, e.g. the “S Λ” pair on lines #3 and #6 probably enciphers “TH”, while any repeated letter stands a good chance of being a normal high-frequency letter such as ETAOINS etc: but computers would do this much better.

My instinct is that this should be a good candidate for hill-climbing: and that the one-glyph-per-letter-per-alphabet constraint will prove reasonably effective. But effective enough? We’ll have to wait and see…

Incidentally, a good sanity check for this Scorpion S1 hypothesis would be to run some “forward simulations” (which is the kind of thing Dave Oranchak has done so much of with the Zodiac Killer Ciphers). By which I mean: if we feed a variety of 70-letter English texts into my best guess set of slippy cycles (i.e. “ITWASTHEBESTOFTIMESI” fed into 1234512235 / 1234512344 would become: “I1 T2 W3 A4 S5 T1 H2 E2 B3 E5 S1 T2 O3 F4 T5 I1 M2 E3 S4 I4”), I predict that the final average multiplicity of the texts will be close to 75%. But I might be wrong!
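Here is a minimal sketch of that forward simulation, so that anyone with a pile of 70-letter English texts can try it. It labels each plaintext letter with the alphabet number that “owns” its position, then measures the multiplicity of the result. The function names are hypothetical, and `S1_OWNERSHIP` is simply my seven guessed rows from above strung together.

```python
# My best-guess "slippy" ownership for S1's seven ten-glyph lines (from above).
S1_OWNERSHIP = ("1234512235" "1234512344" "1234412345" "1234512345"
                "1234112345" "1234551245" "2234512345")


def label_with_cycles(plaintext, ownership):
    """Pair each plaintext letter with the alphabet digit owning its position."""
    letters = [c for c in plaintext.upper() if c.isalpha()]
    digits = [d for d in ownership if d.isdigit()]
    if len(letters) > len(digits):
        raise ValueError("plaintext is longer than the ownership pattern")
    return [f"{c}{d}" for c, d in zip(letters, digits)]


def multiplicity(symbols):
    """Number of distinct symbols divided by the total number of symbols."""
    return len(set(symbols)) / len(symbols)


labelled = label_with_cycles("ITWASTHEBESTOFTIMESI", "1234512235 / 1234512344")
print(" ".join(labelled))
print(multiplicity(labelled))
```

Averaging `multiplicity(label_with_cycles(text, S1_OWNERSHIP))` over many different 70-letter texts would then give the figure to compare against S1’s 75.7%.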

I mentioned in a previous post that I thought that the Scorpion S5 cipher’s numerous shape families might offer a backdoor into its cipher system, if they just happened to be elegantly arranged on downward diagonals. I pointed out that if this were correct, the “dice” shape family that appears in columns 1, 3, 4, 8 (twice), 9, 12, 14, 15 would be most likely to have been arranged such that A was 1, C was 3, D was 4 (and so forth).

However, I didn’t actually get so far as calculating the precise probabilities in that post: but now I have (I think).

In my Scorpion spreadsheet, the total probability that a specific family was enciphered as a specific sequential set of letters is calculated as the product of each individual letter’s likelihood. By ‘likelihood’ here, I mean not the probability of that letter occurring randomly (i.e. P, its raw instance probability), but the chances of that occurring exactly N times within a column of letters of height H. And in Excel, you calculate this function using the in-built function ‘BINOMDIST(N, H, P, false)‘. (Note that instead using ‘BINOMDIST(N, H, P, true)’ would calculate the cumulative likelihood of that happening, i.e. the chances of that probability P event happening 0 times up to N times out of a maximum of H times.)

For the raw instance probability values, I used the Scorpion encipherer’s plaintext as a reasonable approximation of the text we are likely to find encrypted inside the S5 cipher. I think there’s a pretty good chance that it will be good enough.

As for the height H: once you have rearranged the message according to the 16 apparent columns of the ciphertext, columns 1 to 4 contain 12 instances each, while columns 5 to 16 contain 11 instances each. All of which means that the binomial probability table for N out of 11 looks like this:

binomial-probabilities-11

For example, even though the raw instance probability for ‘E’ is 11.35%, the chance that a given 11-high column of letters will contain exactly one ‘E’ is 37.4271% (or so my spreadsheet says, anyway).
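For anyone without Excel to hand, the same binomial calculation can be sketched in a few lines of Python. This is just the standard binomial formula, not a copy of the spreadsheet; the function names are mine, and the only value taken from the post is the 11.35% raw probability for ‘E’.

```python
from math import comb

def binom_pmf(n, h, p):
    """Chance of exactly n hits in h trials: Excel's BINOMDIST(n, h, p, FALSE)."""
    return comb(h, n) * p**n * (1 - p) ** (h - n)

def binom_cdf(n, h, p):
    """Chance of 0 to n hits in h trials: Excel's BINOMDIST(n, h, p, TRUE)."""
    return sum(binom_pmf(k, h, p) for k in range(n + 1))

p_e = 0.1135  # raw instance probability for 'E', as quoted above
exactly_one = binom_pmf(1, 11, p_e)
at_most_one = binom_cdf(1, 11, p_e)

print(f"exactly one E in an 11-high column: {exactly_one:.4%}")
print(f"at most one E in an 11-high column: {at_most_one:.4%}")
```

The first figure should come out at roughly 37.4%, in line with the spreadsheet value; any small difference in the later decimal places would be down to the 11.35% figure being rounded.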

But rather than limit the calculation only to length-16 families, I added a trick whereby shorter families can be checked against other diagonals in the cipher table. If you use the number 99 as the count for an individual family’s column, the spreadsheet works around it in the calculation, by allowing the shifted alphabet to start not at ‘A’ but at ‘z’ (i.e. ‘A – 1’).

I’ve included 11 shape families from the S5 cipher: if you copy a row from any one of these across to row #33, the spreadsheet will calculate a composite ranking value for each of 28 different offsets in column U (the ‘Result’ column). This is equal to the final probability times a million (or else the numbers would be too small to be practical).
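As a rough illustration of the ranking idea (and emphatically not a reproduction of the spreadsheet), the sketch below multiplies together the per-column binomial likelihoods for each possible starting letter. Two assumptions to flag loudly: the letter frequencies are approximate textbook English values standing in for the Scorpion-derived ones, and only the columns where the family actually appears are included in the product. All names here are hypothetical.

```python
from math import comb
import string

# ASSUMPTION: approximate general-English letter frequencies (percent), A to Z.
# The spreadsheet uses frequencies taken from the Scorpion's own text instead.
FREQ_PERCENT = dict(zip(string.ascii_uppercase, [
    8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.77, 4.0, 2.4,
    6.7, 7.5, 1.9, 0.095, 6.0, 6.3, 9.1, 2.8, 0.98, 2.4, 0.15, 2.0, 0.074]))

def binom_pmf(n, h, p):
    """Chance of exactly n hits in h trials (Excel's BINOMDIST(n, h, p, FALSE))."""
    return comb(h, n) * p**n * (1 - p) ** (h - n)

def column_height(col):
    """Columns 1 to 4 hold 12 shapes each, columns 5 to 16 hold 11 each."""
    return 12 if col <= 4 else 11

def rank_offsets(counts):
    """Score each starting letter for a family, given {column: times seen}."""
    last_col = max(counts)
    scores = {}
    # Only try starting letters that keep the last observed column within A-Z.
    for offset in range(26 - (last_col - 1)):
        score = 1.0
        for col, n in counts.items():
            letter = string.ascii_uppercase[offset + col - 1]
            p = FREQ_PERCENT[letter] / 100
            score *= binom_pmf(n, column_height(col), p)
        # Scale by a million so the numbers are readable.
        scores[string.ascii_uppercase[offset]] = score * 1_000_000
    return scores

# The dice family: columns 1, 3, 4, 8 (twice), 9, 12, 14, 15.
DICE = {1: 1, 3: 1, 4: 1, 8: 2, 9: 1, 12: 1, 14: 1, 15: 1}

scores = rank_offsets(DICE)
for start, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:12.6f} {start}")
```

Because the frequencies differ from the spreadsheet’s, the absolute numbers will not match the rankings listed in this post; the point is only to show the shape of the calculation.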

For example, the relative rankings for the dice family are:-

2.737265 A
0.013655 B
0.000000 C
0.000046 D
0.293415 E
0.018483 F
0.093272 G
0.000451 H
0.000078 I
0.000000 J
0.009360 K
0.074230 L

Here, the ranking for ‘A’ (2.737265) is more than 9x the ranking for second-placed ‘E’ (0.293415), which is essentially what my initial imprecise guess was (thank goodness). 🙂

It’ll take a while to figure out what this all means, but I thought I’d post the basic spreadsheet sooner rather than later. 🙂

I’ve been thinking a little more about how to go about cracking Scorpion Cipher S5.

I mentioned before that I thought that the encipherer might well have started from an elegant-looking 26×16 grid filled with diagonally-downward families of shapes, and that this arrangement might offer codebreakers some additional kind of “spatial logic” to support their efforts that traditional ciphers don’t usually provide.

From the letters that accompanied the ciphertexts, my inference is that the Scorpion is like a smart 12-year-old who has just ‘got’ the elegance of maths: but this leads me to a secondary inference that he/she probably didn’t understand modulo addition, because if he/she did, then we would surely have seen more 16-element shape families in the text.

I’ll explain with the help of a diagram of the kind of 26×16 grid I’m talking about:

scorpion-cipher-26x16-grid

If the encipherer had laid out his/her grid with modulo-26 maths in mind, then 16-element families that start in the orange (top right) area and step diagonally down and to the right (as I predict) should wrap around (modulo 26) to the yellow (bottom left) area. However, I believe that we don’t see nearly enough length-16 shape families to support that grid-filling model.

What I think actually happened was that the encipherer only started length-16 families in the A-K range for alphabet #1, which would have ended on P-Z for alphabet #16. This means, for example, that because the ‘dice’ family (actually, the ‘dots in a square’ family, to be precise) has members in alphabets 1, 3, 4, 8, 9, 12, 14, and 15, we may well be able to directly infer that its very first member (in alphabet #1) is A-L.

Moreover, given that the lowest frequency letters in the encipherer’s accompanying letters are…

k : 0.4%
x : 0.3%
j : 0.1%
z : 0.0%
q : 0.0%

…we may also be able to make a reasonable guess as to which possibilities of A-L are the least likely. For example, because the dice family appears in columns 1/3/4/8/9/12/14/15 (of the 16-column sequence I discussed before), this would map to:

+0 : ACDHILNO --- OK
+1 : BDEIJMOP --- has J, so fairly unlikely
+2 : CEFJKNPQ --- has J, K and Q, so not likely at all
+3 : DFGKLOQR --- has K and Q, so not likely
+4 : EGHLMPRS --- OK
+5 : FHIMNQST --- has Q, so not likely
+6 : GIJNORTU --- has J, so fairly unlikely
+7 : HJKOPSUV --- has J and K, so not likely
+8 : IKLPQTVW --- has K and Q, so not likely
+9 : JLMQRUWX --- has J, Q and X, so not likely at all
+10: KMNRSVXY --- has K and X, so not likely
+11: LNOSTWYZ --- has Z, so not likely

So in fact, I suspect that we already know enough to guess that the dice family members encipher either ACDHILNO or EGHLMPRS (in sequence), which I think isn’t a bad starting point at all.
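The table of shifts above is easy to regenerate mechanically. Here is a short Python sketch that does so, flagging any candidate that contains one of the five rarest letters (k, x, j, z, q) from the encipherer’s accompanying letters. The column list and rare-letter set come straight from this post; the function name is my own.

```python
import string

DICE_COLUMNS = [1, 3, 4, 8, 9, 12, 14, 15]
RARE_LETTERS = set("KXJZQ")

def shifted_family(columns, offset):
    """Letters a family would encipher if its column-1 member stood for A+offset."""
    return "".join(string.ascii_uppercase[offset + col - 1] for col in columns)

plausible = []
for offset in range(12):  # A to L, so that column 15 never runs past Z
    letters = shifted_family(DICE_COLUMNS, offset)
    rare = sorted(set(letters) & RARE_LETTERS)
    verdict = "has " + ", ".join(rare) if rare else "OK"
    if not rare:
        plausible.append(letters)
    print(f"+{offset:<2}: {letters} --- {verdict}")

print("candidates without rare letters:", plausible)
```

This only applies a crude “contains a rare letter” filter; it does not weigh how rare each letter is, which is what a proper probability calculation would do.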

Finally, I suspect there’s something of a cryptological paradox in play here: the more alphabets are involved, the more spatial structure we have to work with. Hence S5’s 16 alphabets might well make it surprisingly crackable. 🙂

As I reported in a post last year (2014), even though the fifth “Scorpion Cipher” (i.e. ‘S5’) sent to John Walsh is arranged using a 12-column layout, it has a very strong internal 16-column structure. What this means is that every single shape repeat spans a distance that is a multiple of 16: which in turn suggests that the encipherer formed the S5 ciphertext by rigidly cycling through a set of 16 simple substitution cipher alphabets.

If you therefore rearrange S5’s shapes into a 16-column layout and colourize their repeats, you get something like the following (click on it to see a higher resolution version):

S5-rearranged-colourized

Now, 155 out of S5’s 180 characters are unique, giving it a ‘multiplicity’ (155/180) of 86%, which is way too high to be cracked using a conventional homophonic cipher solver. For comparison, the three Beale Ciphers have multiplicities of 57%, 24%, and 43% respectively, while the (solved) Zodiac Z408’s multiplicity is a paltry 13%. In fact, the upper limit on solvability for homophonic ciphertexts seems to be multiplicities of around 20%-25% if you’re lucky (or 10%-15% if you’re not), so S5 would at first sight seem to be waaaaaay out of anybody’s practical range.

But I’m not so sure.

Going through what has been released of the encipherer’s letters that accompanied the ciphertexts, he/she starts by saying:-

This code took a lot of time and effort to develop, in hopes that it will defeat FBI and CIA codebreakers.

Which is ‘kind of reasonable’, though the whole enciphering activity would seem to be somewhat pointless unless the person’s overall aim was to somehow emulate the original Zodiac Killer’s ciphers. In a later letter, the encipherer’s position gets finessed somewhat:

I now realise with many hundeds of hours of […] mindracking experimentation with my complex ciphers that my first one that I sent you [S1] was comparatively simple to my second [S2], third [S3], fourth [S4], and now temporarily final cryptograph system [S5]. I have been encoding useful information for your use and have done it fairly, since all of my ciphers can be decoded simply, once the limited patterns and systems are discovered.

What we learn from this, I think, is that what we are looking at here is not the product of a psychopathic academic cryptographer, but is rather a homebrewed cipher system, based around “limited patterns and systems”. So, a bright kid; probably good at maths; and has perhaps read enough popular cryptography (through and beyond the newspaper accounts of the Zodiac Killer’s ciphers) to avoid clunkingly obvious mistakes.

But the mention of “patterns” makes me suspect that there’s also a little bit of the vanity of the pure mathematician there, intellectual pride that all it would take to “defeat FBI and CIA codebreakers” was “limited patterns and systems”. Hence I think we are likely to be looking at something that is innately very ordered, something that we’ll all kick ourselves for not seeing when it is shown to us in the fullness of time. “What a clever person the Scorpion Cipher maker was“, we’re all supposed to say (according to that fantasy script), “much better at making ciphers than the Zodiac Killer ever was“.

In the case of S5, though, I suspect we now know just about enough to break it, even with its dauntingly high multiplicity.

My first observation is that even though it uses a large number of different shapes, these are drawn from a very much smaller set of shape families: and there may well be some kind of cryptographic relationship between the members of each family to help us:-

S5-shape-families

My second observation is that, with the exception of columns 10 and 11 (which may well be random, or possibly ‘S’ vs ‘T’ in the plaintext), the most frequent symbol in any column is always from a different family from the most frequent shape in any other column. It’s not the strongest of observations, sure, but it’s what leads me to my (grandly titled) S5 Construction Hypothesis.

My S5 Construction Hypothesis

I believe that the encipherer very probably constructed 16 cipher alphabets on gridded paper, within a 26 x 16 or perhaps a 16 x 26 grid. But this is a boring activity, and the encipherer’s text suggests a kind of proto-mathematical desire for elegance, like a smart 12-year-old who has just ‘got’ the whole idea of mathematics. So I hypothesize that the encipherer filled this rectangular grid with families of shapes along downward diagonals, from top-left to bottom-right.

Hence for the sixteen component alphabets, any genuine (as opposed to accidental) family of shapes would step through the alphabets. Here, a family that had a member enciphering A in alphabet #1 would also have a member enciphering B in alphabet #2, and maybe a member enciphering C in alphabet #3 etc.

This suggests a quite different kind of cryptologic solving logic from normal, one that not only offers us mathematical means to reduce the multiplicity (because we can posit connections between letters in different columns, giving us fewer degrees of freedom to steer our way through), but also spatial means to do the same thing.

What I mean by ‘spatial’ here is that if we look at, say, the family of shapes formed of squares with dots in, I think we might be able to assume that not only are these all part of the same family, but also all the missing shapes on columns without a similar family member can be excluded from the search.

That is, if alphabet #1 uses a square with dots in to encipher ‘A’ and alphabet #3 uses a different square with dots in to encipher ‘C’, then we can very probably infer that alphabet #2 uses a square with dots in to encipher ‘B’, even though we cannot actually see it in the ciphertext. Hence this kind of ‘holistic exclusion’ offers a spatial way to help us reduce the search space.
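To show what this inference might look like in practice, here is a small Python sketch. It assumes (as per the hypothesis above) that a family steps one letter per alphabet along a downward diagonal with no wrap-around; given a few guessed (alphabet, letter) pairs, it checks that they lie on one diagonal and fills in the unseen members. The function name and the dictionary layout are mine, purely for illustration.

```python
import string

def fill_diagonal(observed, n_alphabets=16):
    """Extend guessed {alphabet number: letter} pairs along one diagonal.

    Returns {alphabet number: letter} for every alphabet the diagonal
    reaches, or None if the guesses do not all sit on the same diagonal.
    """
    # Each guess implies which letter the family would encipher in alphabet #1.
    starts = {
        string.ascii_uppercase.index(letter) - (alpha - 1)
        for alpha, letter in observed.items()
    }
    if len(starts) != 1:
        return None  # the guesses contradict each other
    start = starts.pop()
    return {
        alpha: string.ascii_uppercase[start + alpha - 1]
        for alpha in range(1, n_alphabets + 1)
        if 0 <= start + alpha - 1 < 26  # no modulo-26 wrap-around
    }

family = fill_diagonal({1: "A", 3: "C"})
print(family[2])                       # the unseen member in alphabet #2
print(fill_diagonal({1: "A", 3: "D"}))  # inconsistent guesses
```

Each inferred member is a letter that no other shape in that alphabet can stand for, which is where the reduction in search space would come from.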

Of course, turning this visuo-spatial hypothesis into an effective computer algorithm will doubtless prove quite tricky. But perhaps it offers a way of making S5’s cryptologic challenge more tractable than it would be if it were a pure homophonic cipher with such a scarily high multiplicity.

Following my recent Scorpion Ciphers post, I’ve put up a permanent reference page on the Scorpion Ciphers and have also tried to contact John Walsh about the as-yet-unreleased other ciphers… so we’ll see how that goes.

Since then, I’ve been working a little more with S5, which has 155 unique symbols out of 180 letters. Because repeated symbols in S5 are always multiples of 16 letters apart, it seems likely to me that this ciphertext was constructed from 16 independent alphabets cycled through in strict sequence. My hope was that this regularity might give us a better chance of cracking S5 than if it were a randomly chosen homophonic cipher.
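The “always multiples of 16 apart” observation is simple to test on any transcription. Here is a minimal Python sketch: it takes the greatest common divisor of the gaps between successive occurrences of each repeated symbol. No S5 transcription is included here, so the demo data is a made-up four-alphabet toy, and the function name is my own.

```python
from collections import defaultdict
from functools import reduce
from math import gcd

def repeat_gap_gcd(symbols):
    """GCD of the distances between repeats of the same symbol (0 if none repeat)."""
    positions = defaultdict(list)
    for index, symbol in enumerate(symbols):
        positions[symbol].append(index)
    gaps = [
        later - earlier
        for where in positions.values()
        for earlier, later in zip(where, where[1:])
    ]
    return reduce(gcd, gaps, 0)

# Toy data (NOT S5): a text enciphered by cycling through 4 alphabets, where
# each (alphabet, letter) pair stands in for a distinct cipher shape.
toy_text = "ABABABABABAB"
toy_symbols = [(i % 4, ch) for i, ch in enumerate(toy_text)]

print(repeat_gap_gcd(toy_symbols))
```

For a transcription of S5, a result that is a multiple of 16 would match the observation above; a result of 1 would mean the repeats show no common period at all.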

All the same, this was just a guess: so the first thing I did was come up with a way to test this hypothesis, by writing a short C program to encipher 180-long subsections of the Scorpion’s own plaintext using various numbers of sequential alphabets, to see if this would produce roughly 155 unique symbols.

For each number of alphabets (e.g. 2), I tried (notionally) enciphering every 180-long stretch of the Scorpion’s text, and kept a tally of the minimum number of symbols required (e.g. 37), the maximum number of symbols required (e.g. 44), and the average number of symbols required (e.g. 40).
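My original program was in C and isn’t reproduced here, but the test described above can be sketched in Python along these lines. The assumption is that each (alphabet number, letter) pair gets its own cipher symbol, with the alphabets cycled in strict sequence from the start of each window; the function name is hypothetical and the sample text is a placeholder rather than the Scorpion’s letters.

```python
def unique_symbol_stats(text, n_alphabets, window=180):
    """Min, max and rounded mean number of distinct symbols over all windows."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if len(letters) < window:
        raise ValueError("text is shorter than one window")
    counts = []
    for start in range(len(letters) - window + 1):
        chunk = letters[start:start + window]
        # One symbol per distinct (alphabet index, plaintext letter) pair.
        symbols = {(i % n_alphabets, ch) for i, ch in enumerate(chunk)}
        counts.append(len(symbols))
    return min(counts), max(counts), round(sum(counts) / len(counts))

# Placeholder text only: substitute the real plaintext to rerun the test.
sample = "ABCD" * 100

for n in (1, 2, 3):
    low, high, mean = unique_symbol_stats(sample, n)
    print(f"alphabets = {n}, uniques = ({low}..{high}) {mean}")
```

The printed format mirrors the results table, but with placeholder text the numbers themselves mean nothing.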

Interestingly, the results weren’t what I expected:-

alphabets = 1, uniques = (19..24) 21
alphabets = 2, uniques = (37..44) 40
alphabets = 3, uniques = (50..61) 55
alphabets = 4, uniques = (60..74) 68
alphabets = 5, uniques = (72..86) 79
alphabets = 6, uniques = (77..97) 87
alphabets = 7, uniques = (88..105) 97
alphabets = 8, uniques = (91..110) 101
alphabets = 9, uniques = (92..116) 106
alphabets = 10, uniques = (104..122) 113
alphabets = 11, uniques = (107..127) 117
alphabets = 12, uniques = (113..136) 122
alphabets = 13, uniques = (113..134) 123
alphabets = 14, uniques = (115..138) 129
alphabets = 15, uniques = (123..146) 132
alphabets = 16, uniques = (120..147) 133
alphabets = 17, uniques = (128..146) 136
alphabets = 18, uniques = (126..151) 137
alphabets = 19, uniques = (128..150) 139
alphabets = 20, uniques = (132..153) 143
alphabets = 21, uniques = (133..159) 144
alphabets = 22, uniques = (131..155) 145
alphabets = 23, uniques = (137..154) 145
alphabets = 24, uniques = (137..157) 147
alphabets = 25, uniques = (139..160) 149
alphabets = 26, uniques = (141..158) 149
alphabets = 27, uniques = (143..163) 152
alphabets = 28, uniques = (143..164) 152
alphabets = 29, uniques = (139..164) 153
alphabets = 30, uniques = (145..164) 154
alphabets = 31, uniques = (143..164) 153
alphabets = 32, uniques = (146..167) 156

That is to say, even though S5 looks as though it is strictly cycling through 16 ciphers, this isn’t consistent with the stats of the Scorpion’s other plaintext: that text is so verbose and repetitive that 16 alphabets would yield only about 133 symbols on average, and it would take roughly 30 to 32 alphabets to yield 155.

What I think this is implying is either (a) that the Scorpion’s plaintext is significantly less repetitive than the text of his/her messages, or (b) that the cipher system the Scorpion used also employs an extra layer of compression (e.g. a nomenclatura, using extra tokens for common words such as [THE] and [AND], or even common syllable pairs).

I don’t know… I’ll have to have a further think about this, it isn’t at all obvious what’s going on here.


Update: having scratched my head about this for a few more hours, I don’t feel comfortable with the suggestion that some kind of nomenclatura is involved. Rather, what I suspect now is that what we’re looking at here is not a 16 x 26-token set of ciphers (i.e. A-Z) but a 16 x 36-token set of ciphers (i.e. A-Z plus 0-9), coupled with a slightly less verbose plaintext. Hence my very rough (and admittedly as yet unmodelled) estimate is that roughly 25-35 of the tokens in the plaintext will turn out to be digits.

Unfortunately, I also think that this may have left the text undecryptable, unless there is some additional kind of meta-consistency between shapes across the 16 alphabets (e.g. if all the circle-plus-upright-cross shapes encode the same underlying plaintext token). Oh well!