I posted up seven homophonic challenge ciphers a few days ago, and now – though it may sound a little counter-intuitive – I’d like to try to help you solve them (bear in mind I don’t know if they can be solved, but the whole point of the challenge is to find out).

Of the seven ciphers, #1 is the longest (and hence probably the easiest). Reformatted for ten columns rather than five (it uses five cycling alphabets ABCDE, i.e. “ABCDE ABCDE” over ten columns):

121,213,310,406,516, 108,200,323,416,513,
112,208,308,409,515, 102,216,309,425,509,
114,215,309,417,507, 102,201,323,401,517,
111,200,306,408,500, 113,203,313,407,512,
103,223,313,403,511, 119,213,316,416,511,
102,204,324,418,517, 120,203,324,407,516,
105,209,312,401,504, 117,208,310,408,500,
113,203,301,425,513, 115,201,313,408,515,
115,214,308,406,501, 122,204,322,408,509,
114,209,305,412,504, 117,213,316,402,509,
100,200,310,423,513, 100,214,320,419,509,
114,209,309,419,520, 101,200,320,416,518,
120,211,313,403,509, 103,207,313,421,513,
107,209,305,407,523, 115,224,313,416,508,
102,203,306,416,514, 107,200,310,401,509,
103,212,324,

Repeated Quadgram

Commenter Jarlve (whose interesting work on the Zodiac Killer ciphers some here may already know) noted that there is a repeated quadgram here, i.e. the sequence 408 500 113 203 appears twice.
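As a quick way to check this kind of observation, here is a minimal Python sketch that looks for repeated n-grams in cipher #1. The ciphertext is copied from the listing above; the helper names (`parse_tokens`, `repeated_ngrams`) are just illustrative choices, not anything Jarlve necessarily used.

```python
from collections import defaultdict

# Challenge cipher #1, copied from the ten-column listing above.
CIPHER_1 = """
121,213,310,406,516, 108,200,323,416,513,
112,208,308,409,515, 102,216,309,425,509,
114,215,309,417,507, 102,201,323,401,517,
111,200,306,408,500, 113,203,313,407,512,
103,223,313,403,511, 119,213,316,416,511,
102,204,324,418,517, 120,203,324,407,516,
105,209,312,401,504, 117,208,310,408,500,
113,203,301,425,513, 115,201,313,408,515,
115,214,308,406,501, 122,204,322,408,509,
114,209,305,412,504, 117,213,316,402,509,
100,200,310,423,513, 100,214,320,419,509,
114,209,309,419,520, 101,200,320,416,518,
120,211,313,403,509, 103,207,313,421,513,
107,209,305,407,523, 115,224,313,416,508,
102,203,306,416,514, 107,200,310,401,509,
103,212,324,
"""

def parse_tokens(text):
    """Split a comma-separated listing into a flat list of integer tokens."""
    return [int(t) for t in text.replace("\n", " ").split(",") if t.strip()]

def repeated_ngrams(tokens, n):
    """Map each n-gram that occurs more than once to its 0-based start positions."""
    positions = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        positions[tuple(tokens[i:i + n])].append(i)
    return {gram: pos for gram, pos in positions.items() if len(pos) > 1}

tokens = parse_tokens(CIPHER_1)
for gram, pos in repeated_ngrams(tokens, 4).items():
    print(gram, pos)
```

The quadgram 408 500 113 203 is reported at 0-based positions 33 and 68, i.e. 35 tokens apart, which is a multiple of five (as it must be when each alphabet is tied to a column).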

This is entirely true, and also a very sensible starting point: I’ve highlighted this quadgram in the following diagram, along with all other repeated A-alphabet tokens (i.e. 100..125), and also any tokens they touch more than once (i.e. in the B and E alphabets):

Another thing that’s interesting here is that the 102 token (that appears four times and is coloured purple in the above) appears with four different letters before it as well as four different letters after it. In classical cryptology, that’s normally taken as a strong indicator that this is a vowel: and with the high instance count (4 out of 31, i.e. 12.9%), you might reasonably predict that this is E, A, O, or perhaps I (in order of decreasing likelihood).

[Note that I haven’t looked to check what letter this actually is: having created the challenge ciphers, I’ve just left them to one side, and don’t intend to look again at them.]

Similarly, the 114 token (that appears three times and is coloured green) is always preceded by 509, and is followed by 209 on two of the three instances. (Note that the token two after it is 309 in two of the three instances as well.) Again, in classical cryptology, these kinds of structured contacts are normally taken as strong indicators that this token enciphers a consonant: and with the high instance count (3 out of 31, i.e. 9.7%), you might reasonably predict that this enciphers T or possibly N, S, or H.
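The contact counts for 102 and 114 can be reproduced with a few lines of Python. This is only a sketch of the counting step (the function name `contacts` is my own), and it says nothing about which letter a token actually enciphers.

```python
# Challenge cipher #1, copied from the ten-column listing above.
CIPHER_1 = """
121,213,310,406,516, 108,200,323,416,513,
112,208,308,409,515, 102,216,309,425,509,
114,215,309,417,507, 102,201,323,401,517,
111,200,306,408,500, 113,203,313,407,512,
103,223,313,403,511, 119,213,316,416,511,
102,204,324,418,517, 120,203,324,407,516,
105,209,312,401,504, 117,208,310,408,500,
113,203,301,425,513, 115,201,313,408,515,
115,214,308,406,501, 122,204,322,408,509,
114,209,305,412,504, 117,213,316,402,509,
100,200,310,423,513, 100,214,320,419,509,
114,209,309,419,520, 101,200,320,416,518,
120,211,313,403,509, 103,207,313,421,513,
107,209,305,407,523, 115,224,313,416,508,
102,203,306,416,514, 107,200,310,401,509,
103,212,324,
"""
tokens = [int(t) for t in CIPHER_1.replace("\n", " ").split(",") if t.strip()]

def contacts(tokens, target):
    """Return (count, set of tokens seen just before, set seen just after)."""
    before, after = set(), set()
    count = 0
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        count += 1
        if i > 0:
            before.add(tokens[i - 1])
        if i + 1 < len(tokens):
            after.add(tokens[i + 1])
    return count, before, after

for target in (102, 114):
    count, before, after = contacts(tokens, target)
    print(target, count, sorted(before), sorted(after))
```

A token with many different neighbours on both sides (like 102) behaves as vowels tend to; a token with few, repeated neighbours (like 114) behaves as consonants tend to.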

With these two examples in mind, it strikes me that for any given plaintext language (English in the case of these challenge ciphers) you could easily build up probability tables for repetitions of the two tokens before and the two tokens after any given token: and then use those as a basis to predict (for a given ciphertext length) which plaintext letter each token is most likely to encipher.

Though this may not sound like very much, because you can do this for all five of the alphabets independently, the results kind of rake across the ciphertext, yielding a grid of probabilistic clues that some clever person might well use as a basis for working towards the plaintext in ways that wouldn’t be possible with randomly-chosen homophonic ciphers. Just sayin’. 😉

And The Point Is…

It’s entirely true that for homophonic ciphers where each individual homophone is chosen at random, the difficulty of solving a reasonably short cipher with five homophones per letter would be very high. But knowing (as here) that each column is strictly limited to a given sub-alphabet, my point is that many of the tips and tricks of classical cryptology are also available to us, albeit in slightly different forms from normal.

Yet while it’s encouraging for solvers that there is a repeated quadgram here, I don’t currently believe that cipher #1 will be (quite) solvable with pencil and paper, as if it were a Sudoku extra-extra-hard puzzle (though as always, I’d be more than delighted to be proved wrong).

However, my hunch remains that strictly cycling homophonic ciphers may well prove to be surprisingly solvable using deviousness and computer assistance, and I look forward very much to seeing how they fare. 🙂

While thinking about the Scorpion S1 unsolved cipher in the last few days, it struck me that it seemed to be a special kind of homophonic cipher, one where the homophones are used in rigid groups.

That is: whereas the Zodiac Killer’s Z408 cipher cycled (mostly but not always) between sets of homophones by their appearance, it appears that the Scorpion S5 cipher maker instead rigidly cycled between 16 sets of homophones by column. What’s interesting about both cases is that the use pattern gives solvers extra information beyond that which they would have for a homophonic cipher where each homophone instance was chosen completely at random.

Perhaps there’s already a special name for this: but (for now) what I’m calling them is “constrained homophonic ciphers“, insofar as they are homophonic ciphers but where an additional use pattern constrains the specific way that the homophones are chosen.

The question I immediately wanted to know the answer to was this: can we solve these? And what better way to find this out than by issuing a challenge!

Seven Challenge Ciphers

The seven challenge ciphers are downloadable as a single zip file here, or as seven individual CSV files here:
* #1
* #2
* #3
* #4
* #5
* #6
* #7

How The Ciphers Were Made

Unlike normal challenge ciphers, what I’m giving you here (in line with Kerckhoffs’ Principle) is complete disclosure of the cipher system and even the plaintext language.

The cipher system used here is a homophonic cipher with exactly five possible homophones for each plaintext letter BUT where the homophones are strictly selected according to the column number in which they appear in the ciphertext. Each separate CSV uses its own individual key.
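To make the system concrete, here is a minimal Python sketch of one way such a cipher could be built. It is not the code that generated the challenge ciphers: the key shape (one shuffled 0..25 numbering of A-Z per column), the dropping of non-letters, and the names `make_key`, `encipher` and `decipher` are all assumptions made for illustration.

```python
import random
import string

def make_key(rng, n_alphabets=5):
    """Assumed key shape: one shuffled 0..25 numbering of A-Z per alphabet."""
    key = []
    for _ in range(n_alphabets):
        numbers = list(range(26))
        rng.shuffle(numbers)
        key.append(dict(zip(string.ascii_uppercase, numbers)))
    return key

def encipher(plaintext, key):
    """Token = 100 * (column number) + that column's number for the letter."""
    # Assumption: spaces and punctuation are simply dropped.
    letters = [c for c in plaintext.upper() if c in string.ascii_uppercase]
    return [100 * (i % len(key) + 1) + key[i % len(key)][c]
            for i, c in enumerate(letters)]

def decipher(tokens, key):
    """Invert each column's numbering and read the letters back."""
    inverse = [{num: letter for letter, num in alphabet.items()}
               for alphabet in key]
    return "".join(inverse[t // 100 - 1][t % 100] for t in tokens)

rng = random.Random(2017)  # fixed seed so the demo is repeatable
key = make_key(rng)
tokens = encipher("It was the best of times", key)
print(tokens)
print(decipher(tokens, key))
```

The point of the sketch is the constraint itself: the column position alone decides which of the five alphabets is used, so a solver always knows which alphabet a token belongs to.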

The plaintext language is English: they are straightforward sentences taken from a variety of books, and without any sadistic linguistic tricks (i.e. no “SEPIA AARDVARK” or similar to confuse the issue).

The enciphered files are simple CSV (comma-separated values) text files, arranged in rows of five letters at a time, but encoded as decimal numbers. For example, the first (and the longest) challenge cipher (“test1.csv”) begins as follows:

121,213,310,406,516,
108,200,323,416,513,
112,208,308,409,515,

Here, “121,213,310,406,516,” enciphers plaintext letters #1..#5, “108,200,323,416,513,” enciphers plaintext letters #6..#10, and so forth. The first column is numbered in the range 100..125 (i.e. these belong to the 1st homophonic alphabet), the second column 200..225 (i.e. these belong to the 2nd homophonic alphabet), and so forth.
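A minimal sketch of reading this format in Python, using the three sample rows above, might look as follows. The function names are illustrative; for a real file you would pass something like `open("test1.csv", newline="")` instead of the in-memory sample.

```python
import csv
import io

# The first three rows of test1.csv, as quoted above.
SAMPLE = """121,213,310,406,516,
108,200,323,416,513,
112,208,308,409,515,
"""

def read_rows(handle):
    """Read rows of integer tokens, ignoring the empty cell after a trailing comma."""
    rows = []
    for record in csv.reader(handle):
        cells = [int(c) for c in record if c.strip()]
        if cells:
            rows.append(cells)
    return rows

def check_columns(rows):
    """True if every token lies in its column's band (100..125, 200..225, ...)."""
    return all(100 * (col + 1) <= tok <= 100 * (col + 1) + 25
               for row in rows
               for col, tok in enumerate(row))

rows = read_rows(io.StringIO(SAMPLE))
print(rows[0], check_columns(rows))
```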

The start of the message and the end of the message are exactly as you would expect: there is no padding at either end, no embedded key information, just pure ciphertext.

The Rules

Treating this as a massively parallel book search using cloud databases (a) will be treated as cheating, and (b) will spoil it for other people, so please don’t do that. This challenge is purely about finding the limits of cryptanalysis, not about grandstanding with Big Data.

Hence you’ll need to also tell me (broadly) what you did in order to rise to the challenge, so that I can be sure you haven’t solved it through secondary or underhand means.

The Prize

If nobody solves any of the challenge ciphers by the end of 2017, my wallet stays shut.

However, the person (or indeed group) who has the most success decrypting any of these seven challenge ciphers by 31st December 2017 will be the “2017 Cipher Mysteries Cipher Champion“, and will also receive a shockingly generous £10 prize (sent anywhere in the world where PayPal can send money) to spend as they wish.

In the case of multiple entrants solving the same difficulty cipher independently, I’ll award the prize to the first to contact me. In all cases, please leave a comment below.

In all situations, my decision is final, absolute, arbitrary and there is no opportunity for appeal. Just so you know.

PS: any individual (or indeed covert agency) wishing to donate more money to increase the prize fund (i.e. to make a little more cryptanalytic sport of this), please feel free to email me.

Hints and Tips

I suspect that the multiplicity (i.e. the number of different symbols used divided by the length of the ciphertext) will prove to be too high and the ciphertext lengths too short for conventional homophonic decryption programmes, so I expect prospective solvers won’t be able to look to these for any great help.

Similarly, I don’t believe that numerical brute force and/or parallel processing will be sufficient here: all the same, these challenges (if solvable) will probably prove to be things that anyone anywhere can tackle (e.g. through hill-climbing and cleverly exploiting the constraints), not just the NSA, GCHQ or similar with their supercomputers.

For what it’s worth, my best guess right now is that #1 (the longest of the seven ciphertexts) will prove to be solvable… though only just. Even so, I’d be delighted to be proved wrong for any of the others.

Incidentally, I chose the length of the very shortest challenge cipher to broadly match the length of the Scorpion S1 cipher: so even in the (perhaps unlikely) case where all seven of my challenge ciphers get solved, there’ll still be an eighth challenge to direct your clever efforts at. 😉

I’ve blogged a few times about trying to crack the Scorpion Ciphers (a series of apparently homophonic ciphers sent to American crime TV host John Walsh). Most of my effort has been spent on the Scorpion S5 cipher, which (despite having 12 columns) appears to be rigidly cycling between 16 cipher alphabets.

However, it struck me a few days ago that this might also give us a way in to the Scorpion S1 cipher. This is because all the repeats there seem to be at a column distance of 0, 1, 4, 5, and 6, with the overwhelming majority of repeats at column distances 0 and 5. (The only exception is the “backwards L” glyph, which appears in two pairs, one pair at column distance 0 apart, and the other at column distance 5 apart.)

The Slippy S1 Five-Alphabet Hypothesis

Putting the 16-alphabet-cycle from S5 together with the mostly-0-or-5-column-distance observation from S1 yields my “Slippy S1 Five-Alphabet Hypothesis”: that Scorpion S1 was constructed from a cycle of 5 cipher alphabets, where the encipherer always reset to alphabet #1 at the beginning of a line, and usually (but not always) stepped to the next alphabet along with each new column.

So whereas a rigid 5-alphabet cycle (i.e. with no slips) would have a fixed alphabet “ownership” of 1234512345 for each ten-glyph line, I suspect that we can make a “slippy” guess for S1’s cycle ownership, to try to reconstruct where the encipherer slipped from one cycle into the next. My best current set of guesses for S1 is therefore:

1234512235
1234512344
1234412345
1234512345
1234112345
1234551245
2234512345

(Note that I suspect that the “backwards L” shape appears on two alphabets, i.e. once in alphabet #2 and once in alphabet #4, but that this is the only exception to the rule.)

What this means is that each of the five alphabets has only 26 glyphs in it (one for each letter of the alphabet): and so we can tell that if two shapes are numbered as being in the same alphabet, they are very probably two different letters.

Can We Solve This?

53 of S1’s 10 x 7 = 70 glyphs are unique, yielding a high multiplicity of 75.7%. By way of comparison, it would seem that normal (unstructured) homophonic ciphers are only solvable when their multiplicity is around the 20%-25% mark.
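For reference, multiplicity is simply the number of distinct symbols divided by the ciphertext length. A tiny Python sketch (the function name and the toy token list are mine):

```python
def multiplicity(symbols):
    """Distinct symbols divided by total symbols."""
    return len(set(symbols)) / len(symbols)

# S1's figures as quoted above: 53 distinct glyphs out of 70.
print(round(53 / 70, 3))                       # 0.757

# A toy sequence: three distinct tokens among five.
print(multiplicity([102, 204, 324, 102, 204])) # 0.6
```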

However, the question here is whether being able to group the letters into five unique alphabets (even probabilistically) reduces the number of combinations enough to make this genuinely solvable. As normal, pencil-and-paper solvers can make some pretty good guesses, e.g. the “S Λ” pair on lines #3 and #6 probably enciphers “TH”, while any repeated letter stands a good chance of being a normal high-frequency letter such as ETAOINS etc: but computers would do this much better.

My instinct is that this should be a good candidate for hill-climbing: and that the one-glyph-per-letter-per-alphabet constraint will prove reasonably effective. But effective enough? We’ll have to wait and see…

Incidentally, a good sanity check for this Scorpion S1 hypothesis would be to run some “forward simulations” (which is the kind of thing Dave Oranchak has done so much of with the Zodiac Killer Ciphers). By which I mean: if we feed a variety of 70-letter English texts into my best guess set of slippy cycles (i.e. “ITWASTHEBESTOFTIMESI” fed into 1234512235 / 1234512344 would become: “I1 T2 W3 A4 S5 T1 H2 E2 B3 E5 S1 T2 O3 F4 T5 I1 M2 E3 S4 I4”), I predict that the final average multiplicity of the texts will be close to 75%. But I might be wrong!
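A minimal Python sketch of that forward simulation is below. It assumes that each distinct (letter, alphabet) pair stands for one glyph, that only letters are enciphered, and that the seven ownership lines are exactly my best-guess set above; the function names are illustrative.

```python
# Best-guess "slippy" alphabet ownership for S1's seven ten-glyph lines.
SLIPPY = ["1234512235", "1234512344", "1234412345", "1234512345",
          "1234112345", "1234551245", "2234512345"]

def label(plaintext, ownership):
    """Pair each plaintext letter with the alphabet number owning its position."""
    cycle = "".join(ownership)
    letters = [c for c in plaintext.upper() if c.isalpha()]
    # zip() stops at the shorter input, so texts longer than 70 letters are cut off.
    return [c + a for c, a in zip(letters, cycle)]

def multiplicity(symbols):
    """Distinct symbols divided by total symbols."""
    return len(set(symbols)) / len(symbols)

labelled = label("ITWASTHEBESTOFTIMESI", SLIPPY)
print(" ".join(labelled))
print(multiplicity(labelled))   # 18 distinct pairs out of 20 -> 0.9
```

To test the prediction properly you would feed in many different 70-letter English texts and average the resulting multiplicities; the 20-letter example here only confirms that the labelling matches the worked string above.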

I mentioned in a previous post that I thought that the Scorpion S5 cipher’s numerous shape families might offer a backdoor into its cipher system, if they just happened to be elegantly arranged on downward diagonals. I pointed out that if this were correct, the “dice” shape family that appears in columns 1, 3, 4, 8 (twice), 9, 12, 14, 15 would be most likely to have been arranged such that A was 1, C was 3, D was 4 (and so forth).

However, I didn’t actually get so far as calculating the precise probabilities in that post: but now I have (I think).

In my Scorpion spreadsheet, the total probability that a specific family was enciphered as a specific sequential set of letters is calculated as the product of each individual letter’s likelihood. By ‘likelihood’ here, I mean not the probability of that letter occurring randomly (i.e. P, its raw instance probability), but the chances of that occurring exactly N times within a column of letters of height H. And in Excel, you calculate this function using the in-built function ‘BINOMDIST(N, H, P, false)‘. (Note that instead using ‘BINOMDIST(N, H, P, true)’ would calculate the cumulative likelihood of that happening, i.e. the chances of that probability P event happening 0 times up to N times out of a maximum of H times.)

For the raw instance probability values, I used the Scorpion encipherer’s plaintext as a reasonable approximation of the text we are likely to find encrypted inside the S5 cipher. I think there’s a pretty good chance that it will be good enough.

As for the height H: once you have rearranged the message according to the 16 apparent columns of the ciphertext, columns 1 to 4 contain 12 instances each, while for columns 5 to 16, each one contains 11 instances. All of which means that the binomial probability table for N out of 11 looks like this:

binomial-probabilities-11

For example, even though the raw instance probability for ‘E’ is 11.35%, the chance that a given 11-high column of letters will contain exactly one ‘E’ is 37.4271% (or so my spreadsheet says, anyway).
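If you would rather check this outside Excel, the same two quantities can be computed directly from the binomial formula. This is a minimal Python sketch (it needs Python 3.8 or later for `math.comb`); the function names are my own, and they mirror BINOMDIST with the last argument false and true respectively.

```python
from math import comb

def binom_pmf(n, h, p):
    """Chance of exactly n hits in h trials, like BINOMDIST(n, h, p, false)."""
    return comb(h, n) * p ** n * (1 - p) ** (h - n)

def binom_cdf(n, h, p):
    """Chance of 0 up to n hits in h trials, like BINOMDIST(n, h, p, true)."""
    return sum(binom_pmf(k, h, p) for k in range(n + 1))

# Exactly one 'E' in an 11-high column, using the rounded 11.35% figure.
print(binom_pmf(1, 11, 0.1135))
print(binom_cdf(1, 11, 0.1135))
```

With P rounded to 11.35%, the first value comes out at roughly 0.3743, which agrees with the spreadsheet figure to within the rounding of P.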

But rather than limit the calculation only to length-16 families, I added a trick whereby shorter families can be checked against other diagonals in the cipher table. If you use the number 99 as the count for an individual family’s column, the spreadsheet works around it in the calculation, by allowing the shifted alphabet to start not at ‘A’ but at ‘z’ (i.e. ‘A – 1’).

I’ve included 11 shape families from the S5 cipher: if you copy a row from any one of these across to row #33, the spreadsheet will calculate a composite ranking value for each of 28 different offsets in column U (the ‘Result’ column). This is equal to the final probability times a million (or else the numbers would be too small to be practical).

For example, the relative rankings for the dice family are:-

2.737265 A
0.013655 B
0.000000 C
0.000046 D
0.293415 E
0.018483 F
0.093272 G
0.000451 H
0.000078 I
0.000000 J
0.009360 K
0.074230 L

Here, the ranking for ‘A’ (2.737265) is nearly 10x the ranking for second-placed ‘E’ (0.293415), which is essentially what my initial imprecise guess was (thank goodness). 🙂

It’ll take a while to figure out what this all means, but I thought I’d post the basic spreadsheet sooner rather than later. 🙂

I’ve been thinking a little more about how to go about cracking Scorpion Cipher S5.

I mentioned before that I thought that the encipherer might well have started from an elegant-looking 26×16 grid filled with diagonally-downward families of shapes, and that this arrangement might offer codebreakers some additional kind of “spatial logic” to support their efforts that traditional ciphers don’t usually provide.

From the letters that accompanied the ciphertexts, my inference is that the Scorpion is like a smart 12-year-old who has just ‘got’ the elegance of maths: but this leads me to a secondary inference that he/she probably didn’t understand modulo addition, because if he/she did, then we would surely have seen more 16-element shape families in the text.

I’ll explain with the help of a diagram of the kind of 26×16 grid I’m talking about:

scorpion-cipher-26x16-grid

If the encipherer had laid out his/her grid with modulo-26 maths in mind, then 16-element families that start in the orange (top right) area and step diagonally down and to the right (as I predict) should wrap around (modulo 26) to the yellow (bottom left) area. However, I believe that we don’t see nearly enough length-16 shape families to support that grid-filling model.

What I think actually happened was that the encipherer only started length-16 families in the A-K range for alphabet #1, which would have ended on P-Z for alphabet #16. This means, for example, that because the ‘dice’ family (actually, the ‘dots in a square’ family, to be precise) has members in alphabets 1, 3, 4, 8, 9, 12, 14, and 15, we may well be able to directly infer that its very first member (in alphabet #1) is A-L.

Moreover, given that the lowest frequency letters in the encipherer’s accompanying letters are…

k : 0.4%
x : 0.3%
j : 0.1%
z : 0.0%
q : 0.0%

…we may also be able to make a reasonable guess as to which possibilities of A-L are the least likely. For example, because the dice family appears in columns 1/3/4/8/9/12/14/15 (of the 16-column sequence I discussed before), this would map to:

+0 : ACDHILNO --- OK
+1 : BDEIJMOP --- has J, so fairly unlikely
+2 : CEFJKNPQ --- has J, K and Q, so not likely at all
+3 : DFGKLOQR --- has K and Q, so not likely
+4 : EGHLMPRS --- OK
+5 : FHIMNQST --- has Q, so not likely
+6 : GIJNORTU --- has J, so fairly unlikely
+7 : HJKOPSUV --- has J and K, so not likely
+8 : IKLPQTVW --- has K and Q, so not likely
+9 : JLMQRUWX --- has J, Q and X, so not likely at all
+10: KMNRSVXY --- has K and X, so not likely
+11: LNOSTWYZ --- has Z, so not likely

So in fact, I suspect that we already know enough to guess that the dice family members encipher either ACDHILNO or EGHLMPRS (in sequence), which I think isn’t a bad starting point at all.
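The table above is easy to regenerate mechanically. The following Python sketch assumes the same things the reasoning does: the family steps one letter per alphabet with no wrap-around, and the five lowest-frequency letters (K, X, J, Z, Q) are treated as a simple "unlikely" flag rather than as proper probabilities. The names are illustrative.

```python
import string

COLUMNS = [1, 3, 4, 8, 9, 12, 14, 15]   # alphabets where the dice family appears
RARE = set("KXJZQ")                     # the five least frequent letters listed above

def family_letters(first_index, columns):
    """Letters enciphered by the family if its alphabet #1 member is letter first_index."""
    return "".join(string.ascii_uppercase[first_index + c - 1] for c in columns)

# Without wrap-around, the last member (alphabet 15) must not run past Z.
max_start = 25 - (max(COLUMNS) - 1)

ok_offsets = []
for start in range(max_start + 1):
    letters = family_letters(start, COLUMNS)
    rare = sorted(RARE & set(letters))
    note = "has " + ", ".join(rare) if rare else "OK"
    if not rare:
        ok_offsets.append(start)
    print(f"+{start:<2}: {letters} --- {note}")

print(ok_offsets)   # [0, 4]
```

A fuller treatment would weight each candidate by letter probabilities rather than a yes/no flag, but the flag is enough to reproduce the two surviving candidates.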

Finally, I suspect there’s something of a cryptological paradox in play here: the more alphabets are involved, the more spatial structure we have to work with. Hence S5’s 16 alphabets might well make it surprisingly crackable. 🙂

As I reported in a post last year (2014), even though the fifth “Scorpion Cipher” (i.e. ‘S5’) sent to John Walsh is arranged using a 12-column layout, it has a very strong internal 16-column structure. What this means is that every single shape repeat spans a distance that is a multiple of 16: which in turn suggests that the encipherer formed the S5 ciphertext by rigidly cycling through a set of 16 simple substitution cipher alphabets.

If you therefore rearrange S5’s shapes into a 16-column layout and colourize their repeats, you get something like the following (click on it to see a higher resolution version):

S5-rearranged-colourized

Now, 155 out of S5’s 180 characters are unique, giving it a ‘multiplicity’ (155/180) of 86%, which is way too high to be cracked using a conventional homophonic cipher solver. For comparison, the three Beale Ciphers have multiplicities of 57%, 24%, and 43% respectively, while the (solved) Zodiac Z408’s multiplicity is a paltry 13%. In fact, the upper limit on solvability for homophonic ciphertexts seems to be multiplicities of around 20%-25% if you’re lucky (or 10%-15% if you’re not), so S5 would at first sight seem to be waaaaaay out of anybody’s practical range.

But I’m not so sure.

Going through what has been released of the encipherer’s letters that accompanied the ciphertexts, he/she starts by saying:-

This code took a lot of time and effort to develop, in hopes that it will defeat FBI and CIA codebreakers.

Which is ‘kind of reasonable’, though the whole enciphering activity would seem to be somewhat pointless unless the person’s overall aim was to somehow emulate the original Zodiac Killer’s ciphers. In a later letter, the encipherer’s position gets finessed somewhat:

I now realise with many hundeds of hours of […] mindracking experimentation with my complex ciphers that my first one that I sent you [S1] was comparatively simple to my second [S2], third [S3], fourth [S4], and now temporarily final cryptograph system [S5]. I have been encoding useful information for your use and have done it fairly, since all of my ciphers can be decoded simply, once the limited patterns and systems are discovered.

What we learn from this, I think, is that what we are looking at here is not the product of a psychopathic academic cryptographer, but is rather a homebrewed cipher system, based around “limited patterns and systems”. So, a bright kid; probably good at maths; and has perhaps read enough popular cryptography (through and beyond the newspaper accounts of the Zodiac Killer’s ciphers) to avoid clunkingly obvious mistakes.

But the mention of “patterns” makes me suspect that there’s also a little bit of the vanity of the pure mathematician there, intellectual pride that all it would take to “defeat FBI and CIA codebreakers” was “limited patterns and systems”. Hence I think we are likely to be looking at something that is innately very ordered, something that we’ll all kick ourselves for not seeing when it is shown to us in the fullness of time. “What a clever person the Scorpion Cipher maker was“, we’re all supposed to say (according to that fantasy script), “much better at making ciphers than the Zodiac Killer ever was“.

In the case of S5, though, I suspect we now know just about enough to break it, even with its dauntingly high multiplicity.

My first observation is that even though it uses a large number of different shapes, these are drawn from a very much smaller set of shape families: and there may well be some kind of cryptographic relationship between the members of each family to help us:-

S5-shape-families

My second observation is that, with the exception of columns 10 and 11 (which may well be random, or possibly ‘S’ vs ‘T’ in the plaintext), the most frequent symbol in any column is always from a different family from the most frequent shape in any other column. It’s not the strongest of observations, sure, but it’s what leads me to my (grandly titled) S5 Construction Hypothesis.

My S5 Construction Hypothesis

I believe that the encipherer very probably constructed 16 cipher alphabets on gridded paper, within a 26 x 16 or perhaps a 16 x 26 grid. But this is a boring activity, and the encipherer’s text suggests a kind of proto-mathematical desire for elegance, like a smart 12-year-old who has just ‘got’ the whole idea of mathematics. So I hypothesize that the encipherer filled this rectangular grid with families of shapes along downward diagonals, from top-left to bottom-right.

Hence for the sixteen component alphabets, any genuine (as opposed to accidental) family of shapes would step through the alphabets. Here, a family that had a member enciphering A in alphabet #1 would also have a member enciphering B in alphabet #2, and maybe a member enciphering C in alphabet #3 etc.

This suggests a quite different kind of cryptologic solving logic from normal, one that not only offers us mathematical means to reduce the multiplicity (because we can posit connections between letters in different columns, giving us fewer degrees of freedom to steer our way through), but also spatial means to do the same thing.

What I mean by ‘spatial’ here is that if we look at, say, the family of shapes formed of squares with dots in, I think we might be able to assume that not only are these all part of the same family, but also all the missing shapes on columns without a similar family member can be excluded from the search.

That is, if alphabet #1 uses a square with dots in to encipher ‘A’ and alphabet #3 uses a different square with dots in to encipher ‘C’, then we can very probably infer that alphabet #2 uses a square with dots in to encipher ‘B’, even though we cannot actually see it in the ciphertext. Hence this kind of ‘holistic exclusion’ offers a spatial way to help us reduce the search space.

Of course, turning this visuo-spatial hypothesis into an effective computer algorithm will doubtless prove quite tricky. But perhaps it offers a way of making S5’s cryptologic challenge more tractable than it would be if it were a pure homophonic cipher with such a scarily high multiplicity.

Following my recent Scorpion Ciphers post, I’ve put up a permanent reference page on the Scorpion Ciphers and have also tried to contact John Walsh about the as-yet-unreleased other ciphers… so we’ll see how that goes.

Since then, I’ve been working a little more with S5, which has 155 unique symbols out of 180 letters. Because repeated symbols in S5 are always multiples of 16 letters apart, it seems likely to me that this ciphertext was constructed from 16 independent alphabets cycled through in strict sequence. My hope was that this regularity might give us a better chance of cracking S5 than if it were a randomly chosen homophonic cipher.

All the same, this was just a guess: so the first thing I did was come up with a way to test this hypothesis, by writing a short C program to encipher 180-long subsections of the Scorpion’s own plaintext using various numbers of sequential alphabets, to see if this would produce roughly 155 unique symbols.

For each number of alphabets (e.g. 2), I tried (notionally) enciphering every 180-long stretch of the Scorpion’s text, and kept a tally of the minimum number of symbols required (e.g. 37), the maximum number of symbols required (e.g. 44), and the average number of symbols required (e.g. 40).
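The original test was a short C program that isn't reproduced here; the following is only a Python sketch of the same idea. It assumes that non-letters are dropped, that each 180-long window restarts the cycle at the first alphabet, and that one symbol is needed per distinct (letter, alphabet) pair. The function names are mine.

```python
def unique_symbols(window, n_alphabets):
    """Symbols needed to encipher window while strictly cycling n_alphabets alphabets."""
    return len({(ch, i % n_alphabets) for i, ch in enumerate(window)})

def tally(text, n_alphabets, length=180):
    """(min, max, rounded average) of unique symbols over every length-long stretch."""
    letters = [c for c in text.upper() if c.isalpha()]
    if len(letters) < length:
        raise ValueError("text is shorter than the window length")
    counts = [unique_symbols(letters[s:s + length], n_alphabets)
              for s in range(len(letters) - length + 1)]
    return min(counts), max(counts), round(sum(counts) / len(counts))

# Toy input only: a real run would use the Scorpion's own plaintext.
toy = "AB" * 100
for n in (1, 2, 3):
    print(n, tally(toy, n))
```

Run over a real plaintext for each number of alphabets from 1 to 32, this produces a table of the same shape as the results that follow.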

Interestingly, the results weren’t what I expected:-

alphabets = 1, uniques = (19..24) 21
alphabets = 2, uniques = (37..44) 40
alphabets = 3, uniques = (50..61) 55
alphabets = 4, uniques = (60..74) 68
alphabets = 5, uniques = (72..86) 79
alphabets = 6, uniques = (77..97) 87
alphabets = 7, uniques = (88..105) 97
alphabets = 8, uniques = (91..110) 101
alphabets = 9, uniques = (92..116) 106
alphabets = 10, uniques = (104..122) 113
alphabets = 11, uniques = (107..127) 117
alphabets = 12, uniques = (113..136) 122
alphabets = 13, uniques = (113..134) 123
alphabets = 14, uniques = (115..138) 129
alphabets = 15, uniques = (123..146) 132
alphabets = 16, uniques = (120..147) 133
alphabets = 17, uniques = (128..146) 136
alphabets = 18, uniques = (126..151) 137
alphabets = 19, uniques = (128..150) 139
alphabets = 20, uniques = (132..153) 143
alphabets = 21, uniques = (133..159) 144
alphabets = 22, uniques = (131..155) 145
alphabets = 23, uniques = (137..154) 145
alphabets = 24, uniques = (137..157) 147
alphabets = 25, uniques = (139..160) 149
alphabets = 26, uniques = (141..158) 149
alphabets = 27, uniques = (143..163) 152
alphabets = 28, uniques = (143..164) 152
alphabets = 29, uniques = (139..164) 153
alphabets = 30, uniques = (145..164) 154
alphabets = 31, uniques = (143..164) 153
alphabets = 32, uniques = (146..167) 156

That is to say, even though S5 looks as though it is strictly cycling through 16 ciphers, this isn’t consistent with the stats of the Scorpion’s other plaintext (because that is so verbose and repetitive that it would take about 32 alphabets to yield 155 symbols on average).

What I think this is implying is either (a) that the Scorpion’s plaintext is significantly less repetitive than the text of his/her messages, or (b) that the cipher system the Scorpion used also employs an extra layer of compression (e.g. a nomenclatura, using extra tokens for common words such as [THE] and [AND], or even common syllable pairs).

I don’t know… I’ll have to have a further think about this, it isn’t at all obvious what’s going on here.


Update: having scratched my head about this for a few more hours, I don’t feel comfortable with the suggestion that some kind of nomenclatura is involved. Rather, what I suspect now is that what we’re looking at here is not a 16 x 26-token set of ciphers (i.e. A-Z) but a 16 x 36-token set of ciphers (i.e. A-Z plus 0-9), coupled with a slightly less verbose plaintext. Hence my very rough (and admittedly as yet unmodelled) estimate is that roughly 25-35 of the tokens in the plaintext will turn out to be digits.

Unfortunately, I also think that this may have left the text undecryptable, unless there is some additional kind of meta-consistency between shapes across the 16 alphabets (e.g. if all the circle-plus-upright-cross shapes encode the same underlying plaintext token). Oh well!

Back in 2007, John Walsh (the host of “America’s Most Wanted”) announced that he had, since 1991, received a string of disturbing-sounding letters from an individual calling himself / herself “The Scorpion”: many of them had sections or pages that were apparently in cipher. Two of these ciphers were released to the public: these became known as “S1” and “S5”.

In the same year, Christopher Farmer (“President of OPORD Analytical”) announced that he had cracked S1 (which was apparently built on a 10×7 grid):-

scorpion3

Farmer’s claimed solution reads like this:-

baelprovid
edthemwith
newstories
butwhatifi
askjwdoiwa
xrtwbonesa
gezjefxkon

Unfortunately, all the diagrams illustrating Farmer’s ingenious reasoning have withered on the Internetty vine in the years since then (they’re not even in the Wayback Machine, nor anywhere else as far as I can see), which is a bit of a shame.

Even so, this turns out to be an entirely surmountable problem: Farmer’s claimed solution is clearly incorrect, for the simple reason that letters in the ciphertext aren’t consistent in the plaintext. For example, the cipher “K” maps to both ‘a’ and ‘g’, the “backwards-L” maps to ‘w’, ‘w’, and ‘x’, the “backwards-F” maps to both ‘u’ and ‘v’, and so on. At the same time, his claimed plaintext doesn’t really make a lot of sense (“BAEL”… really? I’m not so sure).

It seems likely to me that Farmer guessed that “PROVID” was steganographically hidden in plain sight at the end of the topmost line (and if you squint a bit, you can see why that would be), and then built the rest of his decryption attempt around this hopeful starting point. Moreover, he seems to have guessed that “O” maps to ‘o’, and “backwards-E” maps to ‘e’, which are both pretty peachy assignments. But I don’t buy any of this for a minute: there are way too many degrees of freedom in this S1 cryptogram (roughly half of the individual cipher shapes occur exactly once), and quite a few extra ones in his claimed solution too.

It’s a brave attempt, for sure: but it’s still wrong, whichever way you turn it round.

Other people have tried their hand with S1, though both AlanBenjy in 2009 and Glurk on Dave Oranchak’s site in 2010 pessimistically pointed out that 53 of S1’s 70 symbols are unique, yielding a ‘multiplicity’ a fair way beyond the range within which homophonic cryptograms can practically be solved. Hence I would tend to agree with their assessment that there’s no obvious way that we will solve S1 with what we currently have to hand: in fact, there seems no way to tell whether S1 is a real cipher or a hoax – the only repeating cipher pair is “S A” (i.e. “S Λ”), which could well have happened by pure chance.

The only other Scorpion ciphertext released to the public to date is the 180-character cryptogram known as “S5”:-

scorpion4

Once again, 155 of these 180 symbols are unique, which at first glance would seem to make S5 even less likely to be solved than S1.

But wait! In May 2007, user “Teddy” on the OPORD Analytical forum pointed out that if you transpose S5 from a 12-column arrangement to a 16-column layout, shape repeats only ever occur within a single vertical column. In fact, every single 16-way column except one (column #5) includes one or more repeated shapes.

Radically, this suggests to me that S5 was constructed in a completely different way from conventional homophonic ciphers: specifically, I think that each 16-way column of S5 may well have its own unique cipher alphabet. This would mean that S5 would need to be solved in a completely different manner to the way, say, zkdecrypto works. (I don’t believe S5 was constructed with eight columns, but I thought I ought to mention that that’s a possibility as well, however borderline). Maybe that small insight will be enough to help someone make some headway with S5, who can tell?

The huge shame here is that the other Scorpion ciphers (which to this day have not been released) might well give us additional clues about the inner workings of both S1 and S5. Specifically, if one of the other ciphers happened to have used precisely the same 16-alphabet system as S5, it might well give us enough raw data to crack them both.

Has anyone apart from John Walsh ever seen S2, S3, S4, and S6? Just askin’, just askin’…


Update: Looking again at S1 (while bearing in mind the way S5 seems to have been constructed), I find it hard not to notice that the distances between instance repetitions seem strongly clustered around multiples of 5 (with the only instance not fitting the pattern being the “backwards-L” on row #5):-

+60, +20, +50, +36, +24, +20, +40, +20, +40, +25, +35, +10, +25, +6, +45, +9, +6.

I suspect that this means that the encipherer probably enciphered S1 by cycling through five independent cipher alphabets (largely speaking). This wasn’t a mechanically precise encipherment (whether by accident or by design), but something close enough to one such that almost all the time he/she was no more than a single alphabet ‘off’, one way or the other.
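To put a number on "clustered around multiples of 5", here is a small Python sketch that measures how far each of the listed repeat distances lies from the nearest multiple of five. The `slip` helper is just an illustrative name.

```python
# Repeat distances for S1, as listed above.
DISTANCES = [60, 20, 50, 36, 24, 20, 40, 20, 40, 25, 35, 10, 25, 6, 45, 9, 6]

def slip(distance, period=5):
    """How far a repeat distance lies from the nearest multiple of period."""
    remainder = distance % period
    return min(remainder, period - remainder)

slips = [slip(d) for d in DISTANCES]
print(slips.count(0), "exact,", slips.count(1), "one off,",
      sum(s > 1 for s in slips), "further off")
# 12 exact, 5 one off, 0 further off
```

So of the 17 listed distances, 12 are exact multiples of five and the other five are only one step away, which fits the "no more than a single alphabet off" reading.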

This offers a quite different kind of constraint from normal homophonic cipher searches, and possibly even enough to crack the S1 cipher. After all, we have a fair amount of the Scorpion’s meandering plaintext to use as a statistical model to aim for… 🙂