As is well known, Alexander d’Agapeyeff’s 1939 challenge cipher looks like this:-
75628 28591 62916 48164 91748 58464 74748 28483 81638 18174
74826 26475 83828 49175 74658 37575 75936 36565 81638 17585
75756 46282 92857 46382 75748 38165 81848 56485 64858 56382
72628 36281 81728 16463 75828 16483 63828 58163 63630 47481
91918 46385 84656 48565 62946 26285 91859 17491 72756 46575
71658 36264 74818 28462 82649 18193 65626 48484 91838 57491
81657 27483 83858 28364 62726 26562 83759 27263 82827 27283
82858 47582 81837 28462 82837 58164 75748 58162 92000
Almost all cryptanalyses of this ciphertext start from the reasonable observation that (a) it is dominated by number-pairs of the form [6/7/8/9/0][1/2/3/4/5], and that (b) these pairs have a very strongly language-like distribution:-
** .1 .2 .3 .4 .5
6. _0 17 12 16 11
7. _1 _9 _0 14 17
8. 20 17 15 11 17
9. 12 _3 _2 _1 _0
0. _0 _0 _0 _1 _0
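For anyone who wants to reproduce that table, here is a minimal Python sketch. It assumes the trailing zeros are filler and drops them before splitting the digits into pairs; the variable names are mine, purely for illustration.

```python
from collections import Counter

CIPHERTEXT = """
75628 28591 62916 48164 91748 58464 74748 28483 81638 18174
74826 26475 83828 49175 74658 37575 75936 36565 81638 17585
75756 46282 92857 46382 75748 38165 81848 56485 64858 56382
72628 36281 81728 16463 75828 16483 63828 58163 63630 47481
91918 46385 84656 48565 62946 26285 91859 17491 72756 46575
71658 36264 74818 28462 82649 18193 65626 48484 91838 57491
81657 27483 83858 28364 62726 26562 83759 27263 82827 27283
82858 47582 81837 28462 82837 58164 75748 58162 92000
"""

# Join the five-digit groups, then drop the trailing zeros (assumed filler).
digits = "".join(CIPHERTEXT.split()).rstrip("0")

# Split into two-digit pairs and count each pair.
pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
counts = Counter(pairs)

# Print the 5x5 table in the same layout as above ("_" pads single digits).
print("** " + " ".join(f".{c}" for c in "12345"))
for r in "67890":
    print(f"{r}. " + " ".join(f"{counts[r + c]:_>2}" for c in "12345"))
```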
To simplify discussion (and ignoring the issue of fractionation for the moment), we can assign these structured number pairs to letters in an obvious sort of order, e.g.:-
** .1 .2 .3 .4 .5
6. _A _B _C _D _E
7. _F _G _H _I _J
8. _K _L _M _N _O
9. _P _Q _R _S _T
0. _U _V _W _X _Y
In which case, the rather less verbose version of the same ciphertext would look like this:-
J B L O P B P D K D P I O N D I I L N M
K C K K I I L B D J M L N P J I E M J J
J R C E E K C K J O J J D B L Q O I C L
J I M K E K N O D O D O O C L G B M B K
K G K D C J L K D M C L O K C C C X I K
P P N C O N E D O E B S B B O P O P I P
G J D E J F E M B D I K L N B L D P K R
E B D N N P M O I P K E G I M M O L M D
B G B E B M J Q G C L L G G M L O N J L
K M G N B L M J K D J I O K B Q - - - -
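The substitution above is easy to mechanise. The sketch below is just one way of doing it: `pairs_to_letters` is a name I made up, and anything that does not fit the [6/7/8/9/0][1/2/3/4/5] pattern (such as the zeros at the end) is shown as "-".

```python
ROWS = "67890"  # first digit of each pair, in table order
COLS = "12345"  # second digit of each pair, in table order
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXY"  # 25 letters, filled in row by row


def pairs_to_letters(digit_groups: str) -> str:
    """Map each two-digit pair to its letter in the 5x5 table."""
    digits = "".join(digit_groups.split())
    out = []
    # Step through complete pairs only; a leftover single digit is ignored.
    for i in range(0, len(digits) - 1, 2):
        r, c = digits[i], digits[i + 1]
        if r in ROWS and c in COLS:
            out.append(ALPHABET[ROWS.index(r) * 5 + COLS.index(c)])
        else:
            out.append("-")  # not a structured pair (e.g. "00" filler)
    return "".join(out)


print(pairs_to_letters("75628 28591 62916 48164"))  # JBLOPBPDKD
```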
Cryptanalytically, the problem with this as a ciphertext is that, even discounting the ’00’ filler-style characters at the end, it simply has too many doubled (and indeed tripled) letters to be simple English: 11 doublets, plus 2 additional triplets. Hence the chance of any given letter in this text being followed by itself within this text is 15/195 ~= 7.7%, while the chance of any given letter being followed by itself twice more is 2/194 ~= 1.03%.
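Counting these repeats by eye is error-prone, so here is a small sketch that counts them directly from the letter grid above, reading it as one continuous stream (so repeats that straddle a row break are included). When I run it on the grid exactly as transcribed here, it reports 17 adjacent repeats (13 isolated doublets plus the two triplets JJJ and CCC) rather than 15, so the hand tally quoted above may be worth re-checking; the triple count of 2 agrees.

```python
GRID = """
J B L O P B P D K D P I O N D I I L N M
K C K K I I L B D J M L N P J I E M J J
J R C E E K C K J O J J D B L Q O I C L
J I M K E K N O D O D O O C L G B M B K
K G K D C J L K D M C L O K C C C X I K
P P N C O N E D O E B S B B O P O P I P
G J D E J F E M B D I K L N B L D P K R
E B D N N P M O I P K E G I M M O L M D
B G B E B M J Q G C L L G G M L O N J L
K M G N B L M J K D J I O K B Q - - - -
"""

# Keep letters only: this drops spaces, newlines and the "-" filler cells.
letters = "".join(ch for ch in GRID if ch.isalpha())


def count_repeats(s: str) -> tuple[int, int]:
    """Return (positions where s[i] == s[i+1], positions where s[i] == s[i+1] == s[i+2])."""
    doubles = sum(a == b for a, b in zip(s, s[1:]))
    triples = sum(a == b == c for a, b, c in zip(s, s[1:], s[2:]))
    return doubles, triples


doubles, triples = count_repeats(letters)
n = len(letters)
print(f"letters: {n}")
print(f"doubles: {doubles}/{n - 1} = {doubles / (n - 1):.2%}")
print(f"triples: {triples}/{n - 2} = {triples / (n - 2):.2%}")
```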
According to my spreadsheet, if the letters were jumbled randomly, the chances of the same letter appearing twice in a row would be 7.44% (very slightly less than what we see, but still broadly the same), while the chances of the same letter appearing three times in a row would be 0.594% (quite a lot less).
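For reference, those two spreadsheet figures can be reproduced from the frequency table alone. The sketch below assumes the simplest model, i.e. each letter is drawn independently with probability equal to its observed frequency, so the chance of a double is the sum of p² over all letters and the chance of a triple is the sum of p³.

```python
# Pair counts copied from the frequency table (rows 6/7/8/9/0, columns 1-5).
COUNTS = [
    [0, 17, 12, 16, 11],
    [1, 9, 0, 14, 17],
    [20, 17, 15, 11, 17],
    [12, 3, 2, 1, 0],
    [0, 0, 0, 1, 0],
]

flat = [n for row in COUNTS for n in row]
total = sum(flat)  # number of structured pairs in the ciphertext

# Independent draws: P(double) = sum of p^2, P(triple) = sum of p^3.
p_double = sum((n / total) ** 2 for n in flat)
p_triple = sum((n / total) ** 3 for n in flat)

print(f"expected doubles: {p_double:.2%}")  # 7.44%
print(f"expected triples: {p_triple:.3%}")  # 0.594%
```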
It struck me that these statistics might be what we would expect to see for texts formed of every second or every third letter of English. So I decided to test this notion with some brief tests on Moby Dick:-
Distance – doubles – triples
1 – 3.693% – 0.075%
2 – 4.426% – 0.275%
3 – 5.994% – 0.476%
4 – 6.289% – 0.466%
5 – 6.682% – 0.566%
6 – 6.491% – 0.546%
7 – 6.372% – 0.508%
8 – 6.536% – 0.533%
9 – 6.544% – 0.536%
n – 6.524% – 0.525% (i.e. predicted percentages based purely on frequency counts)
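A test of this kind can be sketched in a few lines of Python. This is not necessarily how the figures above were produced: it assumes that "distance d" means comparing each letter with the letters d and 2d places further on (which is the same as looking for doubles and triples in every d-th letter, across all starting offsets), and that everything except A-Z is thrown away first. The function name `repeat_rates` and the sample sentence are mine.

```python
def repeat_rates(text: str, distance: int) -> tuple[float, float]:
    """Return (double rate, triple rate) for letters `distance` places apart."""
    s = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n2 = len(s) - distance      # positions that have a partner at +d
    n3 = len(s) - 2 * distance  # positions that have partners at +d and +2d
    if n3 <= 0:
        raise ValueError("text too short for this distance")
    doubles = sum(s[i] == s[i + distance] for i in range(n2))
    triples = sum(
        s[i] == s[i + distance] == s[i + 2 * distance] for i in range(n3)
    )
    return doubles / n2, triples / n3


# Tiny demonstration; for a real test, read in a long plaintext instead
# (e.g. a local copy of Moby Dick, filename up to you).
sample = "Call me Ishmael. Some years ago, never mind how long precisely"
for d in range(1, 6):
    dbl, tpl = repeat_rates(sample, d)
    print(f"{d} - {dbl:.3%} - {tpl:.3%}")
```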
Indeed, what we see is that the probability of a triple letter occurring in the actual text starts very low (0.075%), but rises to close to the raw probability (from pure frequency counts) at a distance of about 5 (i.e. A….B….C….D…. etc).
So, comparing the actual triple letter count in the ciphertext with the ciphertext’s raw frequencies would seem to suggest a transposition step of about 2 is active, whereas comparing the double count in the ciphertext with the ciphertext’s raw frequencies would seem to suggest a transposition step of about 5 is active.
Yes, I know that this looks a bit paradoxical: but it is what it is. Still workin’ on it…