I’ve been trying to break the d’Agapeyeff challenge cipher this week, a process that I (along with several other cipher commentators, although opinions differ etc etc) strongly suspect will involve solving a 14×14 transposition cipher and a substitution cipher simultaneously.
A plausible-sounding way to try to do this would be to model the distribution of digraph frequency counts in English texts, and then for a given transposition compare an ordered table of its digraph frequency counts against that model. However, when I tried this with some test text (taken from d’Agapeyeff’s book), the English digraph frequency values given on the Internet weren’t even close.
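A minimal sketch of that comparison, under some loud assumptions: because the substitution is unknown, only the shape of the ranked digraph distribution can be compared, so the letters themselves are thrown away. The function names (`ordered_profile`, `profile_distance`) and the sum-of-squared-differences score are illustrative choices for this sketch, not anything taken from the original programme.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: sorts doubles into descending order. */
static int cmp_desc(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x < y) - (x > y);
}

/* Turn a 26x26 table of digraph counts into a descending list of
   percentages. Which letters each entry came from is deliberately
   discarded: under an unknown substitution only the ranked shape
   of the distribution is comparable. */
void ordered_profile(long counts[26][26], long total, double profile[676])
{
    assert(total > 0);
    for (int i = 0; i < 26; i++)
        for (int j = 0; j < 26; j++)
            profile[i * 26 + j] = 100.0 * (double)counts[i][j] / (double)total;
    qsort(profile, 676, sizeof(double), cmp_desc);
}

/* Hypothetical score: sum of squared differences between the top n
   ranked percentages of a candidate and of the model.
   Lower means the candidate's shape is closer to the model's. */
double profile_distance(const double *candidate, const double *model, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = candidate[i] - model[i];
        sum += d * d;
    }
    return sum;
}
```

For each candidate transposition, the idea would be to count its digraphs, build its ordered profile, and keep the transpositions with the smallest distance to a model array of ranked percentages. Whether this score actually separates right from wrong transpositions on a text this short is untested here.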
I initially looked at getting a corpus of British English text to generate a proper digraph frequency table, but that proved to be difficult and expensive, with the bother of licences and licence fees to deal with. But then I thought… why not use d’Agapeyeff’s book “Codes and Ciphers” itself as the corpus? Sure, it’s on a much smaller scale, but it would surely be more statistically representative of the cryptogram’s plaintext than the complete works of Shakespeare (which are often included in English corpora, presumably on the principle of what-the-heck-let’s-throw-it-all-in-can-it-really-hurt?).
Even though the book’s text looked nice and clean to my eye, OCR’ing it turned out to be completely unsatisfactory: and so I was delighted to find a page put up by regular Cipher Mysteries commenter Menno Knul containing a lot of text from “Codes and Ciphers” (thanks Menno!). After a bit of tweaking (fixing some typos, removing foreign language quotes, removing confusing cipher / code passages, etc), I then ended up with a reasonably workable d’Agapeyeff mini-corpus to plug into a trivial C digraph-counting programme.
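For anyone wanting to repeat the exercise, here is a rough sketch of what such a digraph counter might look like. It is not the original programme: the function name `count_digraphs` is made up for illustration, and it assumes ASCII text and that everything other than A-Z/a-z is dropped before pairing, so digraphs run across word boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count letter digraphs in `text`, skipping every character that is
   not an ASCII letter (spaces, punctuation, digits). Because the
   skipped characters are simply ignored, "the end" contributes
   TH, HE, EE, EN and ND.
   counts[a][b] receives the tally for letter a followed by letter b
   (0 = A ... 25 = Z). Returns the total number of digraphs counted. */
long count_digraphs(const char *text, long counts[26][26])
{
    long total = 0;
    int prev = -1;              /* index of the previous letter, -1 if none yet */

    assert(text != NULL && counts != NULL);
    memset(counts, 0, 26 * 26 * sizeof(long));

    for (const char *p = text; *p != '\0'; p++) {
        int cur;
        if (*p >= 'A' && *p <= 'Z')
            cur = *p - 'A';
        else if (*p >= 'a' && *p <= 'z')
            cur = *p - 'a';     /* fold lower case onto upper case */
        else
            continue;           /* drop anything that is not a letter */

        if (prev >= 0) {
            counts[prev][cur]++;
            total++;
        }
        prev = cur;
    }
    return total;
}
```

A digraph's percentage is then 100.0 * counts[a][b] / total, and sorting the 676 entries in descending order gives a ranked table.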
So, here are d’Agapeyeff’s top 50 digraphs from the text of “Codes and Ciphers” (with spaces, punctuation and numbers removed), together with their frequency percentages in descending order. I’ll be using this table before very long to try to break his cipher: fingers crossed it’ll do the trick!
TH,3.23744%
HE,2.80072%
IN,2.03171%
ER,1.89246%
AN,1.50321%
ES,1.41460%
RE,1.39245%
ON,1.21523%
NT,1.20257%
ED,1.20257%
ST,1.19624%
EN,1.14561%
SE,1.10447%
EA,1.08548%
TE,1.05383%
TI,1.04750%
ET,1.02851%
ND,1.01269%
IS,0.99370%
OF,0.98104%
TO,0.95889%
OR,0.94940%
AT,0.92725%
AS,0.92725%
IT,0.87978%
HI,0.83547%
LE,0.82598%
NG,0.81648%
AL,0.81648%
HA,0.80699%
AR,0.80699%
SA,0.73104%
SI,0.71838%
VE,0.70255%
RI,0.69623%
CO,0.69306%
SO,0.68673%
ME,0.67724%
EC,0.67407%
DE,0.66774%
RA,0.60129%
RS,0.59496%
RO,0.59179%
DI,0.59179%
TT,0.58546%
OU,0.58546%
TA,0.58230%
BE,0.57597%
US,0.54432%
IC,0.52850%
Nice work thus far, Nick. I often build my own distributions using Project Gutenberg. I know some out there will build tables using Bible frequencies or something non-standard like Huck Finn.
Nick, Jim,
For your convenience here follows an alphabetical list of the digraphs:
ab ad ah al an ar as at
be ge he me we
co
de di
do go lo no of on or so to wo
ea ec ed em en er es et
ha he hi
ic if in is it
le
me
nd ng nt
of on or ou
ra re ri ro rs
sa se si so st
ta te th ti to tt
up us ut
ve
by my fy ye
The list includes the ones by d’Agapeyeff (33), Wikipedia (28) and Nick Pelling (50).