Voynich cipher structure... - Cipher Mysteries

What (in my opinion) has scuppered just about every code-breaking assault on the VMs to date has been the faulty presumption that you don’t need to worry about anything so boring as “getting the history right first”, because the statistical analysis fairy will fly in waving her +10 Wand of ANOVA and somehow sort it all out. Errrm… won’t she?

Yet even as early as 1931 John Matthews Manly pointed out that the Voynich Manuscript’s quire numbers were clearly written in a 15th century hand – hence it couldn’t easily post-date 1500. So why do people still persist in using their stats tricks to hunt for complex mechanisms (such as Vigenère polyalphabetic ciphers, and even the dreaded Cardan grille – though personally, I’d rather have a nice mixed grille) that weren’t invented till nearly a century later? Whatever cryptographic arrangement you propose yielded the VMs’ distinctively structured behaviour should be consistent with the limited palette of techniques available to a cipher-maker circa 1450-1500.

In short, history says (whether you choose to listen to it or not) that we shouldn’t hypothesize some single high-powered cipher trick – rather, we need to look for a set of simple techniques consistent with Quattrocento knowledge of cryptography blended together in a devious (and probably subtly deceptive) way.

Even so, 50 years of statistical hackery later, it seems pretty clear that the cryptographic stats fairy hasn’t worked her spell as hoped. But why should her (usually pretty good) magic have failed so abysmally?

Well… if you delve a little deeper into theoretical statistics, you find the theorists’ nimble “get-out clause” – that the presence of so-called confounding factors can disrupt all the neat assumptions underlying stats’ mathematical models. These factors, in a nutshell, are ways in which individual samples within your overall dataset are causally linked together locally in non-obvious ways that make the global numerical tests go wrong… really quite badly wrong, in fact.

Unsurprisingly, I believe that the presence of confounding factors is precisely the reason that the standard statistical toolbox continues to draw so much of a blank with the VMs – that is, the individual letters / glyphs are silently linked together in ways that act as local confounding factors to any global analytical pass. But if so, what exactly are those pesky linkages?

My conclusion is that the bulk of the, errrm, ‘confundity‘ arises because the “vowel” glyphs (a / e / i / o / y) are like many Budapest workers – they only get to pay their rent by holding down two or more different jobs at the same time (though normally only one job at any given moment). For example, I don’t believe that the “o” in “qo” has anything at all to do with the “o”s in “ot”, “or” or “ol” (you can tell this because “qol” only pops up in those later sections where you find free-standing “l” glyphs). Similarly, I don’t think the “a” in “aiin” is in any way linked with the “a” in “al”, or even with the “a” in “aiir” (because their usage and context patterns are so different).

Yet once you fully accept that this is the case, you’re more or less forced to follow a long path of uncomfortable reasoning that leads you right to the doorstep of an even more uncomfortable conclusion: that the cover cipher is largely formed of groups of letters. Which is to say, that a / e / i / o / y have no independent meaning except as part of the letters with which they are immediately paired or grouped, and that the Voynichese CVCV structure (to which linguists have clung for so long through a storm of criticism) is not definitive evidence of the presence of language, but is instead an artefact of the cipher system’s covertext – merely its misleading outer appearance, not its inner structure at all.

Furthermore, the VMs’ cunning encipherer occasionally adds in a “space-transposition cipher” after his “pairification” stage to prevent repetition of the pairs becoming just that little bit too obvious for his liking: for example, on page f15v, the first line’s “ororor” has been turned into “oror or“, while the second line’s “orororor” has been turned into “or or oro r“. To my eyes, the underlying sequential repetitions look very much as if the plaintext contains the oh-so-familiar “MMM”, “CCC”, “XXX” or “III” of Roman numbers, even circa 1450 a pattern so visually obvious to codebreakers that the verbose cipher system needed to be hacked yet further to conceal it.

Hmmm… four repeated Roman number letters, and four pairs in the or / ol / ar / al set apparently to hide repetitions – a coincidence? Or might it be that or / ol / ar / al in some way encipher m / c / x / i (though probably not in that order)? Well… I would be hugely unsurprised if this miniature four-pair cipher system turns out to have been designed specifically for this purpose in an earlier cipher, and that the mechanism was reused and adapted for the later (and far more complex) Voynichese cipher system that evolved out of it – that is, it remained in the encipherer’s personal cipher alphabet rather like a “cryptographic fossil”. (As a general aside, the idea that Voynichese popped into existence fully formed in the shape we now see it makes no engineering sense at all to me – something as tricksy and structurally complex must necessarily have gone through a fair number of stages of evolution en route.)

Incidentally: with spaces included, you see olol (51), arar (47), alal (37) and oror (23) – but take spaces out, and you get olol (186), arar (161), oror (87), alal (60), ololol (13), ararar (7), ororor (1), alalal (1) and the orororor (1) mentioned above. At the same time, remove all the spaces and you see just four instances of okok and two of otot in the entire VMs, i.e. ol is followed by ol more than 50x more often than ok is followed by ok.
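
For anyone who wants to try this kind of count on their own transcription, here is a minimal Python sketch. It rests on assumptions: the text is taken to be an EVA-style transcription with one manuscript line per text line, and I have guessed at a counting convention (each maximal run of a repeated pair is tallied once, keyed by how many times the pair repeats). The function name count_pair_runs is purely illustrative, and the figures it produces will depend on which transcription you feed it.

```python
import re
from collections import Counter


def count_pair_runs(text, pair, strip_spaces=False):
    """Tally maximal runs of `pair` repeated two or more times.

    Returns a Counter mapping repeat count -> number of runs, e.g. a
    single "ororor" contributes one entry under key 3.
    Assumption: spaces are the only word separator; newlines are kept,
    so runs never span two lines.
    """
    if strip_spaces:
        text = text.replace(" ", "")
    runs = Counter()
    # (?:or){2,} greedily matches the longest run starting at each position
    for match in re.finditer(f"(?:{re.escape(pair)}){{2,}}", text):
        runs[len(match.group(0)) // len(pair)] += 1
    return runs


# The two f15v fragments quoted earlier, as they appear with spaces
sample = "oror or\nor or oro r"
print(count_pair_runs(sample, "or"))
print(count_pair_runs(sample, "or", strip_spaces=True))
```

With spaces kept, only the "oror" on the first line registers as a run; with spaces stripped, the two lines become runs of three and four pairs respectively.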

To be sure, this does not explain all the behavioural properties of Voynichese: ultimately, repetitive phrases like “qokedy qokedy dal qokedy qokedy” (and “ur so qokedy daiin, m8 lol“?) are dancing to a quite different underlying beat. Whatever kind of token gallows characters turn out to encipher, I think it is extremely unlikely that they are a single letter substitution – some other kind of enciphering mechanism is going on there (I suspect some kind of in-page transposition cipher). Similarly, I’m pretty sure that “qo”, “d”, and “y” encipher tokens of a quite different nature again (here, I suspect that they match the three basic abbreviatory shorthand marks that were in active use in the Quattrocento, i.e. “qo” => subscriptio, “d” => superscriptio [macron], and “y” => truncatio).

Arrange all these smaller systems together, and I think that what is revealed is a multi-stage cipher architecture looking broadly like this:-

[Figure: vmsciphersystem-v001, diagram of the proposed multi-stage cipher architecture]

Now, I’ve had flak from a number of people over the years (ummm… mainly Elmar, truth be told, bless ‘im) who comment that this kind of arrangement seems far too complex to be right. The problem I have is how to convince such naysayers that what I present here is the end result of a basically sensible methodology – that of decomposing the heavily-structured nature of Voynichese into its constituent pieces, each of which is consistent with the (relatively low) level of cryptographic technique available before 1500. That is, the real cleverness we are up against here isn’t one of fiendish mathematical complexity, but of cunning arrangement of simple pieces – not of Albertian innovative technique, but of Filaretian innovative architecture.

All the same, even the best-designed house needs walls and a roof – and here, the main cipher properties arise from the use of verbose cipher to make a language-like covertext. Get around that, and I think the rest of the structure’s secrets may very well yield to the repetitive numerical prodding that is statistical analysis. Here’s how to do it:-

(1) write a simple text filter to undo the space transposition cipher:-
–> (a) get rid of any spaces between [o | a] and [l | r] – i.e. transform “oro ror” into “ororor”
–> (b) get rid of any spaces between [or | ol | ar | al] and [or | ol | ar | al] – i.e. transform “or al or” into “oralor”
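
Rules (a) and (b) can be sketched as two regular-expression passes in Python. This is only one possible reading of the rules: it assumes an EVA-style transcription in which a single space is the only word separator, and the function name undo_space_transposition is made up for the example.

```python
import re


def undo_space_transposition(line):
    """Apply rules (a) and (b) to one line of transcribed text.

    Assumption: words are separated by single spaces.
    """
    # (a) drop a space that splits a pair: [o|a] <space> [l|r]
    line = re.sub(r"(?<=[oa]) (?=[lr])", "", line)
    # (b) drop a space between two complete pairs; [oa][lr] covers
    #     exactly or / ol / ar / al
    line = re.sub(r"(?<=[oa][lr]) (?=[oa][lr])", "", line)
    return line


print(undo_space_transposition("oro ror"))      # ororor
print(undo_space_transposition("or al or"))     # oralor
print(undo_space_transposition("or or oro r"))  # orororor
```

Running (a) before (b) matters: (a) can rebuild a pair that (b) then needs to see whole, as in the “or or oro r” example from f15v. Note that rule (b) as stated will also join a word ending in one of the four pairs to a following word that begins with one.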

(2) tokenize the text into the following units, making sure that you tokenize “qo” before anything else (i.e. “qok” must tokenize to “qo” + “k”, and not to “q” + “ok”). Of course, I may have got the precise details very slightly wrong, but I’m pretty sure that this is not less than 90% of the way there.

qo
eeee, eee, ee, ch, sh
ok, ot, of, op, yk, yt, yf, yp
ockh, octh, ocfh, ocph, yckh, ycth, ycfh, ycph
ol, or, al, ar, an, am
air, aiir, aiiir, aiiiir
ain, aiin, aiiin, aiiiin
aim, aiim, aiiim, aiiiim
ckh, cth, cfh, cph
eo, od  <--- these two are the wobbly ones that may need further thought!

Note that a number of additional single-letter tokens get left over (most notably s, d, y, l, k, t, f, p), but these are not paired, so that’s OK. My guess is that any left-over “o” and “a” characters are probably nulls or pen-slips.
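
A hedged sketch of the tokenizer in Python follows. The token table is copied from the list above; everything else is my assumption about how to apply it: a left-to-right scan, “qo” tried first, then the longest candidate token, with any unmatched glyph emitted as a single-letter token. Where two tokens could overlap (the “wobbly” eo / od cases in particular), this scan simply takes whichever starts first, which may or may not be the right call.

```python
import re

# Token table from the list above ("qo" is handled separately, first)
TOKENS = (
    "eeee eee ee ch sh "
    "ok ot of op yk yt yf yp "
    "ockh octh ocfh ocph yckh ycth ycfh ycph "
    "ol or al ar an am "
    "air aiir aiiir aiiiir "
    "ain aiin aiiin aiiiin "
    "aim aiim aiiim aiiiim "
    "ckh cth cfh cph "
    "eo od"
).split()

# "qo" first, then longest tokens first, then any single non-space glyph
_ordered = ["qo"] + sorted(TOKENS, key=len, reverse=True)
TOKEN_RE = re.compile("|".join(map(re.escape, _ordered)) + r"|\S")


def tokenize(text):
    """Split transcribed text into tokens, ignoring whitespace."""
    return TOKEN_RE.findall(text)


print(tokenize("qok"))     # ['qo', 'k']
print(tokenize("qokedy"))  # ['qo', 'k', 'e', 'd', 'y']
print(tokenize("daiin"))   # ['d', 'aiin']
```

Because whitespace is skipped, the same function works on a single word or on a whole line that has already been through the stage 1 filter.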

(3) perform your statistical analyses on the set of (roughly 50) tokens output by stage 2.
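
Which statistics to run is left open above, so the following Python sketch is only a suggestion of a starting point: first-order Shannon entropy over the token stream plus a tally of adjacent token pairs. The function names are illustrative, and the input is assumed to be a plain list of token strings such as stage 2 would produce.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """First-order Shannon entropy of the token stream, in bits per token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def bigram_counts(tokens):
    """Count each adjacent (token, next_token) pair."""
    return Counter(zip(tokens, tokens[1:]))


# Toy token stream, for illustration only (not real manuscript data)
stream = ["qo", "k", "e", "d", "y", "qo", "k", "e", "d", "y", "d", "al"]
print(token_entropy(stream))
print(bigram_counts(stream).most_common(3))
```

Comparing these figures against the same measures taken over raw glyphs would be one way to check whether the pairing step really does remove some of the local linkage.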

Note that I don’t believe that this somehow “solves” Voynichese in a single step (because nothing connected with the VMs has ever been that simple before, and that is unlikely to change now). However, I do believe that removing the pairs like this removes arguably the most problematic set of confounding factors from the text, and hence that doing so should allow the productive use of statistics to crack the rest of the cryptographic problems associated with Voynichese.

4 thoughts on “Voynich cipher structure…”

Elmar on August 27, 2009 at 11:04 am said:

I think the failure of statistical attacks on the VM may be explained on a much simpler level — Without knowing which glyphs are different and which are really the same, and which glyphs “cooperate” to constitute a single ciphertext letter, there are way too many uncertainties to unleash the hounds of stochastics to the limited sample size of some 110k glyphs.

(The fact that there are at least two different enciphering systems at play, perhaps more, perhaps diffusely defined “dialects” rather than discrete “languages”, doesn’t help either.)

I fully agree with you that the solution ultimately will turn out to be an embarrassingly simple scheme (as opposed to something impenetrable and incredibly cunning) because we all failed to see the forest for the trees.

It’s interesting to note that you end up with some 50 ciphertext tokens, much like Robert Firth did. That’s too close to twice the number of latin letters in the alphabet to simply ignore, IMHO…

nickpelling on August 27, 2009 at 12:01 pm said:

I think that, for the greatest part, it is clear what letters are supposed to resemble: “vowels” are supposed to resemble vowels, word-terminal “y” is supposed to resemble the “-us” sign, and “aiir” / “aiiv” are supposed to resemble page references. My contention is that those resemblances are designed to mislead us, not to inform us: and that the real unit of analysis lies at a level upwards, i.e. groups of glyphs that encipher for a single token. I know you’ve long wondered whether the text should instead be fractionated (subdivided beneath the level of the glyphs) to be understood, but I would argue that “swimming towards the strokes” is going in the wrong direction – we need to find units with more information in them than an EVA stream does, not less.

As for the roughly 50 units: I would predict that there will turn out to be two parallel alphabets / mechanisms at play here, one for static enciphering (i.e. a verbose monoalphabetic substitution cipher, which predominates in labels, for example) and another for dynamic enciphering (i.e. some kind of transposition and/or stateful system about which we currently have no real idea, but plenty of ill-formed suspicions). But right now that’s as far as it goes, I guess. Oh well! 😮

PS: did you enjoy “De Aqua”? 🙂

Elmar on August 28, 2009 at 8:26 am said:

Hi Nick,

I agree with you regarding the misleading “clues”, but actually I meant something different — like, are the four different gallows really “different”, or are they just variations of the same glyph? Is EVA r the same as EVA s? What’s the deal with the ch characters? If we really *knew* the ciphertext alphabet, we *might* be able to do meaningful statistics on it…

As regards my personal pet theory about the strokes, it’s actually the other way around: I think that a *sequence* of several ciphertext letters may combine to represent a *single* plaintext letter. (Namely, each ciphertext letter represents one of the pen strokes required to write the corresponding plaintext letter.) Thus, in my little world, the 50 odd ciphertext “syllables” you found might well represent two alphabets of plaintext letters. (Upper/lower case letters, randomly alternating…?)

PS: Haven’t checked out “De Aqua” yet. Last night was devoted to Quatermass… 😉 (The 50’s original.)

Mark Knowles on January 28, 2023 at 2:18 pm said:

Nick,

Another interesting post. I think when the truth emerges to the full light of day people will recognise how insightful your many writings were.