Heteroscedasticity and ciphers...

Heteroscedasticity – now there’s a word you don’t see very often (thanks to Rosco Paterson for kindly plonking it in my path). Which is a pity, because it’s a particularly useful concept that might help us crack several longstanding cipher mysteries.

The idea behind it is not too far from the old joke about the statistician with his feet in the oven and his head in the fridge, who, on average, felt very comfortable. Strictly speaking, a set of numbers is heteroscedastic if it contains different (‘hetero-’) subgroups whose scatter (‘-scedastic’) differs, i.e. subgroups with unequal variances. Here I’m using the word a little more loosely, for data that mixes distinct subgroups such that (for example) the overall average falls between the groups. As a result, looking to that average for enlightenment as to the nature of those separate subgroups is probably not going to do you much good.

Perhaps unsurprisingly, it turns out that a lot of statistical properties implicitly rely on the data to be analyzed not having this property. That is, for data with multiple modes or states, the consequent heteroscedasticity is likely to mess up your statistical reasoning. Though you’ll still get plausible-looking results, there’s a high chance they’ll be of no practical use. So for cipher systems in general, any hint of multimodality should be a heteroscedastic alarm bell, a warning that your statistical toolbox may be as much use as a wet fish for tightening a bolt.

Plenty of Voynich Manuscript (‘VMs’) researchers will be sagely nodding their heads at this point, because they know all too well that the plethora of statistical analyses performed so far on it has failed to yield much of consequence. Could this be because its ‘Voynichese’ text heteroscedastically ‘hops’ between states? Cipher Mysteries regulars will know I’ve long suspected there’s some kind of state machine at play, but I’ve yet to see any full-on analysis of the VMs with this in mind.

Historically, the first proper ciphering state machine was Alberti’s 1465 cipher disk. He placed one alphabet on a stator (a static disk) and another on a rotor (a rotating disk), rotating the latter according to some system pre-agreed between encipherer and decipherer, e.g. rotating it after every couple of words, or after every vowel, etc.
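
To make the ‘state machine’ idea concrete, here is a minimal Python sketch of a disk cipher in the Albertian spirit. It is a simplification rather than a reconstruction of Alberti’s actual disk: the 26-letter alphabets, the scrambled ROTOR ordering, and the rule of turning the rotor by `step` places after every word are all assumptions chosen purely for illustration.

```python
import string

STATOR = string.ascii_uppercase        # fixed outer alphabet (plaintext side)
ROTOR = "QWERTYUIOPASDFGHJKLZXCVBNM"   # hypothetical mixed inner alphabet

def encipher(plaintext, start=0, step=1):
    """Encipher A-Z words, turning the rotor `step` places after each word."""
    offset = start                      # the hidden state of the machine
    out = []
    for word in plaintext.upper().split():
        out.append("".join(ROTOR[(STATOR.index(ch) + offset) % 26] for ch in word))
        offset = (offset + step) % 26   # pre-agreed rule: rotate after a word
    return " ".join(out)

def decipher(ciphertext, start=0, step=1):
    """Invert encipher() by tracking the same rotor offsets."""
    offset = start
    out = []
    for word in ciphertext.split():
        out.append("".join(STATOR[(ROTOR.index(ch) - offset) % 26] for ch in word))
        offset = (offset + step) % 26
    return " ".join(out)
```

With these assumed settings, `encipher("AA AA")` gives `QQ WW`: the same plaintext letter comes out differently once the rotor has turned, which is just the kind of hidden state that can flatten whole-text statistics.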

Even if you don’t happen to buy in to my Averlino hypothesis (but don’t worry if you don’t, it’s not mandatory here), 1465 isn’t hugely far from the Voynich Manuscript’s vellum radiocarbon dating. It could well be that state machine cryptography was in the air: perhaps Alberti was building on an earlier, more experimental cipher he had heard of, but with an overtly Florentine, Brunelleschian clockwork gadget twist.

As an aside, there are plenty of intellectual historians who have suggested that the roots of Alberti’s cipher disk lie (for example) in Ramon Llull’s circular diagrams and conceptual machines: in a way, one might argue that all Alberti did was collide Llull’s stuff with the more hands-on Quattrocento Florentine machine-building tradition, and say “Ta-da!” 🙂

All the same, we do know that the Voynich Manuscript’s cipher is not an Albertian polyalphabetic cipher: but if it is multimodal, how should we look for evidence of it?

A few years ago, when my friend Glen Claston was laboriously making his own transcription of the VMs, he informally noticed that certain groups of symbols and even words seemed to phase in and out, as if there was a higher-level structure underlying its text. Was he glimpsing raw heteroscedasticity, arising from some kind of state machine clustering? For now this is just his cryptological instinct, not a rigorous proof: and it is entirely possible that he was influenced by the structure of Leonell Strong’s claimed decryption (which introduced a new cipher alphabet every few lines). Despite all that, I’m happy to take his observation at face value, and to accept that Voynichese may well be built around a higher-level internal state structure that readily confounds our statistical cryptanalyses.

So, the big question here is whether it is possible to design tests to explicitly detect multimodality ‘blind’. The problem is that even though this is done a lot in econometrics (there was even a Nobel Prize for Economics awarded for work to do with heteroscedasticity), economic time series are surely quite a different kettle of monkeys to ciphertexts. Perhaps there’s a whole cryptanalytical literature on detecting heteroscedasticity: if you happen to know of one, please leave a comment here!
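
One crude way to go looking ‘blind’ might be to slide a window along the text and measure how much the symbol-frequency profile changes from each window to the next. The Python sketch below does this with the total variation distance; the function names and the choice of distance are my own assumptions, not an established cryptanalytic test.

```python
from collections import Counter

def total_variation(a, b):
    """Half the L1 distance between the symbol-frequency profiles of a and b."""
    ca, cb = Counter(a), Counter(b)
    return 0.5 * sum(abs(ca[s] / len(a) - cb[s] / len(b)) for s in set(ca) | set(cb))

def window_drift(symbols, width):
    """Distance between each non-overlapping window and the window after it."""
    return [
        total_variation(symbols[i:i + width], symbols[i + width:i + 2 * width])
        for i in range(0, len(symbols) - 2 * width + 1, width)
    ]

# Toy text that silently switches 'state' halfway through.
toy = "ABAB" * 10 + "CDCD" * 10
print(window_drift(toy, 20))
```

A spike in the output marks a place where the text’s statistics jump. On real ciphertext, short windows will give noisy distances even when nothing is going on, so any spike would need comparing against the same measurement on shuffled copies of the text before reading much into it.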

I don’t know what the answer to all this is: it’s something I’ve been thinking about for a while, without really being able to resolve it to my own satisfaction. Make of it what you will!

At the same time, there’s also a spooky echo with the Zodiac Killer’s Z340 cipher here. I recently wrote some code to test for the presence of homophone cycles in Z340, and from the results I got I strongly suspect that its top half employs quite a different cipher to the bottom – the homophone cycles my code suggested for the two halves were extremely different.
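
For readers who haven’t met homophone cycles: if an encipherer works through a letter’s homophones in a fixed order, then picking out just those symbols from the ciphertext should show them strictly taking turns. The Python sketch below scores symbol pairs on that basis. It is a hedged illustration of the general idea only, not the code I actually ran, and the function names and the `min_count` threshold are made up for the example.

```python
from itertools import combinations

def alternation_score(cipher, a, b):
    """Fraction of consecutive occurrences of a or b that switch symbol."""
    seq = [s for s in cipher if s in (a, b)]
    if len(seq) < 2:
        return 0.0
    switches = sum(x != y for x, y in zip(seq, seq[1:]))
    return switches / (len(seq) - 1)

def candidate_cycles(cipher, min_count=3):
    """Rank pairs of reasonably frequent symbols by how strictly they alternate."""
    counts = {s: cipher.count(s) for s in set(cipher)}
    frequent = sorted(s for s, n in counts.items() if n >= min_count)
    scored = [(alternation_score(cipher, a, b), a, b)
              for a, b in combinations(frequent, 2)]
    return sorted(scored, reverse=True)
```

Running `candidate_cycles` separately on the first and second halves of a transcription, then comparing the top-ranked pairs, is one simple way to see whether the two halves suggest the same cycles. Bear in mind that rare symbols can alternate perfectly by pure chance, which is why the sketch ignores anything seen fewer than `min_count` times.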

Hence it could well be that most statistical analyses of Z340 done to date have failed to produce useful results because of the confoundingly heteroscedastic shadow cast by merging (for example) two distinct halves into a single ciphertext. How could we definitively test whether Z340 is formed of two halves? Something else to think about! 🙂
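
I can’t offer a definitive test, but a permutation test is one modest starting point: measure how different the symbol counts of the two halves are, then see how often a random shuffle of the same symbols produces a split at least as lopsided. The Python sketch below uses a hand-rolled chi-square statistic; the function names, trial count and seed are assumptions for illustration.

```python
import random
from collections import Counter

def chi_square(a, b):
    """Chi-square homogeneity statistic for the symbol counts of a and b."""
    ca, cb = Counter(a), Counter(b)
    n = len(a) + len(b)
    stat = 0.0
    for s in set(ca) | set(cb):
        total = ca[s] + cb[s]
        for count, size in ((ca[s], len(a)), (cb[s], len(b))):
            expected = total * size / n   # count expected if the halves matched
            stat += (count - expected) ** 2 / expected
    return stat

def split_p_value(symbols, split, trials=2000, seed=0):
    """Share of random shuffles whose split is at least as lopsided as the real one."""
    symbols = list(symbols)               # work on a copy; split must be inside the text
    observed = chi_square(symbols[:split], symbols[split:])
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        rng.shuffle(symbols)
        if chi_square(symbols[:split], symbols[split:]) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)      # add-one correction avoids reporting zero
```

A small p-value says only that the two halves use their symbols in noticeably different proportions. That would be consistent with two different ciphers, but it could equally come from a change in the underlying plaintext, so it is evidence to weigh rather than proof.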

9 thoughts on “Heteroscedasticity and ciphers…”

Robin on November 2, 2011 at 10:03 am said:

Interesting stuff Nick… of course, you must be thinking of the ‘handedness’ of the VMs as defining at least two groups to contribute to a heteroscedastic (a bugger to type, never mind enunciate!) quality. I guess one way to shed light on this would be to take sections of similar ‘hands’ and show that their defining statistical features were sufficiently different from the other sections, e.g. predominant qo- prefixes, predominant -in suffixes. Surely a lot of the early (post-internet) work by Stolfi/Guy would help on this point?
nickpelling on November 2, 2011 at 10:54 am said:

Robin: no, I’m actually talking about something far more fine-grained than Currier’s languages and hands, an aspect of Voynichese that ebbs and flows even within a paragraph, even within a line. There are (of course) significant differences between Currier A and Currier B (and even between labels and non-labels), but I suspect the Voynich’s heteroscedasticity [type it a few more times, I swear it gets easier] is rather more pervasive.
Knox on November 2, 2011 at 10:09 pm said:

You can see variable proximities of similar words with a display of the similarity of each word to all others in a 2D matrix. One- and two-letter words are a problem. We might substitute “u” for space and divide the transcription into equal length strings of average word length or longer. EVA-h, at a minimum, should be deleted. Dice’s coefficient is one of many ways to measure similarity of letter strings. We won’t have to tell anyone that, after we reach a higher level of scedastic consciousness, we are still confounded.
Knox on November 2, 2011 at 10:25 pm said:

Re. Z340, rambling without checking, perhaps the writer used most of his substitutions in the first half and began to repeat them in the second half. Possibly there are different transpositions in the halves — or quarters.
nickpelling on November 2, 2011 at 11:00 pm said:

Knox: ain’t that the (heteroscedastic) truth! 🙂
nickpelling on November 2, 2011 at 11:04 pm said:

Knox: Dave Oranchak has looked at all kinds of curious transpositions for the Z340, with (I think it’s fair to say) only limited success so far. Really, I’m wondering if there’s some kind of fancy-pants way to automatically detect heteroscedasticity within a ciphertext, something that might help transform our wonky understanding of mixed-up ciphers. 🙂
Don Latham on November 3, 2011 at 6:22 am said:

Perhaps what is needed are techniques for analyzing chaotic data, or nonstationary series (not necessarily the same thing). I know Fourier analysis etc. have been applied, but I do not recall if such things as a Poincaré map etc. were tried.
I do not have encyclopedic recall as to what was tried.
Don
BrookeStephen on August 1, 2012 at 2:01 am said:

From my analysis, if it were two-parted, you’d have new picts showing up mid-message, and you don’t!!
Diane on July 1, 2013 at 9:25 am said:

Sorry to be dull, but I keep having that sculpture of a-certain-Franciscan come to mind whenever Alberti’s cipher is mentioned. I mean, given indications in the imagery that some of the VMs has come through the Mediterranean portolan-and-cartography stream.

I mean this one, of course
http://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Roger-bacon-statue.jpg/220px-Roger-bacon-statue.jpg