You might think I’d be pleased by the appearance of another Voynich statistics study (Voynich Manuscript: word vectors and t-SNE visualization of some patterns), courtesy of those well-known peer-reviewed online journals Reddit and Hacker News. [*] After all, statistical experiments are – if carefully planned and executed – beyond all reproach, surely?
But there is a big problem (arguably a meta-problem) with this: and it’s one that’s been around for a very long time.
Even back in 1962, Elizebeth Friedman – having been a top US Government code-breaker for several decades – was able to note that all attempts to decrypt the Voynich Manuscript as if it were a simple language or single-substitution alphabet were “doomed to utter frustration”. That is, if you wind the clock back half a century from the present day, it was already clear then that Voynichese’s curious lack of flatness was strongly incompatible with:
* natural languages
* exotic languages
* lost languages
* monoalphabetic (simple) substitution ciphers, and even
* straightforward hoaxes
Unfortunately, the primary assumption of flatness is precisely the starting point of a large number of statistical studies carried out on the Voynichese text ever since.
Why Is Voynichese Not Flat?
A long succession of (actually pretty good) past statistical studies has revealed that Voynichese has an abundance of mechanisms that give it internal structure, not only in terms of letter adjacency and within words generally, but also within lines, paragraphs, and pages. Yet while all natural languages do follow plenty of orthographic rules, none of them (from this far back in time, at least) has orthographic conventions that extend so far into the high-level page layout.
In Voynichese, you can see these “supra-orthographic structures” in such places as:
* Horizontal Neal sequences (stereotypically manifesting themselves as pairs of single-leg gallows placed about two-thirds of the way along the topmost line of a paragraph or page)
* Vertical Neal sequences (the first letter of each of a series of adjacent lines, forming a putative column of letters, and very probably distorting the aggregate statistics for the first character of each line)
* Vertical free-standing key-like sequences
* Substantial difference in word structure within “labels” (short pieces of free-floating text, typically inside or beside drawn features)
* Grove “titles” (small fragments of right-justified text tagged onto the end of paragraphs, e.g. on f1r)
* Small text_size:dictionary_size ratio
* Multiple repetitions of high-frequency words (daiin daiin, qotedy qotedy, etc.)
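The last two items in the list above are the easiest to measure directly. What follows is only a rough sketch, assuming the transcription has already been reduced to one string of space-separated EVA words per line; the function names and the toy input are invented for illustration and are not taken from any real transcription file.

```python
from collections import Counter


def type_token_stats(words):
    """Return (token_count, type_count, tokens_per_type) for a list of words."""
    tokens = len(words)
    types = len(set(words))
    return tokens, types, (tokens / types if types else 0.0)


def adjacent_repeats(lines):
    """Count immediate repetitions ("w w") within each line, keyed by word."""
    repeats = Counter()
    for line in lines:
        words = line.split()
        # Pair every word with its right-hand neighbour on the same line.
        for left, right in zip(words, words[1:]):
            if left == right:
                repeats[left] += 1
    return repeats


# Toy input, invented for illustration only (not a real transcription).
sample = [
    "daiin daiin chedy qokeedy",
    "qotedy qotedy shedy daiin",
]
all_words = [w for line in sample for w in line.split()]

print(type_token_stats(all_words))
print(adjacent_repeats(sample))
```

Note that this version only looks for repeats inside a single line, which is itself a choice: repeats that straddle a line break are ignored.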
[Just about the only supra-word-level orthographic structure we can directly match is the change in frequency stats for the last letter of a line. In natural languages, we often see a hyphen placed there, while in Voynichese we often see EVA ‘m’ or ‘am’: so I would be unsurprised if these are essentially the same thing.]
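As a minimal sketch of how that line-final shift might be checked, the code below compares the last letter of each line-final word with the last letter of every other word. It assumes (purely for convenience) that each EVA character counts as one letter and that lines arrive as space-separated strings; `final_letter_shares` and the toy lines are hypothetical.

```python
from collections import Counter


def final_letter_shares(lines):
    """Compare last letters of line-final words with those of all other words.

    Returns two dicts (line_final, elsewhere), each mapping letter -> share.
    """
    line_final, elsewhere = Counter(), Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        for word in words[:-1]:
            elsewhere[word[-1]] += 1
        line_final[words[-1][-1]] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()} if total else {}

    return normalise(line_final), normalise(elsewhere)


# Toy input, invented for illustration only.
toy_lines = [
    "daiin chedy dam",
    "qokeedy shedy otam",
    "chol daiin chedy",
]
at_line_end, not_at_line_end = final_letter_shares(toy_lines)
print(at_line_end)
print(not_at_line_end)
```

On real text, a letter whose share is much larger in the first dict than in the second would be a candidate for the hyphen-like behaviour described above.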
Each of these features (which I’ve discussed in more detail elsewhere on this site) on its own would be annoying enough to account for, if (say) you were trying to reconcile Voynichese with a conventional language. However, put them all together and you suddenly get a glimpse of what we’re really dealing with here: something arbitrary, painfully complex, and extremely unlanguage-like.
If, as per almost all natural languages and ciphertexts, Voynichese did not have these features, we would happily describe it as “flat”, and it would be utterly fair for people to throw their home-grown statistical toolkits at it in the reasonable expectation that something might just emerge from the process.
However, Voynichese is not flat: and so this kind of simple-minded approach is 99.9% certain to reveal nothing of any genuine novelty or insight. Sorry, but that’s just the way it is.
So, What’s The Answer, Nick?
If you want to do statistical analysis on the Voynich Manuscript that genuinely stands a chance of producing insightful and helpful results, you really need to put the Voynichese text through some kind of normalization filter before analysing it: by which I mean you need to condition the worst parts out.
The best starting point is to restrict your scope to one of the two large relatively homogeneous blocks of text:
* Quire 13 (but without labels, and without vertical sequences) – though note there is a long-unresolved suggestion that Q13 may have originally been composed in two parts / phases, not coincident with the final binding order
* Quire 20 (but without f116v) – though note there is also a long-unresolved suggestion that Q20 may have originally been composed in two parts / phases, and also not coincident with the final binding order.
Doing this should sidestep the thorny issues (a) of Currier A vs Currier B, (b) of text vs labels, and (c) of space transposition ciphers (because I don’t recall Q13 and Q20 having any “oro ror”-like sequences). [Personally, Q20 would be my preferred starting point.]
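In code, this kind of scope restriction might look something like the sketch below. It assumes each transcribed line already carries its folio, its quire number, and a tag saying whether it is paragraph text, a label, or a vertical sequence; the `Line` record, the `kind` tags, and `select_block` are all hypothetical, and real transcription files would need their own parsing step to produce them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    folio: str  # e.g. "f103r"
    quire: int  # quire number, taken from the transcription's metadata
    kind: str   # hypothetical tag: "paragraph", "label" or "vertical"
    text: str   # EVA text of the line


def select_block(lines, quire, exclude_folios=frozenset(), keep_kinds=("paragraph",)):
    """Keep only ordinary running text from one quire, minus excluded folios."""
    return [
        ln for ln in lines
        if ln.quire == quire
        and ln.folio not in exclude_folios
        and ln.kind in keep_kinds
    ]


# Toy corpus: the EVA strings here are invented placeholders.
corpus = [
    Line("f75r", 13, "paragraph", "qokedy shedy qokain"),
    Line("f75r", 13, "label", "otedy"),
    Line("f103r", 20, "paragraph", "daiin chedy qokeedy"),
    Line("f116v", 20, "paragraph", "placeholder text"),
]

q13 = select_block(corpus, quire=13)
q20 = select_block(corpus, quire=20, exclude_folios={"f116v"})
print(len(q13), len(q20))
```

Keeping the quire and folio filters as plain parameters makes it easy to rerun the same statistics on each half of a quire, should the two-phase suggestions turn out to matter.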
I would also strongly advise filtering out any matched pairs of single-leg gallows that fall on any single line, along with the (usually shortish) text sequence that sits between them: and any ornate gallows too.
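A deliberately conservative sketch of that filter follows. It assumes that the single-leg gallows are the EVA letters 'p' and 'f', and it only acts when exactly two words on a line contain one; anything else is left untouched. Ornate gallows are not handled at all, because how they are encoded depends on the transcription. The name `strip_gallows_pair` is hypothetical.

```python
# Assumption: EVA 'p' and 'f' are the single-leg gallows.
SINGLE_LEG = frozenset("pf")


def strip_gallows_pair(line):
    """Drop a matched pair of single-leg-gallows words and the text between them.

    Acts only when exactly two words on the line contain a single-leg gallows;
    otherwise the line's words are returned unchanged.
    """
    words = line.split()
    hits = [i for i, word in enumerate(words) if SINGLE_LEG & set(word)]
    if len(hits) != 2:
        return " ".join(words)
    first, last = hits
    # Remove both gallows words and everything sitting between them.
    return " ".join(words[:first] + words[last + 1:])


# Toy lines, invented for illustration only.
print(strip_gallows_pair("daiin opchedy qokal ofchey shedy"))
print(strip_gallows_pair("pchedy daiin"))
```

One obvious weakness: a paragraph-initial gallows plus a single later gallows on the same line would be treated as a pair, so in practice this would probably need to be combined with whatever rule is chosen for paragraph-initial letters.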
All of which leaves the tricky issue of how best to normalize page-initial, paragraph-initial, and line-initial letters. The jury is still well and truly out on these: which probably means that evaluating them would be a good use of statistical analysis. Which also probably means that nobody is going to actually do it. 🙁
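For anyone tempted to try, one possible first step (a sketch, not a recommendation of any particular metric) is to measure how far the first-letter distribution of line-initial words sits from that of all other words. Total variation distance is used here simply because it is easy to compute and read; the function names are hypothetical, and page-initial and paragraph-initial positions would need extra metadata that this toy version does not have.

```python
from collections import Counter


def first_letter_dist(words):
    """Relative frequency of the first letter across a list of words."""
    counts = Counter(word[0] for word in words if word)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def line_initial_shift(lines):
    """Distance between first-letter stats of line-initial and other words."""
    initial, interior = [], []
    for line in lines:
        words = line.split()
        if words:
            initial.append(words[0])
            interior.extend(words[1:])
    return total_variation(first_letter_dist(initial), first_letter_dist(interior))


# Toy input, invented for illustration only.
toy = [
    "saiin daiin chedy",
    "sol daiin qokedy",
    "daiin chedy daiin",
]
print(line_initial_shift(toy))
```

A value near 0 would suggest line-initial words behave like any others; a value near 1 would suggest they need separate treatment (dropping the word, dropping the letter, or something else) before any further analysis.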
Finally: once you have got that far, all you’re left with is… the truly humungous issue of how best to parse Voynichese. Is EVA ‘ckh’ one letter, two letters, or three? Should EVA ‘qa-’ and ‘qe-’ always be interpreted as if they are copying errors for EVA ‘qo-’? Should each of EVA ‘or’ / ‘ol’ / ‘ar’ / ‘al’ be read as a pair of letters or a single (tricky) verbose cipher glyph? Does ‘ok’ encipher a different token to ‘k’? Is ‘yk’ two letters or one composite one? And so forth… the list goes on (and it’s a very long list).
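One way to keep these parsing questions explicit rather than buried is to make the glyph inventory a parameter, so the same statistics can be rerun under each hypothesis. The sketch below uses greedy longest-match, left-to-right tokenisation, which is itself an assumption (other strategies would split some words differently); `tokenize` and the example inventories are hypothetical.

```python
def tokenize(word, multi_glyphs=()):
    """Split an EVA word into tokens, treating listed strings as single glyphs.

    Matching is greedy and left to right, preferring the longest candidate;
    any character not covered by a multi-character glyph becomes its own token.
    """
    # Longest candidates first; empty strings are ignored.
    units = sorted((u for u in multi_glyphs if u), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        for unit in units:
            if word.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            # No multi-character glyph starts here: emit a single character.
            tokens.append(word[i])
            i += 1
    return tokens


# The same word under two different (hypothetical) parsing hypotheses.
print(tokenize("ckhol"))
print(tokenize("ckhol", multi_glyphs=("ckh", "ol")))
print(tokenize("qokal", multi_glyphs=("ok", "al")))
```

Running a letter-frequency or adjacency study once per inventory, and seeing which results survive across all of them, is at least a way of telling which findings depend on a parsing choice and which do not.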
But unless you can find a way to see clearly past Voynichese’s supra-orthography, you’ll probably never get even remotely close to anything that interesting with your own Voynich statistics. Just so you know! 😐
[*] Tongue planted firmly and immovably in cheek.