Thanks to Newsweek, Fox News, The Daily Mail and The Independent [*sigh*], some techy Canadian Voynich research is currently enjoying its day in the media sun. (Hint to authors: sorry, but based on recent evidence, it would seem that you have ~48 hours to get your next funding request submitted and approved before everyone currently cheering starts booing.)
CompSci professor Greg Kondrak and graduate student Bradley Hauer presented their research at the 2017 ACL conference, and their paper “Decoding Anagrammed Texts Written in an Unknown Language and Script” appeared in Transactions of the Association for Computational Linguistics Volume 4, Issue 1, pages 75–86 [though the PDF is freely downloadable, at least for now].
From the press coverage so far, you might think that they had CARMELed the Voynich (i.e. thrown a tame supercomputer and some kind of clever-arse AI libraries at the problem): for, as the media incessantly repeat at the moment, All Human Problems Will Inevitably Yield To The Scythed Mega-Bulldozer That Is AI. But… is any of that true? Or useful? What’s actually going on here?
Behind the Kondrak and Hauer headlines
The initial question is obvious: what did Kondrak and Hauer actually do to try to crack the Voynich’s mysterious secrets that (they thought) nobody else had tried before? A quick snoop reveals that Bradley Hauer is a pretty smart crypto cookie: the simple substitution cipher solver presented in his 2014 paper “Solving Substitution Ciphers with Combined Language Models” outperforms many competing academic solutions. It does this by using both letter statistics and word lists at the same time (a) to solve Aristocrat cryptograms (i.e. ones where you know where the word boundaries are) even under mildly noisy conditions, and (b) to solve Patristocrat cryptograms (i.e. ciphertexts without spaces, though the recursive approach used to turn Patristocrats into candidate Aristocrats seems somewhat heavy-handed). The paper then finally moves on to trying (unsuccessfully) to reproduce the kind of deniable encryption loosely proposed in Stanislaw Lem’s (1973) “Memoirs found in a bathtub”.
So by the time the Voynich paper came to be written, Hauer had already built up a lot of software machinery for solving nicely-word-boundaried simple substitution ciphers at speed, even where some kind of mild text mangling had taken place. And so it should not be a surprise that he carried this technology and approach forward, insofar as the 2016 paper tries to solve Voynichese as if it were a nicely-word-boundaried simple substitution cipher that had had its text mangled via anagramming plus optional abjad-style vowel removal. Given that as the paper’s founding presumption, all it is trying to do is evaluate which plaintext language was used if that entire presumption just happened to be correct (oh, and if the transcription used was accurate).
Incidentally, the Voynich corpus used was 43 pages (“17,597 words and 95,465 characters”) of Currier-B text in the Currier transcription that one or both of Knight & Reddy had supplied, and the authors do not seem to have questioned the reliability or parsing choices behind that particular transcription. (More on this below.)
Unlike Stephen Bax’s well-known 2014 Voynich paper (which began by gleefully flipping the bird at nearly all previous Voynich research), Kondrak and Hauer’s Voynich paper begins by covering what they consider related Voynich work (section 2.1) in a level-headed, if somewhat brief, way. The most relevant source they have for the notion that we might be looking at anagrammed text is Gordon Rugg’s 2004 paper: this floated the idea that there might be a similarity between alphabetically ordered anagrams (‘alphagrams’) and what we see in the Voynich Manuscript’s text.
Yet much has already been written about Voynich anagramming beyond this, not least William Romaine Newbold’s monstrously tangled ‘decryption’ (*shudder*). More recently, Edith Sherwood claimed both that it was a young Leonardo da Vinci who wrote the Voynich Manuscript, and that the Voynich text was written in anagrammed Italian (though so far she has mostly tried only to reconstruct Voynich plant names using her proposed scheme). As I pointed out in 2009, this seems extraordinarily unlikely to work in the way she proposes.
Arguably the most interesting previous Voynich research into anagrams (again, not mentioned by Hauer) has been that of London-based researcher and translator Philip Neal. In a (now long-lost) page he posted many years ago on the late Glen Claston’s voynichcentral.com website, Philip proposed:
Here is a transformation of plaintext into ciphertext which explains certain features of the Voynich “language”.
1. Divide a plaintext into lines
2. Sort the words of each line into alphabetical order
3. Sort the letters of each word into alphabetical order
1. one thing led to another thing last night
2. another last led night one thing thing to
3. aehnort alst del ghint eno ghint ghint ot
The result has some of the statistical properties of the Voynich text.
A. The frequency distribution of words and letters is the same as in the natural language plaintext, but the distribution of two-letter groups and two-word groups is significantly altered.
B. Words at the beginning of a ciphertext line tend to start with letters at the beginning of the alphabet. Compare the high frequency of Voynich “d” at the beginning of a line.
C. If a letter near the end of the alphabet has a tendency to be word-initial in the plaintext (e.g. German “w”), it will have a strong tendency to be the last word in a line. Compare the high frequency of Voynich “m” at the end of a line.
D. The ciphertext versions of frequent words will tend to cluster together in a line. That is, where a word such as “thing” occurs twice in the plaintext line (as in the above example) the two word sequence “ghint ghint” will occur, but “ghint” may also occur elsewhere in the line as an anagram of “night”.
E. A one-letter word of ciphertext can only be an anagram of a single word of plaintext (“a” can only be an anagram of “a”) and a two-letter word of ciphertext can only be an anagram of two possible words of plaintext (“et” can only be an anagram of “et” and “te”). This means that you cannot have a ciphertext line of the pattern “… i … i … ” or of the pattern “… et … et … et …”. This principle largely holds good in the Voynich text: there are only six exceptions in the corpus of Currier’s language B.
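Philip's three steps are simple enough to sketch in a few lines of Python. This is only a minimal illustration, assuming lower-cased words separated by whitespace and a plain lexicographic sort; the function name `neal_transform` is mine, not his. Note that under such a sort "thing" precedes "to", so the transformed line ends in "ot".

```python
def neal_transform(plaintext: str) -> list[str]:
    """Apply the three steps: split into lines, sort words, sort letters."""
    ciphertext_lines = []
    for line in plaintext.splitlines():              # step 1: one line at a time
        words = sorted(line.lower().split())         # step 2: sort the words in the line
        alphagrams = ["".join(sorted(w)) for w in words]  # step 3: sort each word's letters
        ciphertext_lines.append(" ".join(alphagrams))
    return ciphertext_lines


result = neal_transform("one thing led to another thing last night")
print(result[0])  # aehnort alst del ghint eno ghint ghint ot
```

Because the letters are only reordered, single-letter counts are unchanged (property A), while the repeated "ghint" run shows the clustering described in property D.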
To his credit, Philip then immediately pointed out some problems with this suggestion:
1. Voynichese words do not conform to a strict alphabetical ordering of letters (there are quite a lot of words of the pattern dshedy).
2. Voynichese words have a strong tendency to contain only one instance of a given letter, unlike any obvious candidate language for the plaintext.
3. The enciphering described is not unambiguously reversible (however I think it would work as a private aide-memoire, or as a means of establishing priority like Galileo’s well known anagram announcing his discovery of the phases of Venus)
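The third problem (ambiguity on the way back) can be made concrete with a small sketch. The word list below is a made-up toy vocabulary, not anything taken from Philip's page, and `alphagram` and `group_by_alphagram` are illustrative names: the point is just that several plaintext words can collapse onto the same sorted-letter form.

```python
from collections import defaultdict


def alphagram(word: str) -> str:
    """Return the letters of a word in alphabetical order."""
    return "".join(sorted(word.lower()))


def group_by_alphagram(words):
    """Map each alphagram to the set of vocabulary words that produce it."""
    groups = defaultdict(set)
    for w in words:
        groups[alphagram(w)].add(w.lower())
    return groups


# Hypothetical toy vocabulary, chosen only to show collisions.
WORDS = ["thing", "night", "last", "salt", "slat", "led", "one", "eon", "to", "another"]

groups = group_by_alphagram(WORDS)
ambiguous = {k: sorted(v) for k, v in groups.items() if len(v) > 1}
for key, candidates in sorted(ambiguous.items()):
    print(key, "->", candidates)
```

Even with only ten words, three of the alphagrams have more than one candidate reading, and that is before accounting for the word order within each line having been thrown away as well.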
(Philip has since instead proposed a possible grid-like constraint on the position of Voynichese letters within Voynichese ‘words’, though problems with that alternative explanation remain.)
Incidentally, Philip has also pointed to a number of places within the Voynich Manuscript where entire lines appear to have been written in a non-one-after-the-other way (i.e. unexpected line transpositions). Separately, nobody has yet come up with a powerfully convincing explanation for the presence of “Neal keys” (sections of text typically delimited by pairs of single-leg gallows) in the top lines of pages (typically embedded ⅔ of the way across). He is a sharp observer, and these anomalies are all inconsistent with the widely-held presumption that the text we are looking at here is completely unmangled.
Ultimately, though, it remains a sizeable step (or three, or indeed more) to go from anywhere here to Hauer’s presumption that what we are looking at is straightforwardly anagrammed text in a conventional European language, whether abjad or not.
The actual Voynich research gap
If asked for the single largest methodological problem with Voynich research, I would point to the way that Voynich researchers tend to make a series of unfounded assumptions:
(a) the transcription they are using is perfectly reliable;
(b) the way that they parse that transcription (i.e. into tokens) is correct – there are many hidden linkages here which are each probably sufficient to derail any decryption attempt;
(c) the candidate plaintext languages they consider are genuinely representative of the Voynich Manuscript’s plaintext;
(d) no other textual transformations are present;
(e) the putative hypothetical transformation that they just happen to have plucked from the air and which they are testing is precisely that which is present in the Voynich Manuscript; and
(f) the output of their reverse transformation will be straightforward text that can be read and marvelled over by historians.
In the case of Kondrak and Hauer, I hope it should be clear that they have fallen foul of every one of these issues in turn: and their paper is all the worse for it. It is one thing to note in passing that Esperanto’s “extreme morphological regularity […] yields an unusual bigram character language model which fits the repetitive nature of the VMS words” (p.83), but it would be quite another to point out that this might easily have arisen from the way that Voynichese needs to be parsed in order for it to make sense: and it is this apparent lack of perception of the practical difficulties that all Voynich decryptors face that devalues the genuinely good work that went into their paper.
What particularly frustrates me is that in spite of these many issues, there are plenty of ways Voynich researchers can make genuine progress towards understanding what is going on: but instead they persist in trying to airball their own personal Voynich match-winner from the other end of the basketball court. They seem seduced by the glamour of being The One Who Solved The Voynich, instead of getting on with the graft of making a difference to what we know. 🙁
Yet computational linguistics has such a rich toolbox (of which CARMEL is merely one small screwdriver) that it surely has ample capacity to at least try to bridge all the actual research gaps that people are falling into, e.g.:
* What is the right way to parse EVA into tokens? (e.g. is EVA ‘or’ two tokens or one? is EVA ‘cth’ three tokens, two tokens, or one? etc)
* How does Currier A map to Currier B? And what about all the subtypes of each of these?
* What are the differences between them and “Currier C”? (Rene Zandbergen’s term for labelese)
* Can we determine whether line-initial letters are likely reliable or unreliable?
* Are words abbreviated (e.g. is EVA y some kind of truncation symbol)? If so, are A and B abbreviated in exactly the same way?
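To see why the first of these questions matters so much, here is a rough sketch of how the same EVA string yields different token sequences depending on which letter groups are treated as single units. The three schemes and the sample word are assumptions picked purely for illustration (they are not a proposal for the correct parsing), and the greedy longest-match `tokenize` function is a hypothetical helper that only handles contiguous groupings.

```python
def tokenize(word: str, units: tuple[str, ...] = ()) -> list[str]:
    """Greedy longest-match split of an EVA word into tokens."""
    ordered = sorted(units, key=len, reverse=True)  # try longer units first
    tokens, i = [], 0
    while i < len(word):
        for unit in ordered:
            if word.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            tokens.append(word[i])  # no multi-letter unit applies here
            i += 1
    return tokens


# Hypothetical parsing schemes: which EVA groups count as one token.
SCHEMES = {
    "every EVA letter separate": (),
    "cth as one token": ("cth",),
    "cth and or as one token each": ("cth", "or"),
}

for name, units in SCHEMES.items():
    print(f"{name}: {tokenize('cthor', units)}")
```

The same five EVA letters come out as five, three, or two tokens, so any letter-frequency or bigram model built on top will differ from scheme to scheme before a single decryption hypothesis has even been tested.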
If people had the intellectual good sense to stop trying to fly over all these separate hurdles at the same time in a Steve Austin-style 100m leap of misplaced faith, we might start to make real progress. However, even when researchers do have the necessary brains to make progress (as Hauer clearly has), it seems they have insufficient strength of mind to not be tempted by the glamour of the big-ticket “Researchers Crack Voynich Manuscript” headline. 🙁