Here’s a 2011 paper by Grzegorz Jaskiewicz of the Faculty of Electronics and Information Technology at Warsaw University of Technology, entitled “Analysis of Letter Frequency Distribution in the Voynich Manuscript”.
Essentially, Jaskiewicz used some Java code to screen-scrape a mini-corpus of text from 23 different languages via Wikipedia’s Random Article button, and then compared each of them with Voynichese (he used Glen Claston’s Voynich-101 transcription): cutting to the chase, the top five matches were Moldavian, Karakalpak, Kabardian Circassian, Kannada, and Thai.
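To make the shape of that kind of experiment concrete, here is a minimal Python sketch (not Jaskiewicz’s Java code, whose details aren’t given here). It rests on several assumptions of mine: every non-whitespace character counts as one symbol, symbol identities are thrown away so that only the ranked frequency curve is compared, and the distance is plain Euclidean. The function names and the `corpora` dictionary are hypothetical, and the paper may well use a different measure.

```python
from collections import Counter
import math


def symbol_frequencies(text):
    """Relative frequency of each non-whitespace character (assumed: 1 char = 1 symbol)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def ranked_profile(freqs):
    """Frequencies sorted from most to least common, ignoring which symbol is which."""
    return sorted(freqs.values(), reverse=True)


def profile_distance(a, b):
    """Euclidean distance between two ranked profiles, zero-padded to equal length."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_languages(target_text, corpora):
    """Return (language, distance) pairs, closest match first.

    `corpora` is a hypothetical dict mapping a language name to sample text.
    """
    target = ranked_profile(symbol_frequencies(target_text))
    scores = {
        name: profile_distance(target, ranked_profile(symbol_frequencies(text)))
        for name, text in corpora.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1])
```

Note that everything here depends on what you decide a “symbol” is, which is exactly where the trouble starts.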
Obviously, if you’re a Voynich cipher true believer (or even a Voynich hoax false believer), none of this will cause you to lose any sleep. Similarly, if you’re a Jacques Guy-esque Chinese language supporter (and Jacques Guy himself isn’t, Voynich trivia fans), you’ll probably be patting yourself hard enough on the back to send your dentures flying.
Personally, I think there’s something utterly wrong with the Chinese hypothesis, and indeed with this kind of experiment. In effect, what people are doing isn’t comparing Voynichese with a language, but instead comparing a clunky transcription of Voynichese with a clunky transcription of a language. Wherever a given language fails to be captured by ‘pure’ Romanized letters, it almost inevitably ends up being expressed using paired groups of symbols – letters and modifiers. I’ll give some examples from, let’s say, Jaskiewicz’s top 5 matches:
First example: Kannada. Its 49-letter alphabet includes “half-letters” which combine to form a huge number of compound letters known as “vattakshara”.
Second example: Kabardian Circassian. This is a language shoehorned into the Cyrillic alphabet by forming compounds of letters to create single sounds (one such compound is four letters long).
Third example: Moldovan and its various transcriptions form a hugely political issue – I can’t even display the Moldovan Wikipedia page in Internet Explorer, that’s how bad it gets. I can only presume it has ended up in some kind of 16-bit Unicode limbo.
Fourth example: Thai. This has 44 consonants (“phayanchaná”), and 15 vowel symbols (“sàrà”) that further combine into 28 or more compound vowel forms, as well as four tone marks. It’s a complicated compound transcription.
The point I’m making (in a somewhat laboured way) is that what Voynichese shares with these languages is a clunky transcription that does not naturally capture the essence of the language itself (and the stroke-based EVA transcription is probably even worse for this). Yet for Voynichese, I argue that this is not a linguistic feature but a cryptographic feature: even though Voynichese letters like “o” and “a” are intended to resemble vowels, their statistical structure is that of modifiers – “4o” / “ol” / “al” / “aiir” / “aiiv” all statistically operate as compound letters.
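Here is a small Python sketch of why the choice of unit matters so much. It is only an illustration under my own assumptions: the compound list contains just the groups named above, the greedy longest-match rule is my choice rather than any established Voynichese parsing, and the function names are hypothetical.

```python
from collections import Counter

# Assumed compound units: only the groups named in the text above.
COMPOUNDS = ["aiir", "aiiv", "4o", "ol", "al"]


def tokenize(word, compounds=COMPOUNDS):
    """Greedy longest-match split of a word into compound units and single glyphs."""
    units = sorted(compounds, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        for unit in units:
            if word.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            # No compound starts here, so fall back to a single glyph.
            tokens.append(word[i])
            i += 1
    return tokens


def token_counts(text, compounds=COMPOUNDS):
    """Count units across whitespace-separated words."""
    counts = Counter()
    for word in text.split():
        counts.update(tokenize(word, compounds))
    return counts
```

Calling `token_counts(text, compounds=[])` gives the per-character counts, while the default call gives the compound-aware ones. The two alphabets differ in size and the frequency curves differ in shape, so any “closest language” ranking built on one need not hold for the other.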
So ultimately, I have to say that I find such language comparisons futile and misguided: they are almost always built on an insufficient grasp of the nature of Voynichese, of languages, and of transcriptions. What’s behind this isn’t innately bad science or bad history, just an unrefined (and actually rather primitive) human desire to understand things by trying stuff out. Yes, for all the new-media technology sheen and stats smarts, it’s no more than hitting a rock with a hammer and hoping for a perfect diamond to fall out. But yuh ain’t gonna get no diamonds that way this week, bubba. 🙁