The Perplexing Mystery of the Voynichese 'Languages'...

Where will the first proper Voynich research breakthrough come from? To my mind, there is a good chance that this will be made by someone taking a fresh look at the mystery of the Voynichese ‘languages’.

For even though the notion that Voynichese is a simple, regular language seems to be the default decryption starting point for just about every YouTube codebreaker on the planet (e.g. “it’s obviously proto-Breton with Urdu loanwords”, etc. etc.), it simply isn’t.

Rather, when you put Voynichese under the linguistic microscope, you see a series of different (but closely related) languages / writing systems. And whatever you think Voynichese is, having to account for multiple variants of that thing is bemusing, if not downright perplexing.

The most fundamental challenge, then, that these variants present us with is this: can we work out how these variants relate to each other? Furthermore, can we work out how a letter / word / sentence written in one variant would be written in another? In short, can we somehow normalize all the Voynich Manuscript’s languages relative to each other, to step towards a single, regular system underlying them all?

For me, reaching even part of the way towards doing this would be perhaps the most significant Voynich research achievement yet.

The ‘Language’ Landscape…

It was top American cryptologist Captain Prescott Currier back in the 1970s who first inferred the presence of multiple Voynichese ‘languages’. He famously categorised Voynichese pages as having been written in either an ‘A’ language variant (now known as “Currier A”) or a ‘B’ language variant (A.K.A. “Currier B”). This was motivated by various statistical features of the text that he observed clustering together in A pages and B pages respectively. What is more, Currier’s A/B clustering largely holds true not only for both the pages on any given folio (i.e. recto and verso), but also for all the folios / panels on a single bifolio (or trifolio, etc).

Though Currier’s A/B division is a very useful categorisation tool, it remains somewhat problematic as an absolute measure, for (as Rene Zandbergen likes to point out) a few intermediate pages have both Currier A and Currier B features simultaneously. Rene points especially to the foldout folios for this: he says that Currier’s initial assessment was drawn from the herbal pages (which I think is very probably true), and that these super-wide pages behave a little differently.

Moreover, the variations of the languages used in different sections (e.g. “Herbal A”) present yet further dialect-like differences to be accounted for. Inferring from this that these differences ‘must therefore’ relate to the pages’ semantic content would be a convenient way of explaining them away: but there is as yet no evidence to support that conclusion. For now, these section clusters need to be handled with statistical white gloves too.

We additionally have codicological evidence that suggests that some sections of the manuscript were originally formed of pairs of gatherings (e.g. Q13 was Q13A + Q13B, Q20 was Q20A + Q20B), but nobody (as far as I know) has as yet gone looking for Voynichese text statistics that might support or refute these proposed divisions.

And on top of that, there is what has come to be called ‘labelese’, i.e. the disjointed one-word-at-a-time text found on pages with ‘labels’ attached to parts of diagrams (e.g. the Astrological / Zodiac section). Here again, some people like to infer that it ‘must somehow be’ the semantic content of these labels that affects the way Voynichese works: but there is no evidence to support that conclusion, beyond wanting it to be true for an easy life. 😉

In summary, what we observe in Voynichese is a lot of language-like variation going on at a number of levels. In my opinion, we should stop trying to explain away these variations in terms of speculative concepts (e.g. ‘semantic differences’ or ‘labels’), and start instead to look at the basic statistical patterns that each text cluster presents, and use those results as our starting point moving forward.

Unsurprisingly, this is what the next section does. 🙂

A/B Observations…

It’s worth reprising Currier’s observations (which we will turn into actual statistical evidence shortly). He wrote (transcribed on Rene’s site):

(a) Final ‘dy’ is very high in Language ‘B’; almost non-existent in Language ‘A.’
(b) The symbol groups ‘chol’ and ‘chor’ are very high in ‘A’ and often occur repeated; low in ‘B’.
(c) The symbol groups ‘chain’ and ‘chaiin’ rarely occur in ‘B’; medium frequency in ‘A.’
(d) Initial ‘chot’ high in ‘A’; rare in ‘B.’
(e) Initial ‘cTh’ very high in ‘A’; very low in ‘B.’
(f) ‘Unattached’ finals scattered throughout Language ‘B’ texts in considerable profusion; generally much less noticeable in Language ‘A.’
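As a rough illustration of how observations like (a), (b), (d) and (e) could be turned into numbers, here is a minimal Python sketch. It assumes (and these are my assumptions, not anything Currier specified) that each page is available as an EVA-style string with words separated by '.' or spaces, and that each page already carries an 'A' or 'B' label. The function name `feature_rates` and the toy input are hypothetical, and the toy text is made up purely to exercise the code.

```python
from collections import Counter

def words(text):
    """Split EVA-style text into words on '.' or whitespace."""
    return [w for w in text.replace(".", " ").split() if w]

def feature_rates(pages):
    """Per-language rate (hits per word) for a few Currier-style markers.

    pages: iterable of (language_label, text) pairs.
    """
    totals = Counter()
    hits = {}
    for lang, text in pages:
        counts = hits.setdefault(lang, Counter())
        for w in words(text):
            totals[lang] += 1
            if w.endswith("dy"):          # (a) final 'dy'
                counts["-dy"] += 1
            if w in ("chol", "chor"):     # (b) 'chol' / 'chor' as whole words
                counts["[chol]/[chor]"] += 1
            if w.startswith("chot"):      # (d) initial 'chot'
                counts["chot-"] += 1
            if w.startswith("cth"):       # (e) initial 'cTh', written 'cth' in EVA
                counts["cth-"] += 1
    return {
        lang: {feat: n / totals[lang] for feat, n in counts.items()}
        for lang, counts in hits.items()
    }

# Made-up toy input, NOT real transcription data.
toy_pages = [
    ("A", "chol.chor.daiin.cthy.chotaiin"),
    ("B", "qokedy.shedy.chedy.ol.qokain"),
]
rates = feature_rates(toy_pages)
```

On real data you would feed in one (label, text) pair per page, and probably also keep the per-page rates rather than only the per-language totals, so that the intermediate pages show up.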

Rene Zandbergen adds the following observations:

The very frequent character combination ‘ed’ is almost entirely non-existent in all A-language pages.
The very common character combination ‘qo’ is almost completely absent in the zodiac pages and the rosettes page, but appears everywhere else.
The common character combination ‘cho’ does not appear in the biological pages (and the rosettes page), but it does in other B-language pages.

Marco Ponzi further added:

The ‘cluster’ aiin has more or less the same frequency in A and B, but as a stand-alone word it is about three times as frequent in B than in A.
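The distinction Marco draws (aiin as a cluster inside words versus aiin as a word in its own right) is easy to blur when counting, so here is a small hedged sketch of how the two measures might be kept apart. The function name `aiin_profile` is hypothetical, "cluster" is read here as "words containing aiin" (one possible reading), and the two toy strings are contrived to mirror the shape of the claim rather than taken from any transcription.

```python
def aiin_profile(text):
    """Return (cluster_rate, standalone_rate), both per word, for 'aiin'."""
    ws = [w for w in text.replace(".", " ").split() if w]
    if not ws:
        return (0.0, 0.0)
    cluster = sum(1 for w in ws if "aiin" in w)      # words containing 'aiin' anywhere
    standalone = sum(1 for w in ws if w == "aiin")   # words that are exactly 'aiin'
    return (cluster / len(ws), standalone / len(ws))

# Contrived toy strings, NOT real transcription data.
toy_a = "daiin.chol.kaiin.aiin.chor.saiin.dain.chy"
toy_b = "aiin.qokaiin.aiin.shedy.aiin.ol.chedy.qokedy"

profile_a = aiin_profile(toy_a)
profile_b = aiin_profile(toy_b)
```

With these toy strings the cluster rate is the same in both, while the standalone rate differs by a factor of three, which is the kind of pattern the two separate counts are designed to expose.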

Prescott Currier also noted a number of striking language oddities in the ‘Biological B’ section:

This ‘word-final effect’ first became evident in a study of the Biol. B index wherein it was noted that the final symbol of ‘words’ preceding ‘words’ with an initial ‘qo’ was restricted pretty largely to ‘y’; and that initial ‘ch, Sh’ was preceded much more frequently than expected by finals of the ‘iin’ series and the ‘l’ series. Additionally, ‘words’ with initial ‘ch, Sh’ occur in line-initial position far less frequently than expected, which perhaps might be construed as being preceded by an ‘initial nil.’
This phenomenon occurs in other sections of the Manuscript, especially in those ‘written’ in Language B, but in no case with quite the same definity as in Biological B. Language A texts are fairly close to expected in this respect.
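One way to look for this kind of ‘word-final effect’ more generally is to tabulate, for each adjacent pair of words on a line, the last glyph of the first word against the first glyph of the second, and compare the observed count with what independence would predict. The sketch below does this under some loud simplifications of my own: it uses single EVA characters (so ‘q’ stands in for ‘qo’, and ‘c’ lumps ‘ch’ together with ‘cth’, ‘ckh’ etc.), it never pairs words across a line break, and the name `adjacency_ratios` and the toy lines are hypothetical.

```python
from collections import Counter

def adjacency_ratios(lines):
    """Observed/expected ratio for (final glyph, next word's initial glyph).

    lines: list of lines, each a list of non-empty EVA words.
    Expected counts assume finals and initials are independent.
    """
    pair, final, initial = Counter(), Counter(), Counter()
    for line in lines:
        for prev, nxt in zip(line, line[1:]):
            f, i = prev[-1], nxt[0]
            pair[(f, i)] += 1
            final[f] += 1
            initial[i] += 1
    n = sum(pair.values())
    return {
        (f, i): count / (final[f] * initial[i] / n)
        for (f, i), count in pair.items()
    }

# Made-up toy lines, NOT real transcription data.
toy_lines = [
    ["shedy", "qokedy", "qokain", "chedy"],
    ["ol", "chedy", "qokal", "shey"],
]
ratios = adjacency_ratios(toy_lines)
```

A ratio well above 1 flags a pairing that occurs more often than chance would suggest, and well below 1 the opposite. Line-initial words would need separate handling (Currier’s ‘initial nil’), which this sketch simply leaves out.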

My own contribution to this line of inquiry has been to point out that word-initial ‘l-‘ is a very strong feature of B pages (particularly Q13). Emma May Smith similarly posted on the various “l + gallows” digraphs:

It should also be noted that <lk> is mostly a feature of the Currier B language. It is roughly twenty times less common on A pages than B pages.
The presence of digraphs composed of <l> and other gallows characters is less secure. The string <lt> occurs 107 times, <lp> occurs 40 times, and <lf> occurs 39 times. Although <lf> is the least of the three its rate is actually rather great, being nearly 8% of all <f> occurrences, approaching the 10% for <lk> of all <k>. Even so, these number are still small and could easily be overlooked if not for <lk>.
Like <lk>, <lt, lp, lf> all appear at the beginning of words, and mostly occur in Currier B. They seem to work in the same way, even if less common.
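Emma’s percentages are of the form “what share of all occurrences of a gallows glyph are directly preceded by l?”. A minimal sketch of that calculation follows. Note the assumptions: the function name `l_gallows_share` is hypothetical, the counting is done on raw EVA characters within words (so the ‘t’ in ‘cth’ and the ‘k’ in ‘ckh’ are counted as gallows too, which may or may not match how Emma counted), and the toy text is invented.

```python
def l_gallows_share(text, gallows="ktpf"):
    """For each gallows glyph g, the fraction of g occurrences directly preceded by 'l'.

    Returns None for a glyph that never occurs in the text.
    """
    ws = text.replace(".", " ").split()
    total = {g: 0 for g in gallows}
    after_l = {g: 0 for g in gallows}
    for w in ws:
        for pos, ch in enumerate(w):
            if ch in total:
                total[ch] += 1
                if pos > 0 and w[pos - 1] == "l":
                    after_l[ch] += 1
    return {g: (after_l[g] / total[g] if total[g] else None) for g in gallows}

# Made-up toy text, NOT real transcription data.
toy_text = "lkaiin.qokedy.okal.lkedy.otedy.lfchedy.ofchy"
shares = l_gallows_share(toy_text)
```

Running the same function separately over A pages and B pages would give a direct check on the “mostly a feature of Currier B” part of the claim.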

All in all, it seems to me that there are probably more than twenty Voynichese features that display a statistically significant difference between Currier A pages and Currier B pages. It also seems that many of these features have different relative frequencies between different clusters (e.g. Herbal A) and/or sections (e.g. Q13).
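For anyone wanting to put a number on “statistically significant” here, one simple (and admittedly crude) option is a 2x2 Pearson chi-square test on word counts: words with the feature versus words without it, in A versus B. The sketch below is my own illustration rather than anything the researchers quoted above used; the function name `chi_square_2x2` and the example counts are hypothetical.

```python
def chi_square_2x2(hits_a, total_a, hits_b, total_b):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction).

    hits_x: words showing the feature in language x; total_x: all words in x.
    Returns 0.0 when a row or column total is zero (no test is possible).
    """
    table = [
        [hits_a, total_a - hits_a],
        [hits_b, total_b - hits_b],
    ]
    n = total_a + total_b
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    if 0 in row or 0 in col:
        return 0.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts: 10 hits in 100 A words, 30 hits in 100 B words.
example_stat = chi_square_2x2(10, 100, 30, 100)
```

For one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05. Two caveats apply: testing twenty-plus features at once calls for some multiple-comparison correction, and words on a page are not really independent draws, so the raw statistic will tend to overstate significance.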

There is therefore plenty of work to be done here!

List of Distinctive Behaviours

Even though we have excellent transcriptions (EVA and otherwise), I think we’re collectively missing a foundational piece of Post-Currier empirical analysis here: a list of distinctive behaviours present and absent in sections of the Voynich Manuscript. This would extend Prescott Currier’s list to include many more features (such as the use of the EVA ‘x’ glyph, etc) that have been flagged up as distinctive in some way by researchers over the years, though with less of a pure A/B focus. Here is a preliminary list (based largely on the above), which I’m more than happy to extend with additional ones put forward in comments here or elsewhere:

-dy	B
[chol]	A
[chor]	A
[chol.chol]	A
[chor.chor]	A
[chain]	A
[chaiin]	A
chot-	A
cth-	A
ed	B
[ar]	B
qo-	Absent in rosette and zodiac pages
cho	Absent in rosette and Bio pages
cho*	Rare in Q13
[aiin]	Common in B as standalone word
l-	B
r-	B, particularly Q13 and Q20
lk	B
lt-	B
lp-	B
lf-	B
aly	f58
x	Q20
-m not at line-end	Bifolios f3-f6 & f17-f24
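To make a list like this machine-checkable, the notation first needs pinning down. The sketch below assumes one possible reading (my assumption, not a stated convention): a leading hyphen means word-final, a trailing hyphen means word-initial, square brackets mean a whole word, and anything else means “appears anywhere in the word”. Entries such as [chol.chol], cho* and the line-position rule for -m are not handled. The names `matches` and `page_profile` are hypothetical.

```python
def matches(feature, word):
    """Test one word against one feature, under the assumed notation."""
    if feature.startswith("[") and feature.endswith("]"):
        return word == feature[1:-1]            # [chol]: whole word
    if feature.startswith("-"):
        return word.endswith(feature[1:])       # -dy: word-final
    if feature.endswith("-"):
        return word.startswith(feature[:-1])    # chot-: word-initial
    return feature in word                      # ed: anywhere in the word

def page_profile(features, text):
    """Fraction of a page's words matching each feature (empty dict for an empty page)."""
    ws = [w for w in text.replace(".", " ").split() if w]
    if not ws:
        return {}
    return {f: sum(matches(f, w) for w in ws) / len(ws) for f in features}

# Made-up toy page, NOT real transcription data.
toy_features = ["-dy", "[chol]", "chot-", "ed", "lk"]
toy_profile = page_profile(toy_features, "chol.qokedy.lkedy.chotaiin")
```

Computing one such profile per page (or per bifolio) would give a feature vector that can be compared across sections without committing in advance to an A/B label.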

My core beliefs here are (a) that Voynichese will turn out to be fundamentally rational (if perhaps a bit strange); (b) that behaviours in one section will somehow rationally map to behaviours in many (if not all) different sections; and (c) that Voynichese will turn out to have an underlying story / evolution / growth path that we can reconstruct.

10 thoughts on “The Perplexing Mystery of the Voynichese ‘Languages’…”

Emma May Smith on May 6, 2019 at 4:05 pm said:

I agree that mapping out the key differences is important. Julian Bunn did something similar some years ago with specific words.

The difference between A and B could be a number of things, but they seem to boil down to two headings: content and encoding. Did the topic/language/dialect change, or the cipher/orthography/code change? It may even have been a mixture of the two.

I think that the existence of Herbal A and B rules out topic change. And the slow transition from A to B also rules out any change which must be either/or.

nickpelling on May 6, 2019 at 4:18 pm said:

Emma: for me, topic/content change is something we should only invoke as a last resort, when we’re still stuck on first steps.

As a (related) aside, what are your top tips for unexpected word adjacency properties? Currier mentions some in Bio B, but I don’t know if there are more general cases to be had. I vaguely recall reading something about this on one of your blog posts, but couldn’t find it when I went looking for it. Thanks!

Emma May Smith on May 6, 2019 at 5:23 pm said:

I’m not sure that’s really an aside, Nick, but my thoughts are summed up in this post: http://agnosticvoynich.wordpress.com/2017/01/09/first-last-combinations/ . I hope to explore the topic more in the future.

Julian Bunn on May 7, 2019 at 5:46 pm said:

I always enjoy your Voynich posts, Nick! The Currier A/B feature has long seemed to me the best handle we have for cracking the manuscript. I became obsessed a while back with the idea that the A language was simply one configuration of some algorithm the author(s) were using to encode the manuscript, and that B was a different configuration. That implied that words in Language A would have equivalent words in Language B. I spent a lot of time trying to find glyph/n-gram mappings that would transform the most prolific words in A into the most prolific words in B. I concentrated on the Herbal A and Herbal B folios, with the theory that the plaintext vocabulary was likely to be similar, and so there would be more chance of finding correspondence. The results of that study are in my blog, but can be summed up as inconclusive, of course 🙂

Drabkikker on May 8, 2019 at 9:07 pm said:

> it’s obviously proto-Breton with Urdu loanwords

Hey! That was my idea!

Ruby on May 9, 2019 at 2:50 pm said:

Thank you, Nick, for reminding us that the existence of several languages (at least two) has been recognized (proposed) by Currier. So perhaps it’s time for us to lower the level of censorship and self-censorship and admit the mixture of proto-breton and urdu?
Jokes aside, I hope that the solution, at least partial, is not far.

Mark Knowles on July 16, 2019 at 2:44 pm said:

Now my instincts coming out of my recent research with regard to this would be that we find different null words are more prevalent in some parts of the manuscript than in others.

Now given my research this is still very speculative.

Anyway this would imply different “null” languages rather than “real” languages, composed of the non-null words. So the real language would remain the same throughout the manuscript.

But then how does one explain different null words being used in different parts of the manuscript? This is a hard question to answer, as is the question as to how the author produced null words. I would think this is due to a semi-deliberate choice of null words changing throughout the manuscript, but also just a changing habit of using null words. To restate some previous comments, I doubt null words were generated formally by using a Cardan grille etc., but just had a common format by which they can be identified.

Hopefully at some point I will write up my study of Labelese.

D.N. O'Donovan on January 8, 2022 at 10:42 pm said:

Nick, I fully appreciate the reasons for the romanised ‘Voynichese’ transcription(s) devised twenty years ago, but wonder why there is still no free-to-use Voynichese script. Or is there one you know and would recommend?

nickpelling on January 8, 2022 at 10:58 pm said:

Diane: it’s because EVA was designed to be a transcription alphabet and so (largely) without interpretation bias.

OK, there are a number of places where it falls subtly short: but even so, it remains a remarkably good conversation enabler.

D.N. O'Donovan on January 8, 2022 at 11:39 pm said:

Nick – I agree, it makes conversation easier; my concern is that it may dull perception. Imagine, for example, that some of the Voynich glyphs are from one of the various forms for the Hindu-Arabic numerals, or that what is being transcribed as ‘ol’ should be read as .. I don’t know.. say ‘cz’ or ‘ни’.

The current romanisation accords with, and serves to maintain, certain expectations which aren’t necessarily supported by such things as the codicology or the images’ style and content. I suppose it’s really only important for the linguists and cryptographers, but for me – I’d still like to see Voynichese written and discussed as it actually appears.
Thanks for the response, Nick. I wish you and all your readers a happier 2022 than 2021.