Don’t worry, this will be the last Voynichese trivia post for a good few days (I’ve had enough myself). 🙂
If you remove all the spaces (both genuine-looking ones and space-insertion-cipher-like ones) in Voynichese, a slightly different set of patterns emerges (this is one of Marke Fincher’s favourite tricks). I thought I’d have another quick look at multiple repetitions of various verbose pairs in this space-less transcription…
Here are the raw instance counts, with each repeated form’s count also given as a percentage of the unrepeated pair’s count:-
- qo = 5168, qoqo = 9 [0.17%], qoqoqo = 0 [0%]
- ot = 2679, otot = 3 [0.11%], ototot = 0 [0%]
- ok = 2961, okok = 7 [0.24%], okokok = 0 [0%]
- op = 407, opop = 0 [0%], opopop = 0 [0%]
- of = 122, ofof = 0 [0%], ofofof = 0 [0%]
- ol = 5238, olol = 186 [3.55%], ololol = 13 [0.25%]
- al = 2885, alal = 60 [2.08%], alalal = 1 [0.035%]
- or = 2619, oror = 87 [3.32%], ororor = 1 [0.038%]
- ar = 2815, arar = 161 [5.72%], ararar = 7 [0.25%]
Yes, it’s a load of dull statistics to make your brain ache. But what jumps out at my eyes from this is that the ol/al/or/ar verbose pairs appear repeated many times more than the other high-frequency pairs (such as ot and ok: note these counts are with all the qo pairs removed first). From this, I conclude that qo/ot/ok/op/of all have one kind of statistical behaviour, while ol/al/or/ar have another statistical behaviour entirely.
Specifically, my guess is that the pairs in the first set appear in doubles roughly once every 500 instances (and never three times in a row) simply as a result of scribal copying errors, and that this gives us a rough idea of how frequent copying mistakes probably are in the VMs. My other guess is that ol/al/or/ar are genuinely meant to be in the ciphertext both twice and three times in a row (I somehow doubt the scribe could miscopy a pair three times in a row), and that these are quite probably Roman numerals in the plaintext.
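For anyone wanting to reproduce counts of this kind, here is a minimal Python sketch. It is not the code used for the figures above: it assumes a transcription in which words are separated by whitespace, and it tallies maximal runs (so an "ololol" is counted as one triple, not as a double plus a single), which may or may not match how the numbers in the list were produced. The function name `run_lengths` and the sample string are invented for illustration.

```python
import re
from collections import Counter


def run_lengths(text: str, pair: str) -> Counter:
    """Tally maximal runs of `pair` by how many times the pair repeats.

    Assumption: words are separated by whitespace, and every space is
    removed before counting (the crude, brutal approach).
    """
    spaceless = re.sub(r"\s+", "", text)
    pattern = re.compile(f"(?:{re.escape(pair)})+")
    return Counter(
        len(match.group(0)) // len(pair)
        for match in pattern.finditer(spaceless)
    )


# Invented sample text, not real Voynichese statistics.
sample = "qokar olol dal ololol or"
for pair in ("ol", "al", "or", "ar"):
    print(pair, dict(run_lengths(sample, pair)))
```

Dividing the count of doubles (or triples) by the count of singles would then give percentages comparable in spirit to the ones listed.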
Discuss!
Do we have any readily available statistics for Roman numerals? I suppose things like Benford’s law don’t work in those.
The interesting thing in these statistics is how dramatically the ranking changes. ol has top frequency in all three columns, so (under some strong assumptions) that would correspond to I. Then al should correspond to V, since it’s likely to occur on its own (5 is a very common number), but hardly ever occurs in doubles. That would leave ar for X, and possibly or for M. They may also be reversed, since MMM is more likely to have frequency 1 than XXX.
This is an interesting line of research, because if it can be strengthened, we might get some years out of it. It would be slightly worrying though, if the double MM’s referred to years. 🙂
Peter: every time you try to pin Voynichese letters (or even pairs of letters) down to a single text mapping, they seem to wormily writhe through your acquisitive statistical grasp, dive to the ground and quickly dig themselves new homes underground. All of which probably means that there’s some key aspect of how it all hangs together that we’re missing. As for MM’s, I guess they could just as well be referring to years BC, distances, lengths, etc? We don’t need to write a new chapter of The Da Vinci Code every time we float a Voynich hypothesis. 🙂
Also, I forgot to mention that “v” is a special case because it too often appears in pairs and triples (though linguistically rather than numerically), because it is used both as “u” and “v” circa 1400-1500. Cicco Simonetta even notes that this is at the root of the only linguistic triple letter: “uuula” (“uvula”), which you can see in Rule 11 on the English Wikipedia page on him. So it may well be that ol / al / or / ar encipher C L X V (in some order), though perhaps this is slightly less likely, I don’t know.
Going through the Roman numerals one by one, I offer my observations.
I, II and III are normal constructions. The pair coding for “I” would thus be very frequently seen alone, quite frequently seen repeated, and still somewhat frequently seen twice repeated. Even if “ol” were “I”, there could not plausibly be only thirteen instances of the number “3” in the VMs.
V is never repeated, so it must be enciphered as either “op” or “of”. Due to frequency, I’d say “op” is more likely.
X can be repeated once or twice, but numbers XXX -> XXXIX are rarely used, so X would fit as “al”, “or” or “ar” (keeping “ol” reserved for I).
L is not used much, and is never repeated, and so must be “op” or “of”. Due to low frequency, “of” is more likely.
C is used very frequently, also repeated and twice repeated. “ar” seems the likely candidate.
D is not used much and is never repeated, but “op” and “of” are already taken, and the rest of the candidates are very frequent. Oh wait. Would the date of the time not have been around MCDXXI? D would actually be used a lot in the dates of the time. Maybe this does explain the discrepancy in frequency. “ot” could be a candidate then, handwaving the three instances of “otot” as errors.
M is used a lot for dates, as well as a few other things, and so would be expected to have a frequency a bit higher than D. This makes “ok” likely.
So there are still two pairs left unaccounted for. And my, uh, pairings of the pairs with numerals are non-scientific speculations. But it was fun.
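One rough way to get baseline statistics for how often each numeral letter appears singly, doubled or tripled is to generate the numerals and tally the runs. The Python sketch below does that under two loud assumptions: it uses modern subtractive notation (IV, IX, XL and so on), whereas medieval scribes often wrote additive forms such as IIII, and it treats whatever range of numbers you feed it as equally likely, which real documents certainly do not. The names `to_roman` and `run_tally` are made up for this example.

```python
import re
from collections import Counter

# Modern subtractive notation; an assumption, not period practice.
NUMERALS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert a positive integer to a Roman numeral string."""
    parts = []
    for value, symbol in NUMERALS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def run_tally(numbers) -> Counter:
    """Count (letter, run length) occurrences across the given numbers."""
    tally = Counter()
    for n in numbers:
        # (.)\1* matches one maximal run of a repeated letter at a time.
        for match in re.finditer(r"(.)\1*", to_roman(n)):
            tally[(match.group(1), len(match.group(0)))] += 1
    return tally


# Example: runs of each letter across the numbers 1 to 1500.
for (letter, length), count in sorted(run_tally(range(1, 1501)).items()):
    print(letter * length, count)
```

Comparing the single / double / triple ratios this produces against the ol / al / or / ar figures would be one way to test whether any assignment is even in the right ballpark.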
Nick,
Your method of removing and ignoring spaces/breaks must be too crude. There’s more …
dydy, so good they named it twice.
Marke
Marke: you’re right, it was just intended to obliterate any sign of the space-insertion cipher as brutally as possible. If I wanted to do it with finesse, I’d have removed any spaces between adjacent ol / or / al / ar pairs: however, that would require (a) using my brain to construct a nice regexp, and (b) hacking my JavaScript code from 5 years ago. I’m not sure I’m up for either of those at the moment. 🙂
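For what it’s worth, the "finesse" version might look something like the Python sketch below (Python rather than the JavaScript mentioned above, and purely illustrative). It assumes whitespace-separated words and removes a space only when the two letters before it and the two letters after it are each one of ol / or / al / ar. The function name `close_pair_gaps` is invented here.

```python
import re

# A space is dropped only if it sits between two of ol / or / al / ar.
# The lookbehind and lookahead check the neighbours without consuming them,
# so chains such as "ol ol ol" collapse fully in a single pass.
PAIR_GAP = re.compile(r"(?<=[oa][lr])\s+(?=[oa][lr])")


def close_pair_gaps(text: str) -> str:
    """Remove spaces between adjacent ol / or / al / ar pairs only."""
    return PAIR_GAP.sub("", text)


# Invented sample, not a real line from the manuscript.
print(close_pair_gaps("qokol or aiin ar al dy"))
```

Note that this also joins a longer word that merely ends in one of the pairs (say, one ending in "ol") to a following word that starts with one, which may or may not be what is wanted.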
Hi Nick,
I too have been looking at something similar recently. Specifically, I was looking within lines only (as I believe the line is a unit, rather than governed by page width/space etc.) and at repeating strings of glyphs between 3 and 12 long. I saw nothing of note.
My basic mistrust of removing the spaces is that, until someone can explain why the spaces are irrelevant, it is logical to suppose they are there for a purpose, and their positions are significant. By removing them, you are throwing away important clues.
Just my tuppence worth.
Julian: as I replied to Marke (above), I only got rid of all the spaces out of laziness. I only really suspect a verbose-pair-disruptive space-insertion cipher to be operating between any two ol / or / al / ar pairs (including repetitions), and that was what the post was about. 🙂
Forgive a late-comer.. has anyone suggested that – in these repetitive uses – the ‘o’ and ‘a’ might be punctuation signs?
The Turkic ms [pic sent a couple of weeks ago to Nick] uses the small circle in that way, and of course so did the Greeks with their “i.o”.
so ol might mean 9 units, etc.
btw, the relevant characters in the Voynich look not unlike [Hindu] Devanagari numerals, don’t they?
Diane: while it’s possible, (a) both ‘a’ and ‘o’ have high instance counts, far higher than you’d expect for full stops or commas, (b) early 15th century writers weren’t (as I recall) yet as punctiliously punctuation mad as later writers were to become, (c) ‘o’ and ‘a’ (particularly ‘o’) very frequently occur inside words, (d) many extremely common pairs (such as ‘qo’, ‘or’, ‘ol’, ‘ok’, and ‘ot’, to name but five) contain an ‘o’, and (e) many short labels contain one or more o’s / a’s, which would seem a rather odd place to be putting punctuation.
So the light was on briefly there for a few minutes, but if people prefer fumbling in the dark who am I to stop them?
Good luck guys…
Marke
Marke: Let me pose you a slightly different question. What are the top ten statistical observations about Voynichese that you think should guide anybody trying to develop a theory about how the language works? If that’s too restrictive, try the top twenty instead?
For example, the kind of observations that guide me are:-
(1) Very strong forwards predictability of ‘q’, i.e. in A pages ‘q’ is followed by ‘o’ 97.3% of the time, 97.6% in B pages.
(2) Strong backwards predictability of ‘r’, i.e. in A pages ‘r’ is preceded by ‘o’ 60.2% and ‘a’ 28% of the time, B pages ‘o’ 51% and ‘a’ 28.9%.
(3) Strong backwards predictability of ‘l’, i.e. in A pages ‘l’ is preceded by ‘o’ 73.9% and ‘a’ 22.9% of the time, but in B pages ‘o’ 49% and ‘a’ 29.2%.
(4) word-initial patterns / word-middle patterns / word-end patterns
(5) paragraph top-line patterns / paragraph first-letter patterns
(6) line-initial patterns / line-end patterns (most notably ‘m’ and ‘am’, of course)
(7) word-length patterns in A and B
etc
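Figures like those in (1) to (3) can be computed along the following lines. This is only a sketch in Python: it assumes whitespace-separated words, looks at neighbours within a word only, and treats every transcription character as a separate glyph (so multi-character glyphs such as "ch" are not handled), none of which is necessarily how the percentages above were derived. The function name `neighbour_shares` is hypothetical.

```python
from collections import Counter


def neighbour_shares(text: str, glyph: str, offset: int) -> dict:
    """Share of each glyph found at `offset` from `glyph` within a word.

    offset=+1 gives forwards predictability (what follows `glyph`),
    offset=-1 gives backwards predictability (what precedes it).
    """
    counts = Counter()
    for word in text.split():
        for i, g in enumerate(word):
            j = i + offset
            if g == glyph and 0 <= j < len(word):
                counts[word[j]] += 1
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()} if total else {}


# Invented sample, not real Voynichese statistics.
sample = "qokol qotar qkal or"
print(neighbour_shares(sample, "q", +1))  # what follows 'q'
print(neighbour_shares(sample, "r", -1))  # what precedes 'r'
```

Running it separately over the A pages and the B pages would give the kind of side-by-side comparison listed above.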
Perhaps more importantly, I try to use these to frame active hypotheses to answer the basic cryptographic questions: where (in such a large technical document) are all the numbers? How (if it’s not obviously polyalphabetic) were doubled letters hidden? Where are common words like ‘and’ and ‘the’? Is there (despite the apparently small alphabet) any sign of a nomenclator or code book? What was the visual and processual relationship between the ciphertext and the plaintext? How do the pictures fit into all this?
But until we can really understand the simplest things (in particular “qo”), we probably don’t stand a chance of making progress in any of these. 🙁