A little bird (hi, Terri) told me about a flurry of activity on the Voynich mailing list prompted by some posting by Sean Palmer, who Cipher Mysteries readers may remember from his pages on Michitonese and the month names. Well, this time round he’s gone after a rather more ambitious target – the internal word structure of Voynichese.

Loosely building on Jorge Stolfi’s work on Voynichese word paradigms, Sean proposes a broadly inclusive Voynichese word generator:-

^                      <---- i.e. start of a word
(q | y | [ktfp])*      <---- i.e. zero or more instances of this group
(C | T | D | A | O)*   <---- i.e. zero or more instances of this group
(y | m | g)?           <---- i.e. 0 or 1 instances of this group
$                      <---- i.e. end of a word
...where...
C = [cs][ktfp]*h*e*    <---- i.e. basically (ch | sh | c-gallows-h) followed by 0 or more e's
T = [ktfp]+e*          <---- i.e. one or more gallows characters followed by 0 or more e's

D = [dslr]             <---- i.e. (d | s | l | r)
A = ai*n*              <---- i.e. basically (a | an | ain | aiin | aiiin)
O = o
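
For concreteness, here is a minimal sketch of that paradigm as a Python regular expression (assuming an EVA transcription, one word per string; the helper name is mine, not Sean’s):

import re

# Sub-patterns exactly as defined above (EVA letters assumed)
C = r"[cs][ktfp]*h*e*"   # ch / sh / c-gallows-h, plus trailing e's
T = r"[ktfp]+e*"         # one or more gallows, plus trailing e's
D = r"[dslr]"
A = r"ai*n*"
O = r"o"

SEAN_WORD = re.compile(rf"(?:q|y|[ktfp])*(?:{C}|{T}|{D}|{A}|{O})*[ymg]?")

def fits_sean(word: str) -> bool:
    # fullmatch() supplies the ^ ... $ anchoring shown above
    return bool(SEAN_WORD.fullmatch(word))

# fits_sean("qokeedy") -> True, but fits_sean("qqqqq") -> True as well,
# which is exactly the over-generation problem discussed next.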

Sean says that his word paradigm accounts for 95% (later 97%) of Voynichese words, but I’d say that (just as Philip Neal points out in his reply) this is because it generates way too many words: what it gains in coverage, it loses in tightness (and more on this below).

Philip Neal’s own Voynichese word generator looks something like this:-

^
(d | k | l | p | r | s | t)?
(o | a)?
(l | r)?
(f | k | p | t)?
(sh | ch)?
(e | ee | eee | eeee)?
(d | cfh | ckh | cph | cth)?
(a | o)?
(m | n | l | in | iin | iiin)?
(y)?
$
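
Again purely as a sketch (EVA assumed, names mine), this paradigm maps onto a Python regex almost slot for slot:

import re

PHILIP_WORD = re.compile(
    r"(?:d|k|l|p|r|s|t)?"
    r"(?:o|a)?"
    r"(?:l|r)?"
    r"(?:f|k|p|t)?"
    r"(?:sh|ch)?"
    r"(?:e|ee|eee|eeee)?"
    r"(?:d|cfh|ckh|cph|cth)?"
    r"(?:a|o)?"
    r"(?:m|n|l|in|iin|iiin)?"
    r"y?"
)

def fits_philip(word: str) -> bool:
    return bool(PHILIP_WORD.fullmatch(word))

# fits_philip("daiin") -> True; fits_philip("qokeedy") -> False (note: no q slot at all)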

Though this is *much* tighter than Sean’s, it still fails to nail the tail to the sail (I just made that up). By 2003, I’d convinced myself that the flavour of Voynichese wasn’t ever going to be satisfactorily captured by any sequential generator, so I tried defining an experimental Markov state-machine to give an ultra-tight word generator:-

It wasn’t by any means perfect (there are no p or f characters, for a start), but it was the kind of thing I’d expect a “properly tight” word paradigm to look like. But even this proved unsatisfactory, because that was about the time when I started seeing o / a / y as multivalent, by which I mean “performing different roles in different contexts”. Specifically:-

  • Is the ‘o’ in ‘qo’ the same as the ‘o’ in ‘ol’ or ‘or’?
  • Is the ‘a’ in ‘aiin’ the same as the ‘a’ in ‘al’ or the ‘a’ in ‘ar’?
  • Is word-initial ‘y’ the same as word-terminal ‘y’?

Personally, I think the answer to all three of these questions is an emphatic ‘no’: and so for me it was the shortest of conceptual hops from there to seeing these as elements of a verbose cipher. Even if you disagree with me about the presence of verbose cipher in the system, I think satisfactorily accounting for o / a / y remains a problem for all proposed cipher systems, as these appear to be knitted into the overwhelming majority of glyph-level adjacency rules / structures.

Really, the test of a good word generator is not raw dictionary coverage but instance coverage (“tightness”), by which I mean “what percentage of a given paradigm’s generated words do the observed instances actually make up”.

Philip’s paradigm generates (8 x 3 x 3 x 5 x 3 x 5 x 6 x 3 x 7 x 2) = 1,360,800 possible words, while my four-column generator produces – errrrm – no more than 1192 (I think, please correct me if I’m wrong): by contrast, Sean’s generator is essentially infinite. OK, it’s true that each of the three is optimized around different ideas, so it’s probably not entirely fair to compare them like this. All the same (and particularly when you look at Currier A / B sections, labels, etc), I think that tightness will always be more revealing than coverage. And you can quote me on that! 😉
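
If you want to put numbers like these on a paradigm yourself, here’s one rough way to do it in Python: because Philip’s paradigm is finite, you can simply enumerate everything it generates and compare that set against the word types you observe (the filename below is just a placeholder for whatever EVA word list you prefer):

from itertools import product

# One list per optional slot in Philip's paradigm ("" = slot left empty)
SLOTS = [
    ["", "d", "k", "l", "p", "r", "s", "t"],
    ["", "o", "a"],
    ["", "l", "r"],
    ["", "f", "k", "p", "t"],
    ["", "sh", "ch"],
    ["", "e", "ee", "eee", "eeee"],
    ["", "d", "cfh", "ckh", "cph", "cth"],
    ["", "a", "o"],
    ["", "m", "n", "l", "in", "iin", "iiin"],
    ["", "y"],
]

# 8 x 3 x 3 x 5 x 3 x 5 x 6 x 3 x 7 x 2 = 1,360,800 combinations
# (slightly fewer distinct strings, since some combinations collide)
generated = {"".join(parts) for parts in product(*SLOTS)}

# Placeholder word list: one EVA word per line
with open("eva_words.txt") as f:
    observed = {line.strip() for line in f if line.strip()}

hits = observed & generated
coverage  = len(hits) / len(observed)    # how much of the text the paradigm explains
tightness = len(hits) / len(generated)   # how little junk it also allows

print(f"coverage = {coverage:.1%}, tightness = {tightness:.3%}")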

10 thoughts on “Sean Palmer’s Voynichese word generator…”

  1. What happened to Stolfi? His work is mentioned so often, I wonder whether he is still following the Voynich discussions.

    Oh yes, and why can’t we all use Voyn-101? Isn’t EVA recognised as being inferior? I still haven’t got the hang of mentally converting “qokeedy” into Voyn-101.

  2. Julian: As far as I know, Stolfi just got too busy with other stuff to fit Voynich studies in as well – one day he’ll come back to the fold. 🙂

    Voyn-101 has different problems: neither it nor EVA has Voynichese mapped out 100%. The biggest problem with both systems arises from those adherents who take them too literally – for example, people who treat EVA ‘h’ as a real letter worthy of statistical or linguistic analysis. Oh, well!

    (i.e. in both cases, you almost always need to pre-filter the data to fit your own particular transcription model before doing tests on it.)

  3. Maybe this has been mentioned on the mailing list, but what you refer to as tightness and coverage are called precision and recall in the grammar induction literature.

    In this case precision would be the number of generated Voynich words divided by the total number of generated words, whereas recall would be the number of generated Voynich words divided by the total number of Voynich words. It is always possible to get perfect recall by simply generating all sequences of all letters with a very compact generator. People tend to take the harmonic mean of these two values (the f-score) as a reasonable indicator of how good a generator is (there’s a short code sketch of these definitions at the end of this comment).

    One problem here is that we don’t have the set of all legal Voynich words, and we don’t really know how large that set is, so it’s difficult to find the recall exactly. The generator might be generalizing perfectly and generating lots of legal Voynich words that simply aren’t in the manuscript, and we’d have no way of knowing.

    Another problem is that to say anything meaningful about these generators we should compare them to generators for similar sequences of tokens. We can’t really say what it means that such a small generator gets f-score X unless we know that (for instance) English words require a vastly more complex generator to get that kind of f-score.

    One way to solve this would be to run a grammar induction algorithm on Voynichese and natural languages and compare the complexity of the generated grammars. If such an algorithm finds significantly smaller generators for Voynichese, then that’s an indication (though not proof) that Voynichese has a more regular word structure.

    I actually tried this a while back with a relatively simple algorithm but results were very much inconclusive.
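
    For reference, here are those three definitions in code form – just a minimal sketch, where observed is the set of attested word types and generated is the (finite) set a paradigm produces:

    def precision_recall_f1(generated: set, observed: set):
        hits = generated & observed
        precision = len(hits) / len(generated)   # = your "tightness"
        recall    = len(hits) / len(observed)    # = your "coverage"
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    # Only meaningful when the generated set is finite, of course - which is
    # precisely why a generator that accepts everything isn't automatically good.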

  4. Peter: thanks very much for this – I’m surprised (at myself) that I haven’t yet reviewed your 2006 paper on applying ADIOS to the VMs, it’s certainly something I’ve been meaning to do. All the same, the short version would read: “I strongly suspect that you would need to apply some kind of pre-parsing to the Voynichese corpus in order to give the ADIOS algorithm any real chance of producing anything revealing.” Would you be interested in trying this?

  5. I’ve been meaning to revisit that paper (actually my BSc thesis) for a long time now. Before you spend too much time with it you should know that my interpretation/implementation of the algorithm was flawed. I fixed it for a later project, see here (not related to the VMs, but a nicer paper). It would take some time to get the code going again, but it’s worth a shot.

    I sort of gave up on it because I found that ADIOS relies strongly on knowing where the sentence boundaries are. Of course, on the word level, this is no problem…

    I would be very interested in hearing about pre-parsing strategies. I’ve always been a little too lazy to get too deep into the specifics of transcription and just used whatever EVA transcription seemed reasonable for quick experiments.

  6. Peter: I’ve spent years trying to work out how Voynichese should be pre-parsed, because I think this runs right to the heart of its cryptological challenge.

    The first parsing level is that EVA was deliberately designed to be a stroke-based transcription, so there’s a high probability that ch / sh / ckh / cth / cph / cfh / eeee / eee / ee all represent individual glyphs (the Voy-101 transcription transcribes all these as individual glyphs for this reason).

    The second parsing level is that I think we’re looking at a verbosely enciphered scribal shorthand (where the brevity gained from the contraction & abbreviation is cancelled out by the verbose expansion) based around o/a/y, and hence groups like qo / ok / ot / op / of / ockh / octh / ocph / ocfh / ol / or / om / al / ar / air / aiir / aiiir / an / ain / aiin / aiiin / am / yk / yt / yp / yf / yckh / ycth / ycph / ycfh / eo all encipher individual tokens (but be sure to parse qo first, as ‘qok’ = ‘qo’ + ‘k’, not ‘q’ + ‘ok’). Just so you know, my prediction is that ‘d’ codes for a ‘contraction’ token, ‘-y’ for an ‘abbreviation’ token, and a[i][i][i]n for Arabic digits (in some way), while ‘l-‘ in Currier B is an optional equivalent to ‘ol-‘ in Currier A.

    The third parsing level is that, apparently because ol / or / al / ar often appear next to each other, the encipherer occasionally inserts a space inside the pairs, apparently to break up the visual pattern. Hence I suspect that any occurrences of o.l / o.r / a.l / a.r should first be reduced to ol / or / al / ar in order to undo this transform.

    Hence, what I’m suggesting is that it is the pervasiveness of the o/a/y groups that causes automatic rule finders (and similar Markov state machine inference engines) to find no obvious word structure: so if you pre-parse these out, I’m pretty sure that a whole deeper level of word structure will present itself.

    Hope this is a help!
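
    PS: for anyone who wants to experiment, here’s a rough sketch of the second-level pre-parse I mean – greedily match the verbose groups listed above (longest first, and with qo consumed before a bare o), falling back to single glyphs where no group fits. Treat the group list as a starting point rather than gospel:

    # Verbose groups from above, longest first so that e.g. "aiiin" wins over "aiin" / "ain" / "an"
    GROUPS = sorted([
        "qo", "ok", "ot", "op", "of", "ockh", "octh", "ocph", "ocfh", "ol", "or", "om",
        "al", "ar", "air", "aiir", "aiiir", "an", "ain", "aiin", "aiiin", "am",
        "yk", "yt", "yp", "yf", "yckh", "ycth", "ycph", "ycfh", "eo",
    ], key=len, reverse=True)

    def preparse(word: str) -> list:
        """Split an EVA word into verbose-group tokens, falling back to single glyphs."""
        tokens, i = [], 0
        while i < len(word):
            for g in GROUPS:
                if word.startswith(g, i):
                    tokens.append(g)
                    i += len(g)
                    break
            else:
                tokens.append(word[i])   # no group matched: keep the single glyph
                i += 1
        return tokens

    # preparse("qokaiin") -> ['qo', 'k', 'aiin'], i.e. 'qok' parses as 'qo' + 'k', not 'q' + 'ok'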

  7. Hello Nick,

    A small point (which has been covered somewhat before), but wouldn’t one way to “tighten up” the word model be to try to de-conflict the symbols that could never fit the model (i.e. resolve the “end of line”-dependent symbols into plausible existing symbols)?

    EVA “g” occurs about 96 times, of which around 65 are at line ends, thus it has a manufactured feel to it versus being part of a word structure… my guess is it’s just a “d” with a tail the author added.

    EVA “m” occurs about 1196 times, of which about 820 are at line ends. This symbol seems to behave the same as “g”, and therefore may be a flourished version of an existing symbol (my guess: “i”).

    Although a small impact overall, they just don’t seem to fit the word models given their end-of-line dependence. If you accept the above, the question now is why the author would do this… an answer may lie in your comment on “multivalent” symbols, which I also believe is key to the VM structure/cipher.

    BTW: 67% of g’s are at line ends
    68% of m’s are at line ends

    …strange coincidence indeed.

    TT

  8. Tim: a proper word model would also need to ignore the first letter of a page (or indeed paragraph, or indeed line), as well as the last one or two characters (particularly if they are ‘am’), as well as Philip Neal’s key-like features (which are prototypically bound by a pair of f or p gallows, about 2/3rds of the way along the top line of a page or paragraph). Philip has pointed this out for years, and I completely agree.

    However, perhaps the most useful feature of a properly ‘tight’ (or, as Peter Bloem prefers, ‘precise’) word generator would be to quickly direct our attention to those places where the model does not apply. These should be hugely informative, as they perhaps point to breaks in the system caused by awkward content in the plaintext – possibly proper names, numbers, etc.

    In fact, it may well be that we should be trying to create a two-track generator – an extremely precise generator [in green] and a reasonably precise generator [in amber] – and using these to automatically mark up a transcription [with anything that doesn’t fit in red]. Just a thought!
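
    Something along these lines, say (a sketch only – tight_re and loose_re stand for whichever “extremely precise” and “reasonably precise” generators you end up settling on):

    import re

    def classify(word: str, tight_re: re.Pattern, loose_re: re.Pattern) -> str:
        """Two-track markup: green / amber / red, as described above."""
        if tight_re.fullmatch(word):
            return "green"   # fits the extremely precise generator
        if loose_re.fullmatch(word):
            return "amber"   # only fits the reasonably precise generator
        return "red"         # fits neither - flag for closer inspection

    # e.g. markup = [(w, classify(w, TIGHT, LOOSE)) for w in words]
    # (TIGHT, LOOSE and words are placeholders for your own compiled patterns and word list)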

  9. If the VM is composed using an aleatoric game, then you’d expect the words which cannot be generated by (our reconstruction of) the game to be the product of scribal errors or boredom. Examination of the non-covered words would give insight there as well.

    It might also be worth comparing it to the Codex Seraphinianus as an example of a text “made up to look good” without any rules or meaning. It would thus be useful as a control, along with Latin, English, and various secret languages or contemporary codes.

  10. Scott: oooh, I don’t know about that. Trying to classify Johannes Ockeghem’s 15th century multi-modal music as “aleatoric” would be stretching the term beyond breaking point: aleatoric artistic works don’t seem to me to have any obvious history pre-1900, which is one of the key reasons why Gordon Rugg’s pseudo-random Cardan grille daemon is so much of a stretch as a way of generating the VMs. This was in an era without 20-sided dice (though the Romans apparently had some, if this one is not an Egyptian fake), so we should see some pretty obvious statistics emerge from any aleatoric process, right?

    Also, I’m still far from convinced that the Codex Seraphinianus’s text was constructed solely to look good – there’s a core of deliberately warped hyperrationality to it, whatever Serafini may now (belatedly) claim – and so I’m not sure that all the parallels with it will turn out to be as revealing as you hope. Oh well!
