As I see it, there are four foundational tasks that need to be done to wrangle Voynichese into a properly usable form:

* Task #1: Transcribing Voynichese text into a reliable computer-readable raw transcription e.g. EVA qokeedy
* Task #2: Parsing the raw transcription to determine Voynichese’s fundamental units (its tokens) e.g. [qo][k][ee][dy]
* Task #3: Clustering the pages / folios into groups where the text shares distinct features e.g. Currier A vs Currier B
* Task #4: Normalizing the clusters e.g. how A tokens / patterns map to B tokens / patterns, etc

I plan to tackle these four areas in separate posts, to try to build up a substantive conversation on each topic in turn.
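To make Task #2 concrete: a raw EVA string like qokeedy has to be parsed into candidate units, and even a crude greedy parser makes the underlying choices visible. The sketch below is a minimal Python illustration; the token inventory is entirely hypothetical, since deciding the real inventory is precisely what Task #2 is about.

```python
# Hypothetical longest-match-first unit inventory (illustrative only).
UNITS = sorted(["qo", "ch", "sh", "ee", "dy", "aiin", "ain", "ol", "or",
                "k", "t", "d", "y", "o", "a", "e", "i", "l", "r", "s", "n"],
               key=len, reverse=True)

def tokenize(word: str) -> list[str]:
    """Greedy longest-match parse; falls back to single characters."""
    tokens, pos = [], 0
    while pos < len(word):
        for unit in UNITS:
            if word.startswith(unit, pos):
                tokens.append(unit)
                pos += len(unit)
                break
        else:
            tokens.append(word[pos])  # unknown glyph: keep it as-is
            pos += 1
    return tokens

print(tokenize("qokeedy"))  # -> ['qo', 'k', 'ee', 'dy']
```

Note that a different (equally defensible) inventory would parse the same word quite differently, which is why Task #2 has to be settled before token statistics can be trusted.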

Takahashi’s EVA transcription

Rene Zandbergen points out that, of all the different “EVA” transcriptions that appear interleaved in the EVA interlinear file, “the only one that was really done in EVA was the one from Takeshi. He did not use the fully extended EVA, which was probably not yet available at that time. All other transcriptions have been translated from Currier, FSG etc to EVA.”

This is very true, and is the main reason why Takeshi Takahashi’s transcription is the one most researchers tend to use. Yet aside from not using extended EVA, there are a fair few idiosyncratic things Takeshi did that reduce its reliability, e.g. as Torsten Timm points out, “Takahashi reads sometimes ikh where other transcriptions read ckh”.

So the first thing to note is that the EVA interlinear transcription file’s interlinearity arguably doesn’t help us much at all. In fact, until such time as multiple genuinely EVA transcriptions are added to it, its interlinearity is more of a historical burden than something that gives researchers any noticeable statistical gain.

What this suggests to me is that, given the high quality of the scans we now have, we really should be able to collectively determine a single ‘omega’ stroke transcription: and even where any ambiguity remains (see below), we really ought to be able to capture that ambiguity within the EVA 2.0 transcription itself.

EVA, Voyn-101, and NEVA

The Voyn-101 transcription used a glyph-based Voynichese transcription alphabet derived by the late Glen Claston, who invested an enormous amount of his time to produce a far more all-encompassing transcription style than EVA did. GC was convinced that many (apparently incidental) differences in the ways letter shapes were put on the page might encipher different meanings or tokens in the plaintext, and so ought to be captured in a transcription.

So in many ways we already have a better transcription, even if it is one very much tied to the glyph-based frame of reference that GC was convinced Voynichese used (he firmly believed in Leonell Strong’s attempted decryption).

Yet some aspects of Voynichese writing slipped through the holes in GC’s otherwise finely-meshed net, e.g. the scribal flourishes on word-final EVA n shapes, a feature that I flagged in Curse back in 2006. And I would be unsurprised if the same were to hold true for word-final -ir shapes.

All the same, GC’s work on v101 could very well be a better starting point for EVA 2.0 than Takeshi’s EVA. Philip Neal writes: “if people are interested in collaborating on a next generation transcription scheme, I think v101/NEVA could fairly easily be transformed into a fully stroke-based transcription which could serve as the starting point.”

EVA, spaces, and spatiality

For Philip Neal, one key aspect of Voynichese that EVA neglects is measurements of “the space above and below the characters – text above, blank space above etc.”

To which Rene adds that “for every character (or stroke) its coordinates need to be recorded separately”, for the reason that “we have a lot of data to do ‘language’ statistics, but no numerical data to do ‘hand’ statistics. This would, however, be solved by […having] the locations of all symbols recorded plus, of course their sizes. Where possible also slant angles.”

The issue of what constitutes a space (EVA .) or a half-space (EVA ,) has also not been properly defined. To get around this, Rene suggests that we should physically measure all spaces in our transcription and then use a software filter to transform that (perhaps relative to the size of the glyphs around it) into a space (or indeed half-space) as we think fit.
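As a minimal sketch of that “measure first, filter later” idea: record raw gap widths in the transcription, then classify each gap relative to local glyph size. The thresholds below (0.35 and 0.8) are invented for illustration; the real cutoffs are exactly what would need to be argued over.

```python
def classify_gap(gap_width: float, mean_glyph_width: float,
                 half: float = 0.35, full: float = 0.8) -> str:
    """Map a measured gap to EVA's '.', ',' or no space at all."""
    ratio = gap_width / mean_glyph_width
    if ratio >= full:
        return "."   # confident word space
    if ratio >= half:
        return ","   # uncertain half-space
    return ""        # too narrow: treat as no space

# e.g. a 12px gap between glyphs averaging 20px wide:
print(classify_gap(12, 20))  # ratio 0.6 -> ','
```

The point of keeping the raw measurements is that the filter can be re-run later with different (or context-dependent) thresholds without re-transcribing anything.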

To which I’d point out that there are also many places where spaces and/or half-spaces seem suspect for other reasons. For example, it would not surprise me if spaces around many free-standing ‘or’ groups (such as the famous “space transposition” sequence “or or oro r”) are not actually spaces at all. So it could well be that there would be context-dependent space-recognition algorithms / filters that we might very well want to use.

Though this at first sounds like a great deal of work to be contemplating, Rene is undaunted. To make it work, he thinks that “[a] number of basics should be agreed, including the use of a consistent ‘coordinate system’. Again, there is a solution by Jason Davies [i.e.], but I think that it should be based on the latest series of scans at the Beinecke (they are much flatter). My proposal would be to base it on the pixel coordinates.”

For me, even though a lot of these would be nice things to have (and I will be very interested to see Philip’s analysis of tall gallows, long-tailed characters and space between lines), the #1 frustration about EVA is still the inconsistencies and problems of the raw transcription itself.

Though it would be good to find a way of redesigning EVA 2.0 to take these into account, perhaps it would be better to find a way to stage delivery of these features (hopefully via OCR!), just so we don’t end up designing something so complicated that it never actually gets done. 🙁

EVA and Neal Keys

One interesting (if arguably somewhat disconcerting) feature of Voynichese was pointed out by Philip Neal some years ago. He noted that where Voynichese words end in a gallows character, they almost always appear on the top line of a page (sometimes the top line of a paragraph). Moreover, these had a strong preference for being single-leg gallows (EVA p and EVA f); and also for appearing in nearby pairs with a short, often anomalous-looking stretch of text between them. And they also tend to occur about 2/3rds of the way across the line in which they fall.

Rather than call these “top-line-preferring-single-leg-gallows-preferring-2/3rd-along-the-top-line-preferring-anomalous-text-fragments”, I called these “Neal Keys”. Some researchers (particularly linguists) have objected to this term ever since, because it superficially sounds as though it presupposes a cryptographic mechanism. From my point of view, those same researchers didn’t object too loudly when cryptologist Prescott Currier called his Voynichese text clusters “languages”: so perhaps on balance we’re even, OK?

I only mention this because I think that EVA 2.0 ought to include a way of flagging likely Neal Keys, so that researchers can filter them in or out when they carry out their analyses.
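As a very rough illustration of what such a flagging filter might look like (all the thresholds and the line format are assumptions, not measurements):

```python
def flag_neal_keys(line: str, is_top_line: bool,
                   lo: float = 0.5, hi: float = 0.85) -> list[str]:
    """Return words on a top line that look like Neal Key candidates:
    ending in a single-leg gallows (EVA p/f), roughly 2/3 across."""
    if not is_top_line:
        return []
    words = line.split(".")              # EVA uses '.' as word separator
    flagged = []
    for w in words:
        if not w.endswith(("p", "f")):
            continue
        pos = line.index(w) / max(len(line), 1)   # crude 2/3-across test
        if lo <= pos <= hi:
            flagged.append(w)
    return flagged

print(flag_neal_keys("fachys.ykal.ar.ataiin.olkchp.otchy", True))
# -> ['olkchp']
```

A real implementation would of course work from measured positions rather than character offsets, but even a crude flag like this would let analyses include or exclude these anomalous stretches consistently.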

EVA and ambiguity

As I discussed previously, one problem with EVA is that it doesn’t admit to any uncertainty: by which I mean that once a Voynichese word has been transcribed into EVA, it is (almost always) then assumed to be 100% correct by all the people and programmes that subsequently read it. Yet we now have good enough scans to be able to tell that this is simply not true, insofar as there are a good number of words that do not conform to EVA’s model for Voynichese text, and for which just about any transcription attempt will probably be unsatisfactory.

For example, the word at the start of the fourth line on f2r:

Here, the first part could possibly be “sh” or “sho”, while the second part could possibly be “aiidy” or “aiily”: in both cases, however, any transcriber attempting to reduce it to EVA would be far from certain.

Currently, the most honest way to transcribe this in EVA would be “sh*,aii*y” (where ‘*’ indicates “don’t know / illegible”). But this is an option that isn’t taken as often as it should be.

I suspect that in cases like this, EVA should be extended to try to capture the uncertainty. One possible way would be to include a percentage probability that an alternate reading is correct. In this example, the EVA transcription could be “sh!{40%=o},aiid{40%=*}y”, where “!{40%=o}” would mean “the most likely reading is that there is no character there (i.e. ‘!’), but there is a 40% chance that the character should be ‘o'”.

For those cases where two or more EVA characters are involved (e.g. where there is ambiguity between EVA ch and EVA ee), the EVA string would instead look like “ee{30%=ch}”. And on those occasions where there is a choice between a single letter and a letter pair, this could be transcribed as “!e{30%=ch}”.

For me, the point about transcribing with ambiguity is that it allows people doing modelling experiments to filter out words that are ambiguous (i.e. by including a [discard words containing any ambiguous glyphs] check box). Whatever’s going on in those words, it would almost always be better to ignore them rather than to include them.
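To show that the proposed markup would be easy to work with mechanically, here is a hypothetical parser for it. The {40%=o} syntax is the one suggested above, not an existing EVA convention, and the helper names are mine.

```python
import re

# One glyph followed by an {NN%=X} alternate reading, as proposed above.
AMBIG = re.compile(r"(.)\{(\d+)%=(.)\}")

def is_ambiguous(word: str) -> bool:
    """True if the word carries any alternate reading or illegible glyph."""
    return bool(AMBIG.search(word)) or "*" in word

def strip_markup(word: str) -> str:
    """Keep only the most likely reading ('!' = no character at all)."""
    return AMBIG.sub(lambda m: "" if m.group(1) == "!" else m.group(1),
                     word)

words = ["sh!{40%=o},aiid{40%=*}y", "qokeedy"]
unambiguous = [w for w in words if not is_ambiguous(w)]
print(strip_markup("sh!{40%=o},aiid{40%=*}y"))  # -> 'sh,aiidy'
print(unambiguous)                              # -> ['qokeedy']
```

The filter in the last two lines is essentially the “discard words containing any ambiguous glyphs” check box in code form.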

EVA and Metadata

Rene points out that the metadata “were added to the interlinear file, but this is indeed independent from EVA. It is part of the file format, and could equally be used in files using Currier, v101 etc.” So we shouldn’t confuse the usefulness of EVA with its metadata.

In many ways, though, what we would really like to have in the EVA metadata is some really definitive clustering information: though the pages are currently labelled A and B, there are (without any real doubt) numerous more finely-grained clusters that have yet to be determined in a completely rigorous and transparent (open-sourced) way. However, that is Task #3, which I hope to return to shortly.

In some ways, the kind of useful clustering I’m describing here is a kind of high-level “final transcription” feature, i.e. of how the transcription might well look much further down the line. So perhaps any talk of transcription

How to deliver EVA 2.0?

Rene Zandbergen is in no doubt that EVA 2.0 should not be in an interlinear file, but in a shared online database. There is indeed a lot to be said for having a cloud database containing a definitive transcription that we all share, extend, mutually review, and write programmes to access (say, via RESTful commands).

It would be particularly good if the accessors to it included a large number of basic filtering options: by page, folio, quire, recto/verso, Currier language, [not] first words, [not] last words, [not] first lines, [not] labels, [not] key-like texts, [not] Neal Keys, regexps, and so forth – a bit like on steroids. 🙂
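Purely as an illustration of what such filtered access might look like to client code, here is a toy query builder; the endpoint and parameter names are invented, not a real API.

```python
from urllib.parse import urlencode

def build_query(base: str = "https://example.org/api/v1/words",
                **filters: str) -> str:
    """Build a hypothetical filtered REST query string."""
    return base + "?" + urlencode(filters)

print(build_query(currier_language="B", exclude="labels,neal_keys",
                  folio="f2r"))
# -> https://example.org/api/v1/words?currier_language=B&exclude=labels%2Cneal_keys&folio=f2r
```

The detail that matters is not the URL shape but that every filter (Currier language, labels, Neal Keys, etc.) becomes a named, repeatable parameter rather than an ad hoc manual step.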

It would also be sensible if this included open-source (and peer-reviewed) code for calculating statistics – raw instance counts, post-parse statistics, per-section percentages, 1st and 2nd order entropy calculations, etc.
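For instance, the 1st and 2nd order entropy statistics mentioned above boil down to a few lines. This sketch computes them over a plain character stream, leaving aside all the parsing questions from Task #2.

```python
from collections import Counter
from math import log2

def h1(text: str) -> float:
    """First-order entropy in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * log2(c / n) for c in counts.values())

def h2(text: str) -> float:
    """Conditional entropy of a character given its predecessor."""
    pairs = Counter(zip(text, text[1:]))
    singles = Counter(text[:-1])
    n = len(text) - 1
    return -sum(c / n * log2(c / singles[a]) for (a, b), c in pairs.items())

sample = "qokeedyqokeedyqokain"
print(round(h1(sample), 3), round(h2(sample), 3))
```

Having reference implementations like these open-sourced and peer-reviewed would mean that when two researchers quote “second-order entropy”, they are at least computing the same number.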

Many of these I built into my JavaScript Voynichese state machine from 2003: there, I wrote a simple script to convert the interlinear file into JavaScript (developers now would typically use JSON or I-JSON).

However, this brings into play the questions of boundaries (how far should this database go?), collaboration (who should build this database?), methodology (what language or platform should it use?), and also of resources (who should pay for it?).

One of the strongest reasons for EVA’s success was its simplicity: and given the long (and complex) shopping list we appear to have, it’s very hard to see how EVA 2.0 will be able to compete with that. But perhaps we collectively have no choice now.

24 thoughts on “Voynichese Task #1: moving towards “EVA 2.0”…”

  1. Josef Zlatoděj Prof. on May 14, 2017 at 6:21 pm said:

    Nick. Eva 1. Eva 02. Eva O3. Eva 4. etc. is wrong. ( bad ).

    Page 2r is very important. As I wrote to you. Root. 🙂
    Everyone shows you how to write. The root is composed of 4 characters.
    All have a value of 3. ( C= 3, G = 3, S =3, L = 3 ).
    The author shows you how to write. Numerological system . When each letter has its numeric value.
    That’s why the author shows you at the beginning of the manuscript.
    On page 1 there is a translation instructions. ( In Czech ).

  2. Hand code breaking methods that I learned 60 years ago at Ft. Devens were fun, but not up to date. Today NSA uses computerized screening techniques, a few of which are described by

  3. Hi Nick – I published my Voynich API to GitHub under MIT license:

    The main intention is to present the original EVA interlinear as a resource that can be accessed over the web.

    The main feature that is maybe new is the functionality that allows for retrieving text as a sequence of morphemes e.g. [d][a][iin] or [d][aiin] or [da][ii][n] etc in a consistent and repeatable fashion.

    There’s a bunch of examples showing usage of the API to retrieve and analyse various sections of the interlinear. I recreated some of the ‘classic’ experiments e.g. word-length distribution and the Sukhotin vowel-identification algorithm.

    Hope it is useful. I’d be keen to collaborate with people who are interested to taking this to the next level per your suggestions around NEVA/ EVA 2.0/ voy101 etc.

    Great post – lots of useful comments about moving this forward.

    All the best,

  4. Robin: very interesting, and well worth a post on its own, thanks!

  5. Don Latham on May 15, 2017 at 5:45 pm said:

    Nick: Perhaps crowdfunding and git is possible?

  6. John Turnley on May 16, 2017 at 4:58 pm said:

    [Insert lots of hand-waving and unsupported assumptions]

    In overly simplified terms, what if every character were represented by a number (more likely a large set of numbers) that described that particular character. Think of a bit image at some resolution. Two characters that look a lot alike would have different, but similar representations.

    Through some computer coding magic, you could quantify just how closely two characters resemble each other. You could have characters A, B, and C where B is X “distance” away from being the same as A, and Y “distance” away from being the same as C. It’s a way to derive the probabilities of (B == A) or (B == C) or (B != A or C).

    Ideally, it would also be context specific. Which characters surround the character? Where is it in a line? How close are the characters before and after it? Where is it on the page? Which page? Etc., etc.

    Essentially, it would be an electronic copy of the physical text. It would take a lot of the human error out of interpretation of the text.

    This is all off the top of head. It’s a very large and very difficult problem, hence all the hand-waving and qualifiers.
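For what it’s worth, the core of this idea is easy to prototype: treat each glyph as a bitmap and measure the fraction of differing pixels. The 3×3 “glyphs” below are made up purely for illustration, not real Voynich characters.

```python
def glyph_distance(a: list[int], b: list[int]) -> float:
    """Normalized Hamming distance between two equal-size bitmaps."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

glyph_a = [1, 1, 0,
           1, 0, 0,
           1, 1, 1]
glyph_b = [1, 1, 0,
           1, 0, 0,
           1, 1, 0]   # differs from glyph_a in one pixel

print(glyph_distance(glyph_a, glyph_b))  # -> 0.111... (1 of 9 pixels)
```

A real system would obviously need alignment, scaling and the contextual features described above, but the “distance between characters” notion itself is this simple at heart.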

  7. Young Kim on May 19, 2017 at 4:21 am said:

    I would like to share some Voynich fonts with others for non-profit purpose. It’s not professionally made, but it goes just fine with other usual Roman-type fonts in a text. Is there someone who can host those files so that anyone can download them to use?

  8. Young Kim: in which format are these fonts?

    I’m hosting TTF files for Eva and v101 at my web site, and there are WOFF files for both embedded as well (even if only the Eva one seems to work).

    An important aspect of both files is the mapping between the fonts’ character shapes and ASCII values.

  9. The voyn_101 version of the first word of 2r.4 is ãaI89, NEVA śöaudy. I envisage breaking this down into a transcription at the level of individual stroke with some such representation as


    with NEVA type diacritics which would be held in a data table looking roughly like this

    Stroke   Stroke type   Variant   Glyph   Join
    Ç        C             94        ś       Contiguous +1
    )                      07                Near contiguous
    Ć        C             12        ö       Contiguous
    D                      83                Space: 1em
    Ċ        C             20        a       Half contiguous
    Ĩ        I             11                Contiguous
    Ì        I             …         …       …

    Something like this could fairly easily be generated from NEVA and then voyn_101 and the other existing transcriptions could be reverse engineered from the database, or people could specify their own Voynich alphabet. If Rene’s ideas require a machine-readable text as starting point, then it could be designed and specified with his requirements in mind as well.

    This is just a suggestion: I think a new transcription is desirable and that it should be both fine-grained and intuitive, but there is no point going ahead unless a number of us are all agreed on what is wanted.

  10. Young Kim on May 20, 2017 at 5:27 am said:

    Rene Zandbergen: There are two font families named Voynich Mesa and Voynich Symbol and each family has two types of fonts, WOFF and OTF. The character mapping is based on my own definition and does not follow the usual EVA or EVA-like encoding. That is not because I am not fond of them, but because I needed a normalized or print friendly encoding. By “normalized”, I mean that using Voynich characters in a text does not alter the format of text such as the line space and does not sacrifice its readability. To alleviate the gap between EVA and my encoding, I will provide a transcription of Voynich texts from Folio 1r to 67r2 in a MS Word file. There will be an image of keyboard layout and a sample text captured from my article.

    If you can host them, how can I send them to you?

  11. Davidsch on May 23, 2017 at 3:13 pm said:

    There is really no need for another new font or a new transcription. Although I seem to be alone in this, I think there are much better (Voynich) ideas to put energy into, such as cipher-related ideas.

  12. Davidsch: that’s all very well, but how reliable can analyses be if they rely on the existing transcriptions?

  13. The funny thing is: translation _should have_ worked with the existing transcriptions. Even Currier’s.
    It did not, and there is no guarantee whatsoever, that a modern transcription will make this possible.

    There is simply a need to have something more accurate, more independent of assumptions, capturing much more information, and allowing people to verify each other’s results.

    Who can tell me how many characters there are in the MS?
    How many words, or how many lines?
    This sort of basic information is not there.

    Some of these numbers depend on the transcription alphabet, but it should also be possible for people to make their own. Ideally in a way that is repeatable and verifiable.

    Something that is desperately needed in numerical/statistical analysis is a way to describe strings of text in the MS that are “very similar”. That can’t easily be done with existing transcriptions and Unix “grep”.
    That is just one example….
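To illustrate Rene’s point: a similarity ratio can rank near-matches in a way grep cannot. A minimal sketch, with an invented vocabulary and cutoff:

```python
from difflib import SequenceMatcher

def similar_words(target: str, vocab: list[str],
                  cutoff: float = 0.75) -> list[str]:
    """Return vocabulary words whose similarity to target >= cutoff."""
    return [w for w in vocab
            if SequenceMatcher(None, target, w).ratio() >= cutoff]

vocab = ["qokedy", "qokeey", "chedy", "okeedy", "daiin"]
print(similar_words("qokeedy", vocab))  # -> ['qokedy', 'qokeey', 'okeedy']
```

Any shared database would presumably want a more Voynichese-aware similarity measure (stroke-level, or weighted by known confusions like ikh/ckh), but even this generic one goes beyond exact-match searching.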

  14. Dennis S. on May 24, 2017 at 3:50 am said:

    Nick: Phillip Neal is a really sharp guy, so I usually go along with what he says. He said that Glen Claston’s transcription is the best one and the one he usually uses.

  15. Young Kim on May 25, 2017 at 1:06 am said:

    Rene Zandbergen: The active set of Voynich characters I defined in my encoding consists of 27 letters. I think it covers over 99% of the Voynich manuscript. I agree with others who think that just 27 characters are not enough to transcribe the whole manuscript, but I also don’t believe that it will nullify any attempts to understand that 99% of the manuscript text without defining extra characters.

    In my personal opinion, EVA or EVA-like encoding is misleading, since people show a tendency to get attracted to the transcribed outcome and not to focus on the original Voynich text. It might have been rather different if the transcribed Voynich text looked like gibberish rather than something kind of readable.

    In my transcription, the Voynich text looks like gibberish unless it is encoded in the right font. And my encoding is not based on any methodological or statistical theory. The mapping of Voynich characters to Roman alphabets is arbitrary.

  16. Dear Nick! I believe that the ambiguity of EVA-identical VM graphemes is not only a reflection of their real differences (which, of course, any modification of EVA can hardly capture). For the most part, it is the trail of errors in calligraphic gestures made by the scribes of the VM. These mistakes, in turn, arise from two circumstances: first, the VM was executed not only by the author but also by several other people (possibly friends, possibly assistants); second, these other people had not quite mastered the technique of cryptography developed by the author of the VM to modify standard Latin abbreviations. Accordingly, their calligraphy, before reaching normal automaticity, failed, thus exposing some of the standard elements from which the author of the VM compiled his graphemes.
    In my analysis of the VM and the analysis of the principles of decoding you will be able to find some actual confirmation of this point of view.

  17. I’ve experimented with the idea of a crowd-sourced transcription tool.

    It would operate in two stages. In the first stage, participants would step through the manuscript and up/down vote transcribed words and word boundaries. In the second stage we’d submit and apply corrections for flagged words/boundaries.

    Here’s an example of stage 1:

    It opens the manuscript at a random location. You then use the spacebar to up-vote transcribed words/boundaries or x to down-vote.

    The same process would be used to alter a transcription. E.g. transform EVA by replacing “sh” with a new character and crowd-source any corrections.

  18. Job: I think it would be more useful to have different options to vote on, rather than a thumbs-up / thumbs-down. For example, one example came up (when I tried it just now) where there was a dubious space. There, the option would be to choose between space, comma, or nothing at all.

  19. Job, I am very pleased to see your reply, indicating that you are interested in this problem. is a tool appreciated by many people, and a step in the direction of an improved transcription of the MS.

    The link proposes a way to interact with the MS and the transcription.
    For me this is right, regardless whether a ‘crowd’ or a selected few, or a piece of software should make the decisions.

    What has to be done first, in my very strong opinion, is to agree on:
    – which images to use
    – how to define the positions on each page
    – how to store and allow retrieval of all information

    Especially the third point is important. I know that you already implemented an answer for all three in, but for future collaboration on this issue, this should be done in a way that is documented, and acceptable for the users.

  20. Rene: does it have to be an either-or choice between the two sets of scans? For almost all of the kind of purposes we’re talking about here, both are more than good enough: most of the difficult transcription decisions arise from ambiguity rather than from lack of clarity.

  21. Young Kim: the point of EVA was to help a research community to talk, and I don’t obviously see why that’s such a bad thing. :-/

  22. Young Kim on May 27, 2017 at 4:23 am said:

    Nick Pelling: I am sorry if I gave that impression to anyone. It’s not what I meant. I agree that having a good transcription is a very important thing to move forward.

  23. Nick, for transcription purposes the two sets are mostly equivalent, and why not use both. However, for recording of the location of the text elements, and even more for the measuring of handwriting properties (character size, line spacing, slant angle) and all their variations, a standard should be introduced, and the flatter images from the latest digitisation are preferable.
    I imagine the inaccessible images made by Siloé to be even superior.

  24. My last post did not format well.

    The main difficulty I have with EVA is the treatment of sequences like an, am, ain and ey, eey, chey. Broken down into strokes, they amount to cin, ciin, ciin; cy, ccy, cccy etc., sometimes ligatured and sometimes not. What is more, EVA chooses to group the one set rightwards and the other leftwards. Obviously, the transcribers had to settle on one convention or another, but the result is that we do not really know how many instances of a and e there are in the text.

    The solution I propose is a transcription at the level of the stroke, distinguishing all discernible variant strokes in the manner of Glen Claston and specifying the position of each stroke relative to its predecessor and successor. I would also like to see information on what lies above and below the stroke. This information would be held in the columns of a data table like this:

    Stroke type   Stroke variant   Left ligature     Right ligature
    …             …                full space        strong ligature
    …             …                half space        full space
    …             …                strong ligature   half space
    …             …                full space        strong ligature
    …             …                strong ligature   full space
    …             …                weak ligature     strong ligature
    …             …                strong ligature   weak ligature
    Given an analysis of this kind, one could specify many-to-one mappings from strokes to the character sets assumed by the existing transcriptions, e.g.

    full space + c-4 + strong ligature + i-2 + half space -> a

    A front end to the database like would display the text in the form of characters, and you could choose your favourite transcription or specify a custom mapping. I have gone some way towards this with a database containing voyn_101 and new data of my own about white space and so on. I have not yet taken it to the level of the stroke nor defined mappings to EVA, but it is what I intend to do next.
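If it helps, the many-to-one mapping stage is straightforward to prototype as a lookup table keyed on stroke-plus-join sequences. The single rule below is the one from the example above; the data structures and names are invented for illustration.

```python
# Map stroke+join sequences to characters of a chosen transcription
# alphabet (EVA here); further rules would be added per target alphabet.
RULES = {
    ("full space", "c-4", "strong ligature", "i-2", "half space"): "a",
}

def map_strokes(strokes: tuple) -> str:
    """Look up a stroke+join sequence in the chosen mapping table."""
    return RULES.get(strokes, "?")   # '?' = no rule yet: leave for review

seq = ("full space", "c-4", "strong ligature", "i-2", "half space")
print(map_strokes(seq))  # -> 'a'
```

Swapping in a different RULES table would then regenerate EVA, v101 or a custom alphabet from the same underlying stroke data, which is the whole attraction of transcribing at stroke level.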

    As for fonts, the problem is that a) they are designed to map on to different transcription schemes and b) you have to download them. Diacritics have the merit that support for Unicode is now more or less universal and they would make it easier to agree standards: however, I am not wedded to the idea.

    Does any of this fit in with other people’s plans?
