A paper came out a few days ago on arXiv.org, called “Probing the statistical properties of unknown texts: application to the Voynich Manuscript”, written by three Brazilian academics (with an assist from two German academics).

The authors grouped Voynichese (i.e. Voynich text) hypotheses into three broad categories:

“(i) A sequence of words without a meaningful message;
(ii) a meaningful text written originally in an existing language which was coded (and possibly encrypted) in the Voynich alphabet; and
(iii) a meaningful text written in an unknown (possibly constructed) language.”

After developing a whole load of word-occurrence-based statistical machinery (defining “intermittency”, etc.) and applying it both to real text corpora and to Voynichese, they conclude that the word structure of Voynichese is incompatible with shuffled texts (which is how they model (i)-class hypotheses), and “mostly compatible with natural languages” (the (ii)- and (iii)-class hypotheses). They end by using their statistical machinery to suggest Voynichese “keywords” – words that, according to their statistical measures, stand out from the text.
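To give a rough feel for what a measure like “intermittency” is getting at, here is a minimal sketch. It assumes intermittency can be approximated by the coefficient of variation of the gaps between successive occurrences of a word; the paper's exact definition may well differ. The function names and the toy text are mine, not the authors'. A word that clusters in one passage scores well above what the same word scores once the text is shuffled.

```python
import random
import statistics


def intermittency(tokens, word):
    """Coefficient of variation of the gaps between occurrences of `word`.

    Assumption: this stands in for the paper's "intermittency"; the authors'
    exact definition (e.g. how text boundaries are handled) may differ.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    if len(positions) < 3:
        return None  # too few occurrences to say anything useful
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)


def shuffled_baseline(tokens, word, trials=200, seed=0):
    """Average intermittency of `word` over randomly shuffled copies."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        copy = tokens[:]
        rng.shuffle(copy)
        value = intermittency(copy, word)
        if value is not None:
            values.append(value)
    return statistics.mean(values)


# Toy text: "boat" is clustered in one passage, "the" is spread evenly.
tokens = []
for block in range(20):
    if block == 7:
        tokens.extend(["the", "boat", "b", "boat"] * 5)
    else:
        tokens.extend(["the", "a", "b", "c"] * 5)
tokens.append("boat")  # one far-away occurrence creates a single long gap

for word in ("boat", "the"):
    print(word, intermittency(tokens, word), shuffled_baseline(tokens, word))
```

In this toy case the clustered word sits well above its shuffled baseline and the evenly spread word sits below it, which is the kind of contrast with shuffled text that the authors report.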

Their suggested English keywords (generated from the New Testament) are:-
* begat Pilates talents loaves Herod tares vineyard shall boat demons ve pay sabbath hear whosoever

Their suggested Voynichese keywords (generated from an EVA transcription, though they don’t say which, so possibly Takahashi’s?):-
* cthy qokeedy shedy qokain chor lkaiin qol lchedy sho qokaiin olkeedy qokal qotain dchor otedy

OK, but… what do I think? First off, I’m pleased to see that their results seem incompatible with “shuffled texts” or randomized texts, because that is what nearly all of the various Voynich “hoax” hypotheses rely on. Intuitively, just about anyone who has worked with Voynichese for any period of time is struck by its intense internal structuring on many levels: so it is nice to see the same result coming out from a different angle.

Secondly, what they mean by “mostly compatible” is that while Voynichese passes many of their proposed tests comfortably, it actually fails some of them (and only passes others by the slimmest of whiskers). To me, that implies either (a) an exotically- (and non-obviously-)structured language or constructed language, or (b) an obfuscated language (e.g. a ciphertext or shorthand): conversely, it seems to imply that Voynichese isn’t a one-to-one map of any mainstream language (which is what cryptographers such as Elizebeth Friedman have been saying for years). Yet the earliest constructed language we currently know of was devised at least a century after the Voynich’s vellum dating (and about a century after its earliest marginalia), so we can almost certainly rule that possibility out.

I don’t know: while it’s always good to see people approaching the Voynich Manuscript from a new angle, I can’t help but feel that in just about every instance the Voynich’s author remains at least three or four steps ahead of them. The key paradox of Voynichese revolves around the fact that even though it so resembles a natural language, the way its words work as semantic units fails to do so in quite the same way. So for me, the important thing here is to try to understand the tests that failed, and see what they tell us about how Voynich words don’t work… but that will doubtless take a little time.

As for the suggested keywords: personally, I’d be rather more convinced by their statistical machinery if it had automagically suggested the word “Jesus” rather than “boat” or “vineyard” for the New Testament, so I have to say I’m far from persuaded that their list of Voynich cribs will help us unlock its secrets at all… but you never know, so perhaps let’s give them the benefit of the doubt on this one! 😉

26 thoughts on “Brazilian academics: Voynichese “incompatible with random texts”…

  1. If these researchers had found the key to the manuscript, I think it would make us more sad than happy, because in that case all the efforts of us Voynich fans would be cut short.
    I think the script could be written in two or three languages at least. In addition, it is possible that one of these languages is not widely studied.
    If the manuscript is a translation of a document by European explorers in South America, for example, the Quechua language is very different from the English language. First, it was not written, so maybe the author sought to invent an alphabet to transcribe it?
    This language has no “b” or “g” sounds, for example. The modern alphabet has the letters “k”, “p”, “q” and “t” in triplicate.
    Would this language be a better candidate for the Voynich?

  2. I’ve often wondered how many of these problems are due to a failure to recognise, and distinguish individual glyphs. I’d like to see retina-recognition technology applied to the script. I expect it would be so precise that every single glyph would be considered unique. Perhaps someone who makes bank-notes, or a forger of same is what we need here?

  3. Ps I think they might have included a fourth category, though perhaps no-one’s theory has required it as yet.

    a meaningful text written originally in an existing language, which was coded (and possibly encrypted) in another language, attempting to employ the same script as the original, this imitation being known as “the Voynich alphabet”.

  4. Ruby and Diane: yes, there’s a category (iv) the researchers missed – a (language A text) transcribed using a mismatched (language B alphabet). But if that’s right, we haven’t had much luck identifying either A or B so far. 🙁

    And in fact, there may be twenty more categories that are similarly possible (if you really put your mind to working them out)!

  5. bdid1dr on March 7, 2013 at 7:35 pm said:

    Nick, Diane, & Ruby (and ThomS, if he should be following this latest discussion):

    You can be burying my posts and letter-by-letter/word-for-word translations in the depths of “That Which Brings…..”, but I am not discouraged. I will soon be completing my notes and fully understandable reading of the fascinating Beinecke manuscript 408.

    So, I’ll let you continue your decryptologizing in peace. I notice that my spell-checker doesn’t recognize that long word I just typed; maybe I’ve invented a new system of cryptology?

    Gotta find my O 2 it B4 it gets buried in my toppling pile of notes!

    Hows that 4 a cryptic farewell? 🙂

  6. I based category (iv) on an ambiguity of Baresch’s letter. Come to think of it, the ambiguity might not exist in the Latin, but the English can surely be read in two ways.

    “… acquired the treasures of Egyptian medicine …brought them back with him and buried them in this book in the same script.”
    quoting Neal’s translation of the 1637 letter from B to K

  7. Jody on March 8, 2013 at 8:04 pm said:

    And so we go on… not knowing… already found out

  8. bdid1dr on March 9, 2013 at 5:55 pm said:

    An “O2-it” is a disk which has printed upon it “TUIT”. So, when someone has a big pile of chores, errands, correspondence to write, etc. (and is being nagged), the response is often “When I get around to it!” The person doing the nagging hands the naggee the “TUIT”.

    Sometimes the disc looks like this: 🙂

    But only if a yellow “smiley” appears here!

  9. bdid1dr
    If I understand correctly, you do not have a website or personal blog, and to follow your work we must search hundreds of posts on different sites ? This is regrettable; I’d love to read your translation in full.

  10. tricia on March 24, 2013 at 1:56 pm said:

    The next version will be even better. A Greek and Canadian helping now.

  11. I always have difficulty with the definition of ‘natural language’ since it seems to be based on an assumption of prose, or at the very best prose and formal poetry. (Is that right?) A mathematical text, or bills of lading, ledger entries and Ptolemy’s Tables would all fail by that criterion, surely?

    Wouldn’t they?

  12. Diane on March 30, 2013 at 5:42 pm said:

    the earliest constructed language we currently know of was devised at least a century after the Voynich’s vellum dating.

    Just wondering whether you mean that sixteenth-century mainland Europe is our first example of any constructed language? Would you call something like Hisperica famina a constructed language? (not that I think Voynichese is like H.f, but wonder that human beings had never before thought to construct one).

  13. Now another scientific proof for “Signs of language surface” is here:
    And the original paper here:

    Now, Crypto-guys please start to brood about…

  14. Joachim: if Voynichese is a natural language, then what these results (and numerous others) are saying is that it’s a natural language that has less information in a typical word (or indeed a typical letter instance) than other natural languages – that is, it is more predictable at a letter and word level than other natural languages.

    In the cipher world, we have a very old type of cipher that has many similar attributes – verbose cipher. The key thing about verbose cipher is that it disrupts the way we visually parse the text: we don’t see ABC, but “otolal” (i.e “ot-ol-al”).

    If Voynichese had been parsed differently, would these academics’ results have been the same? I suspect not.
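The parsing point can be illustrated with a toy verbose cipher. This is only a sketch: the glyph sets and the letter-to-pair table below are invented for the example and are not derived from any Voynich transcription. It measures plain per-symbol entropy, first glyph by glyph and then after re-parsing the ciphertext into two-glyph groups.

```python
import math
from collections import Counter

# Invented glyph sets (an assumption for illustration, not real EVA statistics).
FIRST, SECOND = "oacdsq", "tlrky"
TABLE = {chr(ord("a") + i): FIRST[i // 5] + SECOND[i % 5] for i in range(26)}


def encipher(plaintext):
    """Replace every letter with its two-glyph group; keep spaces as they are."""
    return "".join(TABLE.get(ch, ch) for ch in plaintext)


def entropy(tokens):
    """Shannon entropy in bits per token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def parse_pairs(ciphertext):
    """Re-parse the ciphertext into two-glyph groups (spaces stay single)."""
    tokens, i = [], 0
    while i < len(ciphertext):
        step = 1 if ciphertext[i] == " " else 2
        tokens.append(ciphertext[i:i + step])
        i += step
    return tokens


plain = "the quick brown fox jumps over the lazy dog"
cipher = encipher(plain)
print(cipher)
print("plaintext, per letter:      ", round(entropy(plain), 3))
print("ciphertext, per glyph:      ", round(entropy(cipher), 3))
print("ciphertext, parsed in pairs:", round(entropy(parse_pairs(cipher)), 3))
```

Read glyph by glyph, the ciphertext looks less informative per symbol than the plaintext; parsed into pairs, it carries exactly the plaintext's per-symbol entropy again. So, at least in this toy setting, the statistics you get depend on where you draw the token boundaries.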

  15. Nick: On this subject I am betraying here something in advance:
    We are working on a dictionary, Voynich- Arabic-English, a small excerpt here (not finalised):
    27 > .lTaih. (27*7 لذيح on the way
    28 > .ltan. (27*6) لتان tongue
    29 > .marh. (27*6) مرح fun
    30 > .oltarh. (27*8) الطرح a question
    31 > .oqwey. (27*7) اقوي stronger
    32 > .taly. (27*6) التالي next

  16. Diane on June 22, 2013 at 10:57 am said:

    not a natural outgrowth, then? ?

  17. Šuruppag on November 16, 2013 at 11:13 pm said:

    “Yet the earliest constructed language we currently know of was devised at least a century after the Voynich’s vellum dating, so we can almost certainly rule that possibility out.”

    What about Hildegard of Bingen’s Lingua Ignota, dating to the 12th century?

    It would seem very odd if VM was a constructed language, but I think we should still try to rule out the possibility. Has anyone tested to see if VM has linguistic patterns similar to highly artificial constructed languages such as Lojban?

  18. On the subject of academics from Brazil, I wonder who should be credited with information on a deepsky website’s paper. The paper has not been given a serious-sounding title: “A Manchu Skeleton Key to the Voynich Manuscript, Pangolin on folio 80 verso a.k.a. “A Tijuana Bible in Elvish”, or “Finnegans Wake in Enochian”

    However, I’ve just seen there some items which I didn’t realise any Voynich researcher before me had pointed out, and (as ever) I’m keen to acknowledge precedents.
    In this case, the credit is for an allusion to Katarina Viloni.

  19. SirHubert on October 28, 2016 at 5:39 pm said:

    I’ve recently read through this paper again, and still think that Nick’s summary is a very fair one.

    One point made which I think is interesting and worth exploring comes close to the end in the section on keywords (p. 9). The authors discuss ‘term frequency-inverse document frequency’ (tf-idf), which they summarize as a technique that assigns a high relevance to a word if it is frequent in the document under analysis but not in other documents of the collection.

    If I’ve understood this right, it means that if you’re a mechanic and have the service manuals for twenty different cars on your shelves, words like ‘windscreen’ and ‘engine’ will be common to all of them, while specifics like ‘Volkswagen’ and ‘Passat’ will appear frequently in one manual only. Tf-idf will therefore home in on ‘Volkswagen’ and ‘Passat’ because they appear with much greater frequency in this one manual, but won’t pick up ‘windscreen’ and ‘engine’ which appear in all twenty – even if they are in fact more common than ‘Volkswagen’ in the Passat service manual.

    The authors take the view that because the Voynich Manuscript is a single item rather than part of a set, tf-idf isn’t applicable. But I’m not sure this is entirely correct. Assuming that the text in the herbal section does relate to the illustrations, one would expect this to include the name of the plant being shown – and that shouldn’t then appear on other herbal pages on which another plant is being discussed. (I know this is hardly a new observation…) So if these assumptions are correct, there should be a unique word, group of characters or some other kind of pattern which occurs on each herbal page and nowhere else, barring the extremely unlikely possibility of a polyalphabetic cipher.

    Best of all, that shouldn’t depend on languages or encipherment techniques – we’re just looking for a word, or an enciphered word/phrase, that happens to mean ‘viola’ or whatever the particular flower happens to be.

    Has anyone tried?
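For anyone wanting to try, the service-manual example above can be sketched in a few lines. This uses the textbook tf-idf weighting (term frequency times log of inverse document frequency), which may not be the exact variant the paper discusses; the three toy “manuals” and the function names are made up for illustration. For the Voynich herbal section, each page would play the role of one manual.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {doc_name: {word: score}} using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each word once per document
    scores = {}
    for name, words in documents.items():
        counts = Counter(words)
        scores[name] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def top_keywords(scores, name, k=2):
    """The k highest-scoring words for one document."""
    ranked = sorted(scores[name].items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical toy corpus: one word list per service manual.
manuals = {
    "passat": "engine windscreen passat volkswagen engine passat engine".split(),
    "corolla": "engine windscreen corolla toyota engine corolla".split(),
    "fiesta": "engine windscreen fiesta ford engine fiesta".split(),
}

scores = tf_idf(manuals)
for name in manuals:
    print(name, top_keywords(scores, name))
```

Note that “engine” scores zero everywhere even though it is the most frequent word in every manual, because it appears in all of them. A plant name confined to a single herbal page should behave like “passat” here, provided the word boundaries in the transcription are the right ones.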

  20. Sir Hubert: this is essentially what Montemurro and Zanette did.

  21. SirHubert on October 31, 2016 at 6:39 pm said:

    Rene: thank you for the reminder. It’s a similar kind of approach, but I don’t think they are doing exactly what I had in mind – if I understand rightly, they’re looking at entire sections whereas I’m thinking in terms of working with individual pages within a given section, and not necessarily looking for single ‘words’. But this is helpful and thank you again.

  22. Mark Knowles on June 16, 2017 at 4:46 pm said:

    Nick: I have been struck for some time by the quantity of seeming repetition in the text which has naturally made me wonder about the presence of null characters or null text or verbose text and the like. Maybe someone has produced a measure of this kind of repetition already in which case please direct me to it if you may.

    The first thing that occurred to me is compression i.e. how small can one compress the text without loss of information.

    Obviously looking at the text:

    8%£ 8%£ 8%£ 8%£ 8%£ 8%£ 8%£ 8%£ 8%£ 8%£

    This could now be significantly compressed with losing information content. It is theoretically possible, but highly unlikely this could contain a lot of information dependent on the encipherment method.

    Maybe you know a much better method of estimating the informational content of the text as I haven’t looked into it in much detail. Obviously there are inevitable complications with this; however I think in principle it is something that one should be able to estimate.

    Obviously I personally would be most interested in results for labels rather than sentence text.
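One quick way to put a number on this is to run the text through a general-purpose compressor and compare sizes. The sketch below uses Python's standard `zlib` module; the repeated string is the one from the comment above, and the random comparison text is an invented stand-in. The ratio is only a rough upper bound on information content, not a measurement of it.

```python
import random
import zlib


def compression_ratio(text):
    """Compressed size divided by original size (lower means more redundant)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Highly repetitive text, as in the example above.
repetitive = "8%£ " * 250

# Comparison text: letters drawn at random (an assumption, purely for contrast).
rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "
noisy = "".join(rng.choice(alphabet) for _ in range(1000))

print("repetitive:", round(compression_ratio(repetitive), 3))
print("random:    ", round(compression_ratio(noisy), 3))
```

One caveat for labels in particular: very short strings compress badly whatever their content, because the compressor's fixed overhead dominates, so labels would probably need to be pooled (or measured some other way) before a ratio like this means much.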

  23. Mark Knowles on June 16, 2017 at 6:54 pm said:

    *without losing information content

  24. Mark Knowles on June 17, 2017 at 6:29 pm said:

    Obviously if one can estimate the amount of verbosity in the text this should help better understand the encipherment process. If for example the output deciphered text has roughly half the number of characters of the enciphered text, that would be extremely useful to know. I appreciate that evaluating the level of verbosity is likely to be significantly more difficult and varied than my presentation here implies; however, it seems to me to be something that could be done to produce a meaningful measure, or several distinct measures, which would be of practical value.

