As I see it, there are four foundational tasks that need to be done to wrangle Voynichese into a properly usable form:
* Task #1: Transcribing Voynichese text into a reliable computer-readable raw transcription e.g. EVA qokeedy
* Task #2: Parsing the raw transcription to determine Voynichese’s fundamental units (its tokens), e.g. [qo][k][ee][dy] (see the parsing sketch just after this list)
* Task #3: Clustering the pages / folios into groups where the text shares distinct features e.g. Currier A vs Currier B
* Task #4: Normalizing the clusters e.g. how A tokens / patterns map to B tokens / patterns, etc
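To make Task #2 a little more concrete, here is a minimal parsing sketch in Python: it simply takes the longest matching token first. The token inventory is purely illustrative, not a claim about what Voynichese’s real fundamental units are.

```python
# Minimal sketch of Task #2: greedy longest-match parsing of a raw EVA word into tokens.
# The token inventory below is illustrative only, not a claim about Voynichese's real units.
TOKENS = ["aiin", "qo", "ee", "dy", "ch", "sh", "k", "t", "o", "a", "y", "d"]

def parse_eva(word, tokens=TOKENS):
    """Split an EVA word into tokens, always taking the longest match first."""
    tokens = sorted(tokens, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for tok in tokens:
            if word.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
        else:                      # no token matched: fall back to the raw character
            out.append(word[i])
            i += 1
    return out

print(parse_eva("qokeedy"))        # ['qo', 'k', 'ee', 'dy']
```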
I plan to tackle these four areas in separate posts, to try to build up a substantive conversation on each topic in turn.
Takahashi’s EVA transcription
Rene Zandbergen points out that, of all the different “EVA” transcriptions that appear interleaved in the EVA interlinear file, “the only one that was really done in EVA was the one from Takeshi. He did not use the fully extended EVA, which was probably not yet available at that time. All other transcriptions have been translated from Currier, FSG etc to EVA.”
This is very true, and is the main reason why Takeshi Takahashi’s transcription is the one most researchers tend to use. Yet aside from not using extended EVA, there are a fair few idiosyncratic things Takeshi did that reduce its reliability, e.g. as Torsten Timm points out, “Takahashi reads sometimes ikh where other transcriptions read ckh”.
So the first thing to note is that the EVA interlinear transcription file’s interlinearity arguably doesn’t actually help us much at all. In fact, until such time as multiple genuinely EVA transcriptions get put in there, its interlinearity is more of an archaeological historical burden than something that gives researchers any kind of noticeable statistical gain.
What this suggests to me is that, given the high quality of the scans we now have, we really should be able to collectively determine a single ‘omega’ stroke transcription: and even where any ambiguity remains (see below), we really ought to be able to capture that ambiguity within the EVA 2.0 transcription itself.
EVA, Voyn-101, and NEVA
The Voyn-101 transcription used a glyph-based Voynichese transcription alphabet derived by the late Glen Claston, who invested an enormous amount of his time to produce a far more all-encompassing transcription style than EVA did. GC was convinced that many (apparently incidental) differences in the ways letter shapes were put on the page might encipher different meanings or tokens in the plaintext, and so ought to be captured in a transcription.
So in many ways we already have a better transcription, even if it is one very much tied to the glyph-based frame of reference that GC was convinced Voynichese used (he firmly believed in Leonell Strong’s attempted decryption).
Yet some aspects of Voynichese writing slipped through the holes in GC’s otherwise finely-meshed net, e.g. the scribal flourishes on word-final EVA n shapes, a feature that I flagged in Curse back in 2006. And I would be unsurprised if the same were to hold true for word-final -ir shapes.
All the same, GC’s work on v101 could very well be a better starting point for EVA 2.0 than Takeshi’s EVA. Philip Neal writes: “if people are interested in collaborating on a next generation transcription scheme, I think v101/NEVA could fairly easily be transformed into a fully stroke-based transcription which could serve as the starting point.”
EVA, spaces, and spatiality
For Philip Neal, one key aspect of Voynichese that EVA neglects is measurements of “the space above and below the characters – text above, blank space above etc.”
To which Rene adds that “for every character (or stroke) its coordinates need to be recorded separately”, for the reason that “we have a lot of data to do ‘language’ statistics, but no numerical data to do ‘hand’ statistics. This would, however, be solved by […having] the locations of all symbols recorded plus, of course their sizes. Where possible also slant angles.”
The issue of what constitutes a space (EVA .) or a half-space (EVA ,) has also not been properly defined. To get around this, Rene suggests that we should physically measure all spaces in our transcription and then use a software filter to transform that (perhaps relative to the size of the glyphs around it) into a space (or indeed half-space) as we think fit.
To which I’d point out that there are also many places where spaces and/or half-spaces seem suspect for other reasons. For example, it would not surprise me if spaces around many free-standing ‘or’ groups (such as the famous “space transposition” sequence “or or oro r”) are not actually spaces at all. So there may well be context-dependent space-recognition algorithms / filters that we would want to use.
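As a sketch of how such a measurement-based filter might work, here is the idea in Python; the threshold ratios are invented purely for illustration, not measured from the scans.

```python
# Sketch of Rene's proposed space filter: a measured gap width, taken relative to the
# size of the surrounding glyphs, gets turned into '.' (space), ',' (half-space) or
# nothing. The threshold ratios below are invented purely for illustration.
def classify_gap(gap_px, local_glyph_width_px, space_ratio=0.6, half_ratio=0.3):
    ratio = gap_px / local_glyph_width_px
    if ratio >= space_ratio:
        return "."    # full space
    if ratio >= half_ratio:
        return ","    # half space
    return ""         # no space at all

print(classify_gap(14, 20))   # '.'  (ratio 0.70)
print(classify_gap(8, 20))    # ','  (ratio 0.40)
print(classify_gap(3, 20))    # ''   (ratio 0.15)
```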
Though this at first sounds like a great deal of work to be contemplating, Rene is undaunted. To make it work, he thinks that “[a] number of basics should be agreed, including the use of a consistent ‘coordinate system’. Again, there is a solution by Jason Davies [i.e. voynichese.com], but I think that it should be based on the latest series of scans at the Beinecke (they are much flatter). My proposal would be to base it on the pixel coordinates.”
For me, even though a lot of this would be nice to have (and I will be very interested to see Philip’s analysis of tall gallows, long-tailed characters and space between lines), the #1 frustration about EVA is still the inconsistencies and problems of the raw transcription itself.
Though it would be good to find a way of redesigning EVA 2.0 to take these into account, perhaps it would be better to find a way to stage delivery of these features (hopefully via OCR!), just so we don’t end up designing something so complicated that it never actually gets done. 🙁
EVA and Neal Keys
One interesting (if arguably somewhat disconcerting) feature of Voynichese was pointed out by Philip Neal some years ago. He noted that where Voynichese words end in a gallows character, they almost always appear on the top line of a page (sometimes the top line of a paragraph). Moreover, these have a strong preference for being single-leg gallows (EVA p and EVA f), and for appearing in nearby pairs with a short, often anomalous-looking stretch of text between them. They also tend to occur about 2/3rds of the way across the line in which they fall.
Rather than call these “top-line-preferring-single-leg-gallows-preferring-2/3rd-along-the-top-line-preferring-anomalous-text-fragments”, I called these “Neal Keys”. This is a term that other researchers (particularly linguists) have objected to ever since, because it superficially sounds as though it presupposes a cryptographic mechanism. From my point of view, those same researchers didn’t object too loudly when cryptologist Prescott Currier called his Voynichese text clusters “languages”: so perhaps on balance we’re even, OK?
I only mention this because I think that EVA 2.0 ought to include a way of flagging likely Neal Keys, so that researchers can filter them in or out when they carry out their analyses.
EVA and ambiguity
As I discussed previously, one problem with EVA is that it doesn’t admit to any uncertainty: by which I mean that once a Voynichese word has been transcribed into EVA, it is (almost always) then assumed to be 100% correct by all the people and programmes that subsequently read it. Yet we now have good enough scans to be able to tell that this is simply not true, insofar as there are a good number of words that do not conform to EVA’s model for Voynichese text, and for which just about any transcription attempt will probably be unsatisfactory.
For example, the word at the start of the fourth line on f2r:
Here, the first part could possibly be “sh” or “sho”, while the second part could possibly be “aiidy” or “aiily”: in both cases, however, any transcriber attempting to reduce it to EVA would be far from certain.
Currently, the most honest way to transcribe this in EVA would be “sh*,aii*y” (where ‘*’ indicates “don’t know / illegible”). But this is an option that isn’t taken as often as it should be.
I suspect that in cases like this, EVA should be extended to try to capture the uncertainty. One possible way would be to include a percentage value that an alternate reading is correct. In this example, the EVA transcription could be “sh!{40%=o},aiid{40%=*}y”, where “!{40%=o}” would mean “the most likely reading is that there is no character there (i.e. ‘!’), but there is a 40% chance that the character should be ‘o'”.
For those cases where two or more EVA characters are involved (e.g. where there is ambiguity between EVA ch and EVA ee), the EVA string would instead look like “ee{30%=ch}”. And on those occasions where there is a choice between a single letter and a letter pair, this could be transcribed as “!e{30%=ch}”.
For me, the point about transcribing with ambiguity is that it allows people doing modelling experiments to filter out words that are ambiguous (i.e. by including a [discard words containing any ambiguous glyphs] check box). Whatever’s going on in those words, it would almost always be better to ignore them rather than to include them.
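As a sketch of how that check box might work in practice (assuming the {NN%=x} ambiguity markup proposed above, which is itself only a suggestion):

```python
import re

# Sketch of a "discard words containing any ambiguous glyphs" filter, assuming the
# proposed {NN%=x} ambiguity markup and '*' for illegible glyphs.
AMBIG = re.compile(r"\{\d+%=[^}]*\}|\*")

def is_ambiguous(word):
    """True if a transcribed word carries any ambiguity or illegibility marker."""
    return bool(AMBIG.search(word))

def most_likely_reading(word):
    """Drop the alternate readings, keeping only the primary transcription."""
    return AMBIG.sub("", word).replace("!", "")

words = ["sh!{40%=o},aiid{40%=*}y", "qokeedy", "sh*,aii*y"]
print([w for w in words if not is_ambiguous(w)])       # ['qokeedy']
print(most_likely_reading("sh!{40%=o},aiid{40%=*}y"))  # 'sh,aiidy'
```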
EVA and Metadata
Rene points out that the metadata “were added to the interlinear file, but this is indeed independent from EVA. It is part of the file format, and could equally be used in files using Currier, v101 etc.” So we shouldn’t confuse the usefulness of EVA with its metadata.
In many ways, though, what we would really like to have in the EVA metadata is some really definitive clustering information: though the pages are currently tagged A and B, there are (without any real doubt) numerous more finely-grained clusters that have yet to be determined in a completely rigorous and transparent (open-sourced) way. However, that is Task #3, which I hope to return to shortly.
In some ways, the kind of useful clustering I’m describing here is a kind of high-level “final transcription” feature, i.e. of how the transcription might well look much further down the line. So perhaps any talk of transcription metadata at that level is getting ahead of where we currently are.
How to deliver EVA 2.0?
Rene Zandbergen is in no doubt that EVA 2.0 should not be in an interlinear file, but in a shared online database. There is indeed a lot to be said for having a cloud database containing a definitive transcription that we all share, extend, mutually review, and write programmes to access (say, via RESTful commands).
It would be particularly good if the accessors to it included a large number of basic filtering options: by page, folio, quire, recto/verso, Currier language, [not] first words, [not] last words, [not] first lines, [not] labels, [not] key-like texts, [not] Neal Keys, regexps, and so forth – a bit like voynichese.com on steroids. 🙂
It would also be sensible if this included open-source (and peer-reviewed) code for calculating statistics – raw instance counts, post-parse statistics, per-section percentages, 1st and 2nd order entropy calculations, etc.
Many of these I built into my JavaScript Voynichese state machine from 2003: there, I wrote a simple script to convert the interlinear file into JavaScript (developers now would typically use JSON or I-JSON).
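As a sketch of what such a filtered accessor might look like, here is a toy version in Python. The per-word record layout (folio, Currier language, line, position, label flag, text) and the sample data are entirely hypothetical, not an existing schema or API.

```python
from dataclasses import dataclass

# Sketch of the kind of filtered access an EVA 2.0 database might expose. The record
# layout (folio, Currier language, line, position, label flag, text) is hypothetical.
@dataclass
class Word:
    folio: str        # e.g. "f2r"
    currier: str      # "A" or "B"
    line: int         # line number on the page
    pos: int          # word position within the line
    is_label: bool
    text: str         # EVA transcription of the word

def query(words, folio=None, currier=None, not_first_lines=False, not_labels=False):
    out = []
    for w in words:
        if folio and w.folio != folio:
            continue
        if currier and w.currier != currier:
            continue
        if not_first_lines and w.line == 1:
            continue
        if not_labels and w.is_label:
            continue
        out.append(w)
    return out

# Toy data, invented for the example:
corpus = [Word("f2r", "A", 1, 1, False, "kydainy"),
          Word("f2r", "A", 4, 1, False, "shoaiidy"),
          Word("f2r", "A", 4, 2, True,  "otoldy")]
print([w.text for w in query(corpus, folio="f2r", not_first_lines=True, not_labels=True)])
# ['shoaiidy']
```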
However, this brings into play the questions of boundaries (how far should this database go?), collaboration (who should make this database), methodology (what language or platform should it use?), and also of resources (who should pay for it?).
One of the strongest reasons for EVA’s success was its simplicity: and given the long (and complex) shopping list we appear to have, it’s very hard to see how EVA 2.0 will be able to compete with that. But perhaps we collectively have no choice now.
Nick. Eva 1. Eva 02. Eva O3. Eva 4. etc. is wrong. ( bad ).
Page 2r is very important. As I wrote to you. Root. 🙂
Everyone shows you how to write. The root is composed of 4 characters.
C,G,S,L.
All have a value of 3. ( C= 3, G = 3, S =3, L = 3 ).
The author shows you how to write. A numerological system, where each letter has its numeric value.
That’s why the author shows you at the beginning of the manuscript.
On page 1 there are translation instructions. ( In Czech ).
Hand code-breaking methods that I learned 60 years ago at Ft. Devens were fun, but not up to date. Today the NSA uses computerized screening techniques, a few of which are described at
https://static-theintercept-com.cdn.ampproject.org/c/s/static.theintercept.com/amp/nyu-accidentally-exposed-military-code-breaking-computer-project-to-entire-internet.html
Hi Nick – I published my Voynich API to GitHub under MIT license:
https://github.com/robinmackenzie/voynich-api.
The main intention is to present the original EVA interlinear as a resource that can be accessed over the web.
The main feature that is maybe new is the functionality that allows for retrieving text as a sequence of morphemes e.g. [d][a][iin] or [d][aiin] or [da][ii][n] etc in a consistent and repeatable fashion.
There’s a bunch of examples (http://www.datalunch.com/voynich) showing usage of the API to retrieve and analyse various sections of the interlinear. I recreated some of the ‘classic’ experiments e.g. word-length distribution and the Sukhotin vowel-identification algorithm.
Hope it is useful. I’d be keen to collaborate with people who are interested in taking this to the next level per your suggestions around NEVA / EVA 2.0 / voy101 etc.
Great post – lots of useful comments about moving this forward.
All the best,
Robin
Robin: very interesting, and well worth a post on its own, thanks!
Nick: Perhaps crowdfunding and git is possible?
[Insert lots of hand-waving and unsupported assumptions]
In overly simplified terms, what if every character were represented by number (more likely a large set of numbers) that described that particular character. Think a bit image to some resolution. Two characters that look a lot alike would have different, but similar representations.
Through some computer coding magic, you could quantify just how closely two characters resemble each other. You could have characters A, B, and C where B is X “distance” away from being the same as A, and Y “distance” away from being the same as C. It’s a way to derive the probabilities of (B == A) or (B == C) or (B != A or C).
Ideally, it would also be context specific. Which characters surround the character? Where is it in a line? How close are the characters before and after it? Where is it on the page? Which page? Etc., etc.
Essentially, it would be an electronic copy of the physical text. It would take a lot of the human error out of interpretation of the text.
This is all off the top of my head. It’s a very large and very difficult problem, hence all the hand-waving and qualifiers.
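A minimal sketch of the distance idea described above, with toy rasters standing in for real glyph images:

```python
# Sketch of the "distance between characters" idea above: each glyph becomes a small
# binary raster, and the distance is simply how many pixels differ. Toy data only.
def pixel_distance(a, b):
    """Count differing pixels between two equally-sized binary rasters."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

A = ["0110", "1001", "1111", "1001", "1001"]   # made-up glyph A
B = ["0110", "1001", "1011", "1001", "1001"]   # differs from A by one pixel
C = ["1110", "1000", "1110", "1000", "1110"]   # differs from A by many pixels

print(pixel_distance(A, B))   # small: B is probably the same character as A
print(pixel_distance(A, C))   # large: C is probably a different character
```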
I would like to share some Voynich fonts with others for non-profit purpose. It’s not professionally made, but it goes just fine with other usual Roman-type fonts in a text. Is there someone who can host those files so that anyone can download them to use?
Young Kim: in which format are these fonts?
I’m hosting TTF files for Eva and v101 at my web site, and there are WOFF files for both embedded as well (even if only the Eva one seems to work).
An important aspect of both files is the mapping between the fonts’ character shapes and ASCII values.
The voyn_101 version of the first word of 2r.4 is ãaI89, NEVA śöaudy. I envisage breaking this down into a transcription at the level of individual stroke with some such representation as
Ç)ĆDĊĨÌÎČQČÝ
with NEVA type diacritics which would be held in a data table looking roughly like this
Stroke | Stroke type | Variant | Glyph | Join
Ç      | C           | 94      | ś     | Contiguous +1
)      |             | 07      |       | Near contiguous
Ć      | C           | 12      | ö     | Contiguous
D      |             | 83      |       | Space: 1em
Ċ      | C           | 20      | a     | Half contiguous
Ĩ      | I           | 11      |       | Contiguous
Ì      | I           | …       | …     | …
Something like this could fairly easily be generated from NEVA and then voyn_101 and the other existing transcriptions could be reverse engineered from the database, or people could specify their own Voynich alphabet. If Rene’s ideas require a machine-readable text as starting point, then it could be designed and specified with his requirements in mind as well.
This is just a suggestion: I think a new transcription is desirable and that it should be both fine-grained and intuitive, but there is no point going ahead unless a number of us are all agreed on what is wanted.
Rene Zandbergen: There are two font families named Voynich Mesa and Voynich Symbol and each family has two types of fonts, WOFF and OTF. The character mapping is based on my own definition and does not follow the usual EVA or EVA-like encoding. That is not because I am not fond of them, but because I needed a normalized or print friendly encoding. By “normalized”, I mean that using Voynich characters in a text does not alter the format of text such as the line space and does not sacrifice its readability. To alleviate the gap between EVA and my encoding, I will provide a transcription of Voynich texts from Folio 1r to 67r2 in a MS Word file. There will be an image of keyboard layout and a sample text captured from my article.
If you can host them, how can I send them to you?
There is really no need for another new font or a new transcription. Although I seem to be alone in this, I think there are much better (Voynich) ideas to put energy into, such as cipher-related ideas.
Davidsch: that’s all very well, but how reliable can analyses be if they rely on the existing transcriptions?
The funny thing is: translation _should have_ worked with the existing transcriptions. Even Currier’s.
It did not, and there is no guarantee whatsoever, that a modern transcription will make this possible.
There is simply a need to have something more accurate, more independent of assumptions, capturing much more information, and allowing people to verify each other’s results.
Who can tell me how many characters there are in the MS?
How many words, or how many lines?
This sort of basic information is not there.
Some of these numbers depend on the transcription alphabet, but it should also be possible for people to make their own. Ideally in a way that is repeatable and verifiable.
Something that is desperately needed in numerical/statistical analysis is a way to describe strings of text in the MS that are “very similar”. That can’t easily be done with existing transcriptions and Unix “grep”.
That is just one example….
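As a sketch of what “very similar” could mean operationally (using Python’s standard-library difflib, with an invented word list):

```python
import difflib

# Sketch of "find strings that are very similar", which plain grep cannot do.
# The word list is invented for illustration.
words = ["daiin", "dain", "daiiin", "qokeedy", "qokedy", "chol"]

def similar_to(target, vocabulary, cutoff=0.8):
    """Return vocabulary entries whose similarity ratio to target is >= cutoff."""
    return difflib.get_close_matches(target, vocabulary, n=10, cutoff=cutoff)

print(similar_to("daiin", words))    # ['daiin', 'daiiin', 'dain']
print(similar_to("qokeedy", words))  # ['qokeedy', 'qokedy']
```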
Nick: Philip Neal is a really sharp guy, so I usually go along with what he says. He said that Glen Claston’s transcription is the best one and the one he usually uses.
Rene Zandbergen: The active set of Voynich characters I defined in my encoding consists of 27 letters. I think it covers over 99% of Voynich manuscript. I agree with others who think that just 27 characters are not enough to transcribe the whole manuscript, but I also don’t believe that it will nullify any attempts to understand the 99% of manuscript text without defining extra characters.
In my personal opinion, EVA or EVA-like encoding is misleading, since people show a tendency to get attracted to the transcribed outcome rather than focusing on the original Voynich text. It might have been rather different if the transcribed Voynich text had looked like gibberish instead of looking kind of readable.
In my transcription, the Voynich text looks like gibberish unless it is encoded in the right font. And my encoding is not based on any methodological or statistical theory. The mapping of Voynich characters to Roman alphabets is arbitrary.
Dear Nick! I believe that the ambiguity of EVA-identical VM graphemes is not only a reflection of their real differences (which, of course, any modification of EVA can hardly capture). For the most part, it is the trail of errors in the calligraphic gestures made by the scribes of the VM. These mistakes arise in turn from two circumstances: first, the VM was executed not only by the author but also by several other people (possibly friends, possibly assistants); second, those other people had not quite mastered the cryptographic technique developed by the author of the VM to modify standard Latin abbreviations. Accordingly, before their calligraphy reached normal automaticity, it failed, thereby exposing some of the standard elements from which the author of the VM compiled his graphemes.
In my analysis VM and the analysis of the principles of decoding you will be able to find some actual confirmation of this point of view (http://limanov.livejournal.com/8084.html).
I’ve experimented with the idea of a crowd-sourced transcription tool.
It would operate in two stages. In the first stage, participants would step through the manuscript and up/down vote transcribed words and word boundaries. In the second stage we’d submit and apply corrections for flagged words/boundaries.
Here’s an example of stage 1:
http://www.voynichese.com/transcriber
It opens the manuscript at a random location. You then use the spacebar to up-vote transcribed words/boundaries or x to down-vote.
The same process would be used to alter a transcription. E.g. transform EVA by replacing “sh” with a new character and crowd-source any corrections.
Job: I think it would be more useful to have different options to vote on, rather than a thumbs-up / thumbs-down. For example, one example came up (when I tried it just now) where there was a dubious space. There, the option would be to choose between space, comma, or nothing at all.
Job, I am very pleased to see your reply, indicating that you are interested in this problem. Voynichese.com is a tool appreciated by many people, and a step in the direction of an improved transcription of the MS.
The link proposes a way to interact with the MS and the transcription.
For me this is right, regardless whether a ‘crowd’ or a selected few, or a piece of software should make the decisions.
What has to be done first, in my very strong opinion, is to agree on:
– which images to use
– how to define the positions on each page
– how to store and allow retrieval of all information
Especially the third point is important. I know that you already implemented an answer for all three in voynichese.com, but for future collaboration on this issue, this should be done in a way that is documented, and acceptable for the users.
Rene: does it have to be an either-or choice between the two sets of scans? For almost all of the kind of purposes we’re talking about here, both are more than good enough: most of the difficult transcription decisions arise from ambiguity rather than from lack of clarity.
Young Kim: the point of EVA was to help a research community to talk, and I don’t obviously see why that’s such a bad thing. :-/
Nick Pelling: I am sorry if I gave that impression to anyone. It’s not what I meant. I agree that having a good transcription is very important thing to move forward.
Nick, for transcription purposes the two sets are mostly equivalent, and why not use both. However, for recording of the location of the text elements, and even more for the measuring of handwriting properties (character size, line spacing, slant angle) and all their variations, a standard should be introduced, and the flatter images from the latest digitisation are preferable.
I imagine the inaccessible images made by Siloé to be even superior.
My last post did not format well.
The main difficulty I have with EVA is the treatment of sequences like an, am, ain and ey eey chey. Broken down into strokes, they amount to cin, ciin, ciin; c, ccy, cccy etc, sometimes ligatured and sometimes not. What is more, EVA chooses to group the one set rightwards and the other leftwards. Obviously, the transcribers had to settle on one convention or another, but the result is that we do not really know how many instances of a and e there are in the text.
The solution I propose is a transcription at the level of the stroke, distinguishing all discernible variant strokes in the manner of Glen Claston and specifying the position of each stroke relative to its predecessor and successor. I would also like to see information on what lies above and below the stroke. This information would be held in the columns of a data table like this:
Stroke type | Stroke variant | Left ligature   | Right ligature
c           | c-3            | full space      | strong ligature
c           | c-10           | half space      | full space
y           | y-1            | strong ligature | half space
c           | c-4            | full space      | strong ligature
i           | i-2            | strong ligature | full space
i           | i-2            | weak ligature   | strong ligature
n           | n-3            | strong ligature | weak ligature
Given an analysis of this kind, one could specify many-to-one mappings from strokes to the character sets assumed by the existing transcriptions, e.g.
full space + c-4 + strong ligature + i-2 + half space -> a
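A minimal sketch of how such many-to-one rules could be applied mechanically; the rule table is illustrative only, and the stroke/join labels are simply those used in the example above.

```python
# Sketch of the many-to-one stroke-to-character mapping described above: a flat
# sequence of joins and stroke variants is greedily rewritten into characters of
# whichever alphabet you prefer. The rules below are illustrative only.
RULES = [
    (["full space", "c-4", "strong ligature", "i-2", "half space"], "a"),
    (["full space", "c-3", "half space"], "e"),
]

def map_strokes(seq, rules=RULES):
    """Rewrite a join/stroke sequence into characters, longest rule first."""
    rules = sorted(rules, key=lambda r: len(r[0]), reverse=True)
    out, i = [], 0
    while i < len(seq):
        for pattern, char in rules:
            if seq[i:i + len(pattern)] == pattern:
                out.append(char)
                i += len(pattern)
                break
        else:
            out.append("?")   # stroke run not covered by this particular mapping
            i += 1
    return "".join(out)

print(map_strokes(["full space", "c-4", "strong ligature", "i-2", "half space"]))  # 'a'
```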
A front end to the database like voynichese.com would display the text in the form of characters, and you could choose your favourite transcription or specify a custom mapping. I have gone some way towards this with a database containing voyn_101 and new data of my own about white space and so on. I have not yet taken it to the level of the stroke, nor defined mappings to EVA, but that is what I intend to do next.
As for fonts, the problem is that a) they are designed to map on to different transcription schemes and b) you have to download them. Diacritics have the merit that support for Unicode is now more or less universal and they would make it easier to agree standards: however, I am not wedded to the idea.
Does any of this fit in with other people’s plans?
Dear Philip.
apart from (important) details, this is very much in what I think is the right direction.
Whether or not to break characters down into strokes even further can be discussed. The capability for people to ‘build their own’ transcription alphabet for analysis purposes, especially in a way that could be repeated by others, is the way to go for me.
When we made Eva, we ‘recombined’ several strokes of the frogguy alphabet back into individual characters. As you may remember, Jacques had broken up the gallows into their halves. He also had ‘a’ and ‘y’ as whole characters.
The question of fonts has also plagued me a bit. Nowadays, they don’t need to be downloaded anymore, at least not manually. My web site has been set up to do it automatically using the WOFF format, but it does not seem to work for all browsers and after initial success, for no apparent reason it stopped working for v101 completely. It may be a copyright issue on the v101 font.
Nick wrote >>Davidsch: that’s all very well, but how reliable can analyses be if they rely on the existing transcriptions?
You are more of an expert than me in almost any field, but in my analysis it doesn’t matter if there are errors or bad transcriptions. In good statistical analysis there is room for errors, and the “average” will point you in a good direction, if there is one.
There is more blabla about the VMS than results, and this EVA 2.0 initiative feels to me like a new attempt to circle around the main issue.
Looking for alternative ways of solving the riddle is excellent, but making a nicer car with a better paint job does not bring us closer to the “design of the engine” of the car.
Davidsch: the difference between you and me is that I strongly suspect that a lot of good statistical analyses in the past have been defeated by problems inherent in the transcriptions they were using.
Rene
What are the “(important) details” and how far in the direction of the stroke do you think we should go? I bow to you about fonts.
I even wonder: if someone using EVA were approaching a solution, would he see it at all?
Many believe they can recognise their own, or a similar, language in the VM. This is quite possible, looking at the development of the Indo-European languages: about 1 billion possibilities.
I might have to publish my attempts to hope that maybe someone can read it.
Example:
Again and again I see in the VM the word “oror”. This is hard to find.
Or = much
Oror = much too much
Example:
“On the Moorgund, in the Maannischiin (who is Mauch schiint), is the Puir to the Maad (San Meejn). Zi Säggschän (Around Säggschi) wakes r schini Froiw us hertm Schlaaf. (Fischischtrri) chameleon, cheeses, and herbicides. Dernaa reisudsch (grächudsch) ds Early sushi (formerly ds Niächtrru). Schi trüchnd Milchkaffee and ändnd Aichnbrood dr zuä (Brood and Aichn drzuä).
Would you recognize that? That is German !
Philip, my main doubt is indeed whether to go further down into strokes, though I haven’t really thought about it much. Maybe it is just difficult, but could work out.
I am writing up a synthesis of all my ideas on this topic, which is taking a bit more time than I thought, simply because it needs to be organised properly.
My main concern is not that the existing transcriptions are not good enough (*). Perhaps it is more that people are not using them properly, or it is too complicated to use them properly.
My main drivers are more standardisation (conventions, formats) which will allow the creation of standard tools and easier access.
Just as an example, now one is almost forced to either work with the interlinear file (or one of its items), or with the GC transcription. Their formats and conventions are so different that it is not practical to switch between both.
The very useful “voynichese dot com” by default works with Eva and one particular text transcription.
Note (*):
Had the Voynich MS been a regular medieval text encrypted using a regular medieval cipher, already the first transcription (First Study Group) would have been good enough to crack it.
The fact that this has not happened is caused by the assumption (regular cipher), not any fault in the transcriptions. There is something very different happening. Having the best possible data to figure out exactly what it is, can only help. I’m afraid that most people are not yet too aware just how far from a regular medieval encryption of a regular medieval text this is….
@Nick: Doing analysis of the transcribed text, which we suspect is a cipher, without looking at the original paleographical text and its possibilities is a typical novice mistake. All successful cipher-cracking efforts have always been the result of the two together, at least.
@Peter: from where did you take that text?
Which seems to be identified as Wallisertiitsch, “Dialäkt va nu Tiitschschwiizer im Kanton Wallis und kheerut zär heeggschtalemannischu Dialäktgruppu” (i.e. the dialect of the German-speaking Swiss in the canton of Valais, belonging to the Highest Alemannic dialect group).
Davidsch: …which is what I’ve been saying since 2006, and which is the point of trying to fix up EVA.
So… what exactly are your objections to what we’re talking about, again?
Here’s a question to illustrate what I perceive as the main problem.
“How many characters are there in the MS?”
I don’t know the answer, but a clever way out of that could be: “it depends”.
Of course, it doesn’t really depend. We’re just unsure about the definition of “character”. So our best guess depends on our assumptions (e.g. the transcription alphabet).
But how would anyone go about making a good guess?
(My guess is somewhere between 60,000 and 180,000 and that is of course an awfully bad guess).
How complete is Takeshi’s transcription of the MS, could it be used to make the count?
Same question for GC’s transcription of the MS.
We don’t even know that.
I am not too concerned that we can’t say how many characters there are in the MS. I am concerned that we don’t even know how to find the answer.
I must confess that I’m almost clueless if it comes to programming, but I’m far from being clueless if it comes to medieval manuscripts. Therefore my main objections to EVA 2.0 may sound somewhat naive or even childish to the experts in computing and statistics here. So please be a little patient with me.
As far as I understand a revised EVA should help to get a better standardisation, better statistical analysis and – in the ideal case – unveil some plain text never seen before. All that based on a new or modified transcription of the given text.
And there we are: what kind of transcription, and who decides on the value or interpretation of single or combined letters, ligatures and abbreviations without knowing for sure what the underlying language is? From a strictly paleographic view these assignments are more or less easy to make, also for the gallows, which are much less unusual than they seem to be. But even a very skilled paleographer gets into trouble when he/she has to decide about ligatures and abbreviations without any language given. Most abbreviations are found in Latin texts, while most ligatures appear in Early New High German manuscripts. So if EVA 2.0 is designed to be neutral on the language question, then how can a “one size fits all” transcription be worked out? If the answer is that every user can assign every value he/she likes, then EVA 2.0 would only become a new and slightly better toy to play around with.
Coming to statistics, I still wonder how earlier statistical analysis of the VM was done. Comparing against modern languages is certainly not the point, but what about medieval texts? For example: letter frequencies of modern German are simply not comparable with those of medieval (and regionally very different) German texts. Is there any more detailed information available about the methods used in those earlier statistical studies?
And yes, Rene, you are right. The VM is really far from a regular medieval encryption of a regular medieval text. That’s why we all can’t leave it alone, I guess.
Rene,
one way to solve this problem would be to come to a commonly accepted definition of “character” (or graph, or whatever else). This is one of the main points I described in my post. A clear definition of character means in the first place distinguishing between a single character and its multiple combinations.
A most helpful instrument for paleographers is to make a so-called “graph inventory” (related to the given manuscript). Such an inventory does not only cover all graphs/characters appearing in the ms, but also allows one to compare or distinguish them. For professional paleographers this is a must, and more than that, it’s an indispensable basis for identifying a certain hand/scribe.
The sheer number of characters/graphs doesn’t say anything about the content and only depends on the length of the ms. A short poem encrypted in Voynichese wouldn’t make any difference, just as with, for example, the unsolved Dorabella Cipher. For the latter we can assume that the underlying language is most probably English, but in Elgar’s personal context of music and literature it could also stem from the libretto of an Italian opera. And this goes for the Voynich too. If we still don’t know how to find the answer, we should probably go back to the roots. Boring, I know…
@Davidsch
Yes, you are right, it is Walliserdeutsch.
Höchstaltdeutsch
Therefore it is very important for me to determine the place of origin first. I would be happy with a radius of 50 km.
That way one could at least narrow down the basic dialect.
Even then, there are still enough dialects left, and most of them do not even have any written record.
Charlotte: do you realize the extent of the debate over the transcription? People have suggested all manner of rules and schemes, and you can find counterexamples for every one of them.
The point of EVA wasn’t to produce a single perfect glyph transcription, but to produce a tolerably close stroke transcription that could be digitally transformed into people’s different schemes. The problem is that a number of the assumptions that were made with EVA now look slightly suspect, and so it’s time to revisit those, in case those assumptions have been accidentally confounding our analyses.
Rene Zandbergen: Regarding using v101 font on the browser.
I am not sure if that is the case. In my case, the font cannot be used by a publisher unless its Embedding Licensing Rights are set to Installable (no embedding restrictions).
Nick,
no, I’m sorry, I don’t realize the whole extent of the debate at all, because I’m relatively new to the field and cannot possibly know every suggestion people have made in the past.
On the other hand I understand your point very well, and if EVA now needs a revision, so let it be. The point I really don’t understand is why the general and obvious weakness should remain. From my point of view this weakness lies especially in that frame of tolerance that leads to speculation instead of reliable facts.
However, I’m eager to see what a new EVA will look like.
Nick, I think your blog swallowed one of my comments.
On the topic of which scans to use, my opinion is that we should use both. My plan was to have a button for toggling between the scans. I began mapping the new scans last year but was interrupted by other work.
The transcription tool exports the transcription-quality data to a public location and generates an overlay to track progress:
http://www.voynichese.com/#/lay:block-quality
I started with a simple up/down vote mechanism in order to address the ambiguity problem mentioned by Nick, however the same approach can be used to gather all kinds of metadata.
I will be releasing the transcription tool as an open-source project. At that point we could iterate on it and have more concrete discussions about what metadata can be captured and how it should be stored.
@Charlotte Auer
I agree with you absolutely what you write about single character and its multiple combinations.
This sheer number of different characters positively smells of combination.
The characters selected by the author can also be combined very well.
Job, this is excellent.
@Nick: All I am saying is that it’s a waste of time & energy, and it blurs the importance and focus, as I tried to explain with the example of the car which gets a new paint job but still gives no clue about its engine.
But, by creating yet another transcription it will of course increase the status of experts more: Scientia Potentia Est.
@Job: You have very nice ideas and software, but without sharing the scan sources the Git source is useless to others.
One problem is transcription mistakes. This is something we could fix without changing the transcription alphabet. An example is the word ‘daidy’ in line f27v.P.2. This word is also transcribed as ‘da dy’ or ‘daijy’.
Another problem is that the manuscript was written by hand. Therefore it is sometimes hard to distinguish between glyphs of similar shape like ‘o’ and ‘a’.
Sometimes a weird variant of a glyph exists only one or two times. Today’s transcriptions ignore this kind of weird glyph. An example is the ‘3’-glyph used in f10r.P.11 ‘3octhy’ and f10r.P.12 ‘3odaiin’. Should we handle weird glyphs as new glyphs or as variants of more common glyphs?
If you want to design a transcription alphabet you have to make some assumptions and if you want to transcribe the manuscript you have to interpret ambiguous glyphs and spaces.
Davidsch: you seem to be missing the two major points, which are (a) that we are precisely trying to use statistical analysis of the transcription(s) to help us work out what Voynichese’s engine is, and (b) that a good number of us suspect that specific technical issues with the transcriptions we use have been causing our analyses to produce inconclusive results. As a result, a number of us are collectively looking to contribute towards fixing those problems: but if you are suggesting something different that will achieve the same end, please feel free to say.
Davidsch: I would also like to hope that the same tools and methods we collectively put in place should make it significantly easier for everyone to perform their own tests and to double-check other people’s tests. As long as we manage to keep the limitations of the transcription visible and in people’s minds, I struggle to see how any of that could be classified as a bad thing.
In the ideal case, the transcription file (or database) distinguishes between many different characters. Currier and basic Eva have 26-36, but extended Eva and v101 have over 200. A new transcription could have even more.
For most straightforward analyses, where a smaller set of characters is assumed, a standard tool would be needed that allows one to group these characters into (e.g.) a-like, y-like, cc-like etc.
When Eva was defined, such a tool existed and was used a lot. It’s called BITRANS. Not having it anymore (it runs only in DOS) is a great loss.
Anyone who doesn’t like that the Eva transcription generates more or less pronounceable text could use it to redefine “daiin” into “8aw” (or whatever) consistently and work with that.
As long as everyone has the same tool, all this can be reproduced by anyone.
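A minimal sketch of what a BITRANS-style substitution pass could look like today; the substitution table is invented purely for illustration.

```python
import re

# Sketch of a BITRANS-like grouping/remapping pass: consistently rewrite an Eva
# transcription into another alphabet, longest substitution first. The table is
# invented purely for illustration.
SUBS = {"daiin": "8aw", "qo": "4o", "ch": "S", "sh": "Z"}

def remap(text, subs=SUBS):
    pattern = re.compile("|".join(sorted(map(re.escape, subs), key=len, reverse=True)))
    return pattern.sub(lambda m: subs[m.group(0)], text)

print(remap("qokeedy.daiin.chol"))   # '4okeedy.8aw.Sol'
```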
Davidsch, the scans are those posted/hosted by the Beinecke library.
Rene. And ants.
” daiin ” into ” 8aw ” ( or whatever ). 🙂 It is bad.
daiin = It is bad.
It is read correctly = 8 aiin. ( 8 is the number. Aiin = word ).
When you’re in Prague. Ask Rabbi what the word means.
I can help you right now. And write.
That word and number = Means the name.
@ Rene
Too bad you no longer have the BITRANS program, if you say it was really good.
It might well still run under a DOS emulator.
Let’s hope it is still somewhere in the cellar.
Yes, it would still run in DOSBOX, for example, but this is completely unpractical. When new standards are agreed, someone will surely be able to provide an equivalent tool quite quickly.
@Josef Zlatoděj Prof.
dain or daiin = give
giain or giaiin = go
lain or laiin = want
stuain or stuaiin = have to
and more
That is Rumantsch, so I need no rabbi, since I can ask 60,000 Engadiners.
This is a nice language, Celtic-Latin
La uolp d’era darchiau üna jada fomantada. Qua ha’la vis sün ün pin ün corv chi tegnea ün toc chaschöl in ses pical. Quai ma gustess, ha’la s’impissà, ed ha clomà al corv: «Cha bel cha tü esch! Scha tes chaunt es ischè bel sco tia apparentscha, lura esch tü il pü bel utschè da tots».
In all likelihood, the origin of the VM lies in the bilingual Tyrol. That, at least, is my suspicion.
For those who want to know what it means.
The fox was once more hungry. Then he saw a raven on a fir, holding a piece of cheese in his beak. That would taste good to me, he thought, and shouted to the raven, “How beautiful you are! If your singing is as beautiful as your appearance, then you are the most beautiful of all birds. “
@Nick: Making a new transcription for statistical clearance & visibility is very useful.
However, the amount of energy needed to make a good transcription and discuss it seems prohibitive to me, and secondly I believe consensus on analysis will only be achieved for the simplest methods. You also saw the endless discussions on very straightforward counts of letter groupings, which proves my point.
@Rene: That is exactly what I am afraid for: too many characters and too many symbols. Writing software for a transcription that imports a transcript like this is very difficult:
# Text around Brumbaugh’s clock also as scattered
{left of ‘clock’}opa???rsh[*|f][*|ch].dyckhosh[*|e].olchor{l/s|d}
{4:2}da[r|s]al*dy\
@Job: Of course I meant the letter mapping on those scans you made & that process therein.
For example, if I only want to analyse the 10th word on every page, that is currently impossible. Also, you have not included the Rosettes page, which is probably the most important page in the manuscript.
In general, please note that my comments are only meant as advice: making no progress is even worse.
@ Peter. I am sorry. But you’re wrong. In the manuscript you can not change letters as you would like.
The handwriting and key have a fixed order. For example: 8 aiin. 8 is a number, and it has the value of the letters ” F ” and ” P “. The word aiin is the name. ( Jewish code ).
Eliška writes in the manuscript. ( Of course she writes it in Czech ). All letters are numbers. All letters are divided into groups. From 1 to 8.
1 = a,i,j,q,y.
2 = b,r,k.
3 = c,g,s,l. Etc.
Peter see page 2 r. The beginning of the manuscript. The root is composed of the letters !!!! C. G. S. L.
Root is the foundation. The letter is the basic of the word.
That’s why Eliška shows you at the beginning of the manuscript. This should be understood by everyone who wants to work on translation.
Eliška also writes that she is using the Jewish code. She also writes that the text is written in Czech.
I’ve translated the whole manuscript. So I know.
@Josef Zlatoděj Prof.
You write: “I’ve translated the whole manuscript. So I know.”
If you have already translated the entire manuscript, why not let us share in your results and publish a page, or put one up on your homepage? It would be a miracle to me, and against all logic, if it were actually written in Czech.
1. All pictorial references refer to the southern Alps. The plants tell me the same thing.
2. Why are all the month names written in Vulgar Latin?
If there were a February in the VM, one would write Febre.
Febre = fever
3. Why do we have a German text on 3 pages? And in an Austrian dialect at that? Why not Czech?
4. On page 66r we have a German text.
” y den mus des ” translated = “y then need this”
The wonderful thing about this sentence is the “y”.
It means “and”. Thus the whole sentence is “and then need this”.
This writing peculiarity occurs only in the Tyrol, where “y” is used only for “and”.
See the Wikipedia article on the Ladin language (only in German) … search for the Y
https://de.wikipedia.org/wiki/Ladinische_Sprache
Now you may understand that I must see evidence before I can believe it was written in Czech.
All references to Ladin.
If the Voynich MS text is ever translated, will it be because of a thorough and logical deduction, based on transcription, analysis, process of elimination etc? or will it be a flash of insight while standing in front of the mirror, shaving?
What both options have in common is that they can only work if they explain *all* the unusual features of the text. Not understanding these features is a show-stopper for all would-be translators and decryptors.
Out of all these unusual features let me highlight 3 or 4.
1) The low text entropy.
It was first quantified by Bennett using his own transcription in his own alphabet. Later, many people did similar calculations in other alphabets, and Dennis Stallings wrote an online paper with some numerical results. Recently, Anton Alipov has redone some calculations, and did not get exactly the same results as those generated by Jacques Guy’s Monkey program.
I have seen people argue that the low entropy is probably the result of the use of the Eva alphabet.
So where are we exactly?
– Yes it does depend on the piece of text that is analysed
– Yes it even more depends on the transcription alphabet used
– No two people get the same result
How much of the anomalous value is inherent to the text, and how uncertain is this value? I dare say that the level of confidence is very low here, even though the low entropy is clearly a parameter of key importance.
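As a sketch of how these numbers could at least be made reproducible against a stated text and alphabet (using one common definition of second-order entropy as a conditional entropy; the sample string is just a stand-in):

```python
import math
from collections import Counter

# Sketch of 1st-order entropy and 2nd-order (conditional) entropy per character,
# so that entropy figures can be reproduced against a stated text and alphabet.
def h1(text):
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def h2(text):
    """Entropy of a character conditioned on the preceding character."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    n = len(text) - 1
    return -sum((c / n) * math.log2(c / firsts[a]) for (a, _), c in pairs.items())

sample = "qokeedyqokedyqokeeydaiin"      # stand-in for a real transcription
print(round(h1(sample), 3), round(h2(sample), 3))
```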
2) The close-to-binomial word length distribution.
This was found by Stolfi using some part of the interlinear transcription and the Eva alphabet. This metric is one of the least well-defined that one can have, because it depends strongly on the alphabet used, and is also affected by the very uncertain definition of word spaces in the various transcriptions.
So, when using another transcription file, and/or another transcription alphabet, how far from binomial will the word length distribution be?
Very uncertain again.
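A sketch of the comparison itself, so that anyone repeating it can state exactly what was counted; both the token list and the binomial fit are deliberately simplistic stand-ins.

```python
from collections import Counter
from math import comb

# Sketch of the word-length distribution test: histogram the token lengths and compare
# them against a binomial with a crudely fitted parameter. The token list is a stand-in
# for a real transcription.
tokens = "daiin chol chor qokeedy shol daiin dar qokedy okaiin shey".split()

lengths = Counter(len(t) for t in tokens)
n_max = max(lengths)
mean_len = sum(len(t) for t in tokens) / len(tokens)
p = mean_len / n_max                       # crude parameter estimate

for k in range(1, n_max + 1):
    observed = lengths.get(k, 0) / len(tokens)
    expected = comb(n_max, k) * p**k * (1 - p)**(n_max - k)
    print(f"length {k}: observed {observed:.2f}  binomial {expected:.2f}")
```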
3) The apparent lack of repeating word sequences.
In theory, this should be independent of the transcription alphabet used, but uncertain word spaces are still a problem. What’s worse here is that I am not aware of any reasonably quantitative analyses. It’s all qualitative, and does not take into account that the MS text:
– is likely to suffer from a lack of spelling rules
– probably has its share of writing/copying errors
– has something going on with the first word of each line, which likely interrupts ‘standard’ sequences.
So, in summary, of the three pieces of information about key anomalous features we (think we) have, two are based on different transcription alphabets, and all three are based on unknown (and probably different) sets of MS text.
None of it can be repeated.
We don’t know if any of the results is statistically significant.
There’s more: hidden Markov analysis shows that the text does not exhibit the usual exchange of vowels and consonants. This has been analysed by four people that I am aware of. There is some mention of the most recent analysis, by Reddy and Knight, in their paper.
This is based on the Currier alphabet.
The situation is fairly desperate.
No consistency, no standards, no way to check any result obtained by anyone.
I’ll come back to this…..
Rene: I think we’re not far from the point that we can clear many of these roadblocks very straightforwardly, so I’m a little more optimistic than you sound in this comment. But there’s a lot of ground to cover nonetheless…
Nick, indeed, it is something that could be done with a little concentrated effort from just a very few.
I’ve been making an inventory of the three main transcriptions that I am aware of:
– The interlinear file
– GC’s file
– The unpublished transcription of Gabriel and myself.
The differences in representation are *very* significant, but I’d hope to hear from Job and Philip on an initiative to align these. No use for me to push ahead alone, if someone else is perhaps doing something different in parallel.
Exactly! It is a matter of perception of the text, not of transcription.
Although for me personally it is not necessary, making another transcription is a great exercise for those who can no longer look at it from a fresh perspective.
Davidsch: perhaps we can compromise and agree that some relatively small technical changes to EVA could yield some big perceptual changes. 🙂
Rene,
your pessimistic conclusion: “The situation is fairly desperate.
No consistency, no standards, no way to check any result obtained by anyone.” hits the nail on the head.
To me the whole of Voynich research seems as deadlocked and rusted as a shipwreck on a sandbank. No new challenge to be seen so far. And it wouldn’t be of much help to paint the rusty old thing bright blue just to get it out of the sand. That is, merely sending EVA off for cosmetic surgery will not solve the internal problems of ageing.
Charlotte Auer: if EVA was as badly broken as you think it is, we wouldn’t be trying to fix it. And besides, the world has moved on since the late 1990s when EVA was conceived, we now have a wide range of technical things we can do to make it accessible to everybody, so there’s lots of reasons to be optimistic! 🙂
I am a little more optimistic. After all, you can already cross 4 continents out of 5 off the list. 🙂
Nick,
I never thought that EVA was broken because of “technical things” that can now be more or less easily renewed. What I’m referring to are the basics needed to get a proper transcription, and this has nothing to do with programming techniques, but with the paleographically correct assignment of values.
To take up Rene’s nice picture: I do find myself standing in front of a mirror. But shaving? Never ever! Is this perhaps the reason why the flash of insight will never reach me? So sad.
No, seriously , I am optimistic enough to follow the future of EVA with kind interest.
Peter: that’s cursed it, now we’re bound to get Voynich theories linking the manuscript with Australia (via obscure plants) and Antarctica (via Peri Reis conspiracy theories). :-/
Nick:
The Piri Reis map is obviously a nice thing. One sees the 2 rivers in the Gambia, and the 2 in Montevideo, even if he reached America after Amerigo Vespucci. But I have found no turban and no cooking bananas in the VM. 😉
The same goes for pre-astronautics. Erich von Däniken wrote a lot, but nevertheless I think we will solve the VM faster than he will find a UFO.
Nick: It is simply not possible to solve the ambiguity of the Voynich manuscript by adding more detail to the transcription. The problem is not EVA. It is possible to discuss all the problems with EVA and the images available.
For instance, the scribe didn’t erase erroneous glyphs. Maybe he didn’t care about errors, or maybe he was able to correct errors another way. See for instance the ‘sh’ in ‘lsheody’ in f105v.P.33, or the final ‘s’ in ‘sheeodees’ in f105v.P.26. It seems that the scribe added an additional quill stroke to change ‘ch’ into ‘sh’ or ‘e’ into ‘s’. If the scribe was able to correct something that looked like an error to him with an additional quill stroke, this is maybe worth investigating. Anyway, whether the scribe was correcting errors in a different way or simply didn’t care about them, the text of the manuscript doesn’t behave like other texts we know.
We shouldn’t blame EVA or a transcription if the Voynich manuscript does not fit our expectations. All we can do is to try our best to describe the manuscript and to look at it with an open mind.
Torsten: I completely accept that there are some (small) sections of the text that will almost certainly be ambiguous whatever transcription scheme is used, I had hoped that this would have been clear from my post. All I can say is that my focus is far less on those sections than on the other (much larger) sections where current transcriptions fail to match (or even to reproduce satisfactorily) what we see on the page.
And yes, I also completely agree with you that the text of the manuscript doesn’t behave like other texts we know. But you have to understand that your statistical conclusion about Voynichese’s inevitable ambiguity is inevitably strongly coloured by the transcription you use, which has many shortcomings you are already aware of. All I’m saying is (a) that I strongly suspect that much of the ambiguity that confounds your (and others’) statistical analyses arose from artefacts introduced by the transcription (both in EVA’s design and in Takahashi’s interpretation of the EVA scheme), and (b) that I’m happy to put the work in to do something about these.
If we collectively follow this road to produce a more nuanced, realistic and usable transcription, you would be free to use it or ignore it as you wish. But prejudging the results of that entire process would seem to be not the best contribution you could be making.
Nick: Each stroke in the manuscript is in some way unique. To transcribe means to remove all non-essential details and to describe only the essential details of repeated strokes or stroke groups. But as long as we don’t know what the important details are, we can only make assumptions while transcribing the text of the VMS.
Takahashi’s transcription is excellent and the best transcription we have today. In my eyes Takahashi has done a great job in not interpreting things into the manuscript. If he reads for instance ‘ikh’ or ‘cs’, there is always a good reason for doing so.
There is no harm in trying to improve Takahashi’s transcription or in trying to build a better transcription. But even with a better transcription it would be always necessary to check an image or to check the manuscript itself.
But the real problem is another one: We should try to learn from each other. But instead of listening to the arguments of each other most times we only defend our own view.
Torsten: Takahashi’s transcription is indeed very good (mostly). But I don’t believe it (and indeed EVA’s design) is strong enough to sustain strong conclusions.
If you are telling me that you think that people whose strength of conviction about their conclusions is not justified are part of the Voynich problem, I would agree: and would urge you to support initiatives that improve the quality both of the transcription and of the tools which help people easily perform tests on that transcription.
There are some issues with the transcription that cannot be solved in any really adequate manner. Take a look at the top of folio 34r for example.
How are the lines left and right of the plant to be aligned?
There is simply no way to say with certainty what the scribe intended here.
This has an impact for people who are interested in studying repeated text sequences, and they should be aware that this piece of the text is not reliable.
Rene: this is the kind of thing I mean when I say that there are ambiguities in the text that the transcription should not pretend to be representing in a certain, final way.
I think we have to try to be clear about what the point of “EVA 2.0” is: we already have EVA 1.0 to help us talk about Voynichese, so it’s not about communication per se any more.
For me, the points of EVA 2.0 should be:
* to be a living document that we can all contribute to – if I see a transcription error, I’d like to be able to fix it easily and quickly.
* to support analysis via REST interface or JSON etc, so that people can replicate and validate previous statistical experiments, and particularly to understand the assumptions that gave context to those results. The ability to share tests and to embed them into blog posts would be very useful indeed.
* to account for uncertainty and ambiguity better than just “*”
* to mark up variant shapes more clearly – 4, s loops, -n shapes, etc
* to mark up problematic areas more clearly, e.g. Neal Keys, wide gallows
Much as I understand Philip’s suggestion that we use Unicode for shape variants (specifically for printing and technical articles that require accurate reproduction), we perhaps need to also keep one foot in the ASCII camp, so that we don’t lose sight of EVA’s communication aspect.
Perhaps the ASCII rendering of / interface to EVA 2.0 should retain the same basic EVA shapes as before but append a number to each letter to represent the particular shape variant. For example, qoky would be the same as q1o1k1y1 (implicitly), but a round-headed q would give q2oky instead?
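To make the idea concrete, here is a minimal Python sketch of how such variant-digit annotations might be handled; this is only one possible reading of the proposal, and the function names are purely illustrative:

```python
import re

# Hypothetical helpers for the variant-digit idea sketched above.
# Assumption: each variant digit immediately follows its base EVA letter,
# and an unnumbered letter implicitly means variant 1.
def to_base_eva(annotated: str) -> str:
    """Strip shape-variant digits: 'q2oky' -> 'qoky', 'q1o1k1y1' -> 'qoky'."""
    return re.sub(r'([a-z])[0-9]', r'\1', annotated)

def expand_variants(annotated: str) -> str:
    """Make the implicit variant explicit: 'q2oky' -> 'q2o1k1y1'."""
    out, i = [], 0
    while i < len(annotated):
        ch = annotated[i]
        if i + 1 < len(annotated) and annotated[i + 1].isdigit():
            out.append(ch + annotated[i + 1])
            i += 2
        else:
            out.append(ch + '1')
            i += 1
    return ''.join(out)

assert to_base_eva('q2oky') == 'qoky'
assert expand_variants('q2oky') == 'q2o1k1y1'
```

The point of the round trip is that statistical tools which only care about base glyphs could simply strip the digits, while tools that care about shape variants would expand them.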
Nick, you write: "For example, qoky would be the same as q1o1k1y1 (implicitly), but a round-headed q would give q2oky instead?"
In any case it should be done so as to keep the possibility of confusion as low as possible: "q or 4" and other such cases.
I should also have the possibility of adding a new value for a glyph in the library.
Example: for the glyph combination "cPccc", several variants are possible:
“cP cc c” “cP c cc” “cPcc c”
The combination makes everything look different, but actually it is the same.
It only creates a lot of confusion for the reader.
We have already done this as children.
A nice example of 90% confusion: only 5 characters have been changed.
https://www.google.ch/search?q=Neuchatel+kryptogramm&source=lnms&tbm=isch&sa=X&ved=0ahUKEwji0bvLoKvUAhVLa1AKHYE5D8UQ_AUIBygC&biw=1366&bih=673#imgrc=35zAvg69P7UvKM:&spf=1496822034588
Sorry, I forgot:
Whether the "P" in the combination is now a "P1" or a "P2" is the same question as "q or 4".
This makes the reading much worse
Peter: OK, so what is your suggestion for how we can talk about different variant shapes in EVA?
Nick: Maybe you should read my posts a second time.
I once put this together for myself in Excel:
Each glyph gets a number: a1, the next a2, etc.
The whole VM text is then represented only in numbers.
I keep these numbers in a library. If I give, for example, the number a1 the value "em", then it appears in the text as "em".
Now I can look at the text. Perhaps it is "um" and not "em". Then I change the library entry, and the whole text is immediately updated in those places to "um".
A small change in the library, and I immediately have a new text variant.
I hope you understood me. This is Google Translate.
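If I have read the comment above correctly, the idea is essentially an editable lookup table between glyph identifiers and readings. A minimal sketch, with all identifiers and readings invented for illustration:

```python
# A rough sketch of the glyph-number library idea described above (my own
# interpretation of the comment, not a specification). The text is stored
# only as glyph identifiers; a separate, editable library maps each
# identifier to a reading. Changing one library entry re-renders everything.
glyph_library = {
    'a1': 'em',
    'a2': 'ch',
    'a3': 'o',
}

# The transcription itself never stores readings, only identifiers.
numbered_text = [['a1', 'a3', 'a2'], ['a2', 'a1']]

def render(text, library):
    return ' '.join(''.join(library[g] for g in word) for word in text)

print(render(numbered_text, glyph_library))   # 'emoch chem'
glyph_library['a1'] = 'um'                    # one change in the library...
print(render(numbered_text, glyph_library))   # ...changes it everywhere: 'umoch chum'
```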
Torsten: I do read your posts, but I share neither your conclusions nor your belief that your autocopying hypothesis explains what you think it does.
The Rugg Fallacy is to think that resembling something statistically is good enough, even in the face of everything else that differs. Autocopying is not significantly different from that, in my opinion.
Nick: No, you didn’t respond to my posts in this thread. The posts didn’t say anything about strong conclusions or about the autocopying hypothesis. The posts say something about the imperfection of transcriptions of the Voynich manuscript, and they say something about not listening to what others have to say.
Torsten: I read all your comments, replied to many of them, and also referred to them in follow-on posts, so I have no idea what you think I haven’t done (apart from not putting a week of my life on hold to review your autocopying hypothesis in depth, as if I was marking your dissertation).
Nick: Obviously you are not interested in what I have to say about transcriptions for the Voynich manuscript. This is a pity.
Torsten: I’ll say it once again – I have read all your comments, I have replied to many of them, and I have also referred to them in follow-on posts.
If that isn’t good enough for you, please feel free to look elsewhere.
Rene,
I am sorry, but I don’t see any problem with the alignment of f34r.
I see a text of 9 lines carefully flowing round the details of the previously made plant drawing. On the right, the last 6 lines are displaced somewhat downwards in order to follow the curve of the right blossom.
What the scribe intended here was simply to get the 9 text lines into a more or less orderly “closed” paragraph, as he did at the bottom of the same folio.
It would be very easy to bring the text into the right order and into an absolutely reliable form. At least to me there is no doubt about the alignment.
In case I got you completely wrong, please let me know.
Charlotte: to be honest, I see olaral probably miscopied as olaryl, so the next word is probably a y-gallows-initial word rather than a -y terminal word scrunched up. 🙂
Dear Charlotte,
it is highly unusual for the scribe to align the text so badly around a text intrusion by a drawing, regardless of the shape and size of the intrusion.
Add to that the fact that the lines on the right do line up rather well with the lines on the left, but one line further down.
Opinions are possible, but I don’t see how one can be certain.
Nick,
I thought that Rene was only concerned about the alignment that could lead to misinterpretation when using EVA.
I personally don’t care about olaral or olaryl, or y-gallows or whatsoever, because I don’t use EVA at all. I’m working with my own tools, for example a graph inventory linked to a database where each single or combined graph has its own datasheet with strict paleographical criteria and additional information like “known regional appearance in contemporary mss” or “typical ligature” etc., just to name a few. This database allows me to analyze, to compare, to distinguish, to include or exclude, and finally to make an assignment.
Works very smooth and nothing scrunches up.
Charlotte: there is no Voynich wheel that cannot be invented ten or more times over, so please feel free to tackle the problem however you think best.
Yes, Rene, it is very unusual indeed, but the VM isn’t a usual codex either. This kind of sloppiness – not only on f34r – has its own story to tell. To me it almost screams “time pressure”. But this is another impression or opinion that has nothing to do with the aim of giving EVA a “Jungbrunnen” (a fountain of youth). Thank you very much for your answer!
Nick,
up to now there has never been a Voynich wheel that turned in the right direction.
So why not invent another one, let’s say number 11 or 234? Another new approach to tackling the problem would have the same chance of winning the prize money as good old EVA would. And who would judge it? The unsuccessful “very few”?
One of the main problems is the habit of those “very few” of refusing new impulses as long as they don’t perfectly match their old theories. I don’t speak of the usual suspects; what I mean is that serious input from other disciplines should be taken into consideration as well.
Of course I’m feeling free to do it my way. Don’t worry!
Charlotte: cracking Voynichese will take palaeography, codicology, cryptanalysis, linguistics, and plenty of ingenuity. It’s all very well bearing the whole load on your shoulders, but there’s a lot to be said for collaboration too. 🙂
Just a general observation about transcriptions, and to caution against the idea of a single one being ‘better’ in all circumstances.
There’s a well-known Arabic-language newspaper whose name can be transliterated in at least three ways:
1) al-sharq al-awsat (supplying the short vowels which the Arabic alphabet lacks)
2) ash-sharq al-awsat (because when it precedes certain letters the ‘l’ in ‘al-‘ takes the value of the following consonant in spoken Arabic)
3) AL-SHRQ AL-AWST (with a small dot under the final T), which is a direct transliteration of the Arabic consonantal skeleton into English equivalents.
Which is best? It depends.
If you want to run an accurate count of the most common letters in a piece of written Arabic, you’d use option 3.
If you want to count the most common words in a piece of written Arabic, you’d use option 1. (Otherwise you’d confuse a word like kataba, ‘he wrote’, with kutub, ‘books’, which are different words and pronounced differently in spoken Arabic but written identically.)
If you wanted to tell a piece of voice recognition software what it would expect to hear when an Arabic speaker mentioned that newspaper, you’d use option 2.
The process of transcription isn’t neutral here, and it won’t be when you transcribe Voynichese either. For some studies, you’ll do better with a glyph based transcription. For others, a stroke-based one might be more useful. I’m really not sure you’re going to find a single transcription which is ideal for absolutely any kind of test that anyone is potentially going to run.
SirHubert: all we can do is work out the best set of practical compromises needed to turn Voynichese on the page into text on the screen, and then use that to try to build a transcription that is close enough to be (with luck) genuinely useful and a little revealing.
I have recently moved my Voynich pages to my own website philipneal.net/voynichsources with a small amount of new material including 1. the long-promised letter from Kircher to Moretus and 2. some proposals about transcription (towards the end of the main page).
‘NEVA proposal’ explains the mapping from voyn_101 to my own NEVA transcription scheme, including a more granular representation of the word separators than previous transcriptions provided.
‘NEVA proposed transcription’ is a simple translation from voyn_101 to the new scheme.
‘Proposed strokes-based transcription’ is the voyn_101 transcription broken down into sequences of strokes with unique identifiers for each variant. It is the output of a script which translates voyn_101 into tuples, one per stroke, with other relevant information such as the separators and white space. These tuples would be the raw material for a database. Clearly the whole thing is far too long and unintuitive to expose in a GUI: I give mappings to voyn_101 and NEVA glyphs, and it would not be too difficult to reverse engineer a mapping to EVA, which would probably be what was displayed on the main page of a project website. None of this is set in stone, and I welcome comments and suggestions.
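For readers who find it easier to picture such things concretely, here is a guess at what one row of a per-stroke database might look like; the field names and example values are purely illustrative, not Philip’s actual schema:

```python
from typing import NamedTuple

# Illustrative only: one row per stroke, carrying enough context (position,
# separators, parent glyphs in each scheme) to regenerate the existing texts.
class StrokeRow(NamedTuple):
    folio: str        # e.g. 'f34r'
    line: int         # line number within the folio
    position: int     # ordering of the stroke within the line
    stroke_id: str    # identifier for the stroke variant, e.g. 'c-curve.2'
    v101_glyph: str   # glyph this stroke belongs to in the voyn_101 text
    eva_glyph: str    # glyph it maps to in the Takahashi/EVA text
    separator: str    # '', 'half-space', 'space', 'paragraph-end', ...

row = StrokeRow('f34r', 1, 7, 'c-curve.2', 'c', 'e', 'half-space')
```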
@ Philip Neal
Here is something for your research. Now you have his death date.
Matteo Argenti (1562-1616) was cipher secretary of the papal curia under five popes from 1591 to 1605, and in this office he succeeded his uncle Giovanni Batista Argenti (1591), who had held it since 1585. After he was released from office, he wrote a 135-page handbook on cryptography, describing numerous encryption techniques developed by him and his uncle. Matteo introduced, among other things, the general substitution cipher given by the following encryption table: the letters of the (restricted) Latin alphabet are replaced by single digits or pairs of digits.
http://www.mathe.tu-freiberg.de/~hebisch/cafe/kryptographie/argenti.html
Thanks Philip. Following a note in the voynich ninja mailing list I updated all my links to your site.
I have been spending some time looking at transcription issues, and added some more details to the end of this page:
http://www.voynich.nu/transcr.html
This is very much work in progress.
To Rene. Your revised page has an excellent analysis of the issues involved and I think that what I propose would solve many of the problems you identify.
To everybody. After further experimentation I think I see my way to creating a database with one row per stroke and columns containing information from which one could generate the Claston text of the MS in the voyn_101 transcription scheme, the Takahashi text in the EVA scheme and the other texts in evmt.doc similarly. (Incidentally, the word ‘transcription’ is ambiguous. I distinguish between scheme = a transcription alphabet, like EVA, and text = a transcription of each page of the manuscript, like EVMT.)
The method involves:
1. Distinguishing as many stroke types as are required,
2. Assigning each stroke to its position in each text, and
3. Parsing groups of strokes into voyn_101 and EVA.
Pros and cons of each step.
1. Most of the stroke variants are distinctions without a difference. I already distinguish more stroke types than are required to represent distinct glyphs and more distinctions will be needed (for instance where Takahashi reads oin for ain). Most of these distinctions will merely be artefacts.
2. There is an implied ordering of the glyphs in each text. Sometimes this is arbitrary (e.g. the point at which we take the text of the rings to begin) but sometimes a real issue is involved, as Rene observed about matching up the split lines on f. 34r.
2a. One positive advantage of abstracting the ordering into metadata is that cases where transcribers have simply omitted material will no longer need an explicit representation like !! and *.
3. Any parsing of strokes into groups involves arbitrary decisions about the analysis of eeey, aiiin etc. I think there is consensus that this is so and that arbitrary decisions cannot be avoided.
So is it worth doing? I think it could be, but only as a stepping stone to greater simplicity. From this point on, collaboration would be essential. There is no point producing something that nobody intends to use.
I think most of us would agree about a huge number of cases where the Beinecke scans show that variants in the different texts are artefacts of the copyflo or reflect the limitations of the original transcription schemes. I also anticipate agreement about certain real distinctions not found in the existing texts – “Neal” keys, Nick’s variants of final n and so on.
If a number of us, the more the better, can agree which distinctions to throw out and which to retain, the result would be this.
1. A database of strokes giving variant forms actually found in the scans.
2. A default ordering of the strokes.
3. A default parsing of the strokes into glyphs. The glyphs would look as much like EVA as possible.
4. A default text of the manuscript. This would be exposed in a GUI interface to the database. The default and the variants might also be published as an interlinear file.
Anyone who disagreed with the defaults would be free to define a custom ordering, parsing and text.
If we arrived at this point, Nick’s task #1 would be completed and a preliminary solution of task #2 would fall out of it. We could then discuss tasks #2 #3 and #4 equipped with a reasonably objective representation of the text, the resources to improve on it and agreement within the disagreement about the issues remaining.
What do other people think?
Philip: whilst I agree with almost all of what you say, I’d add that a catalogue of different strokes would need to be relative to a given baseline character size reference (or else what might be a long stroke in Hand 2 might be only a medium stroke in Hand 1). This might lead to the awkward situation where we have different stroke pattern expectations for Hand 1 and Hand 2, a problem which I can’t remember being raised before.
I’m also not quite sure of the role you think the kind of connected strokes (e.g. overlapping EVA ee letters touching at the top or at the bottom) that bothered Glen Claston so much should play in a revised transcription.
Regardless, the most sensible and grounded first step would surely be to collate a set of letter shape variants for each of the glyphs, to form a kind of visual reference for the degree of variation we should be aiming to cover.
There’s also the issue of the overall level of “coverage” for which we should aim: that is, given that there are a fair number of “one-off” fancy gallows that would make no sense to try to reduce to component strokes, along with rarities such as ckhh / -ho (where the right-hand curve of the -h forms the left-hand edge of the o), etc, to what degree are we helping ourselves or hindering ourselves by trying to cover just about every shape?
Here, the first step towards this secondary goal would seem to be collating a catalogue of difficult (rather than ambiguous) multi-stroke shapes that will almost necessarily stretch our ability to transcribe them. Are there 100 of these? Or perhaps 200, or even more? I’ve never tried to physically count or collect these, but perhaps doing so would be a good idea.
Philip,
I have been working for a long time in this direction. I have a handwritten table where all kinds of strokes and their variants are described (in my opinion). I can make a PDF file including this table and all my other observations, arguments, and assumptions about how the Voynich alphabet was constructed. But I do not have my own site or blog where I can post it. If you advise me how this can be done, I can start preparing it and publish it ASAP.
Milen: if you would like your table, commentary, and conclusions to appear as a blog post here, that would be fine.
Thanks Nick.
I’ll be ready in about two weeks.
I take it that EVA is a framework consisting of
– the EVA transcription scheme
– the EVA data structure: the interlinear file and metadata
– the EVA tools, now approaching obsolescence
– the EVA text (as in Masoretic text): essentially Takahashi plus variants
What one wants is a new framework supporting a better “Masoretic text” in an EVA-like transcription scheme which would become the standard for statistical work.
My starting point is the problem of integrating the Claston text with EVA. I am now satisfied that the thing could be done on the lines I have indicated. The basic idea is to replace the EVA data structure with a fine-grained strokes-based representation of the text, plus metadata, and use it to generate the existing glyph-based texts, at the least Claston and Takahashi and perhaps the others too. I believe that something of this sort could be the basis for EVA 2.0 both as a framework and a “Masoretic text”.
I agree about the first step, defining coverage in terms of a reference set of glyphs, but I would put the matter slightly differently. For each stroke variant required to generate the Claston and Takahashi texts from a single underlying dataset, one would identify the glyph variant in which it occurred and include that glyph in the reference set with data including its first occurrence in the manuscript. With this as a baseline, we would work towards a new agreed set of glyphs and strokes. The other texts from FSG on might also be used.
The details of this are, of course, subject to discussion. I would want to start by defining a maximal set of strokes and glyphs and then ruthlessly prune it of redundant distinctions and obvious mistakes. Real distinctions not previously recorded (“Neal” keys, -n with flourish) might also be added at this stage and I shall be interested to see Milen’s table of strokes and variants.
1. I say “strokes”, but I actually envisage a data structure of triplets: (left-join : stroke : right-join) – this is where I would tackle Claston’s problem about ee connecting with ee (see the sketch after this list).
2. I would retain phenomena like fancy gallows and ckhh in the underlying strokes-based data structure but probably exclude them from the glyph-based text and web interface which we would generate from a database and offer to the wider Voynich community as the replacement for EVA. For those who wanted them, there would be tools with which you could drill down and create your own custom text.
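A minimal sketch of the triplet structure mentioned in point 1 above, as I read it (the join labels and stroke identifiers are invented, not part of any agreed scheme):

```python
from typing import NamedTuple, Optional

# Illustrative (left-join : stroke : right-join) triplets. The joins record
# whether and where a stroke connects to its neighbour, which is where
# information like "these two e-strokes touch at the top" would live.
class StrokeTriplet(NamedTuple):
    left_join: Optional[str]   # None, 'top', 'bottom', 'ligature', ...
    stroke: str                # stroke-variant identifier, e.g. 'e.1'
    right_join: Optional[str]

# Two EVA 'e' strokes connected at the top, as in the 'ee' case:
ee = [
    StrokeTriplet(None, 'e.1', 'top'),
    StrokeTriplet('top', 'e.1', None),
]
```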
But if EVA 2.0 is going to happen, two things are essential: an agreed plan and a convincing consortium of participants and supporters. Ideally, all those involved in the original EVA should be consulted and persuaded of the need for a new version to be called EVA 2.0. Then those who intend to be active contributors would agree on a design for the framework and a plan for creating it. I would certainly want to be involved.
Here it is important to remind the ants: text analysis is very important. I’m also very glad Rene is finally learning. Finally he is starting to write about the numbers.
( I’ve been writing to you for many years ). 🙂
Ants. Your entire alphabet is bad. In the whole manuscript, for example, ee is not written anywhere. You ants can not even identify the characters of the letters.
Learning is going very slowly.
I can also write to you that no Russian will translate the handwriting.
Because it is written and encoded in the Czech language.
I see several points that require improvement, and the one that I am now concentrating on is the framework. With this I mean the method to represent all transcriptions in a consistent way.
I fully agree that the best way forward is having several key ‘actors’ agreeing and collaborating on a good way forward. This is exactly the way things happen in international collaboration projects in the scientific and engineering world.
At the same time, in the Voynich ‘amateur’ world, progress has mostly been achieved by significant personal efforts that produced something that was clearly recognised as something useful. Only in the very early days of the Voynich MS mailing list, there was considerable collaboration between several active members.
I very much welcome further discussion and collaboration.
I am not interested in collaboration in order to pursue this or that theory.
I am interested in creating improved data, and standards to represent it.
Ant. Quality data is very important. I give you that. So watch carefully what I write to you.
The Yale letter. It is written in it.
Concerninq the Cipher MS.
( Jewish substitution. It’s loaded in numbers ).
So it is very clearly written here !
Czech kniha the Cipher 1,2,3. ( MS = 1,2,3 ).
Michael and Ethel write here.
Czech language : Czech kniha je šifrována čísly.
English language : Czech book the cipher 1,2,3.
So I can write you one important piece of information. No academic or engineer from the US, Britain, Holland, Russia, Japan, etc. will translate the manuscript.
The only ant that could possibly understand is Hurych.
Rene. 🙂 One can tell you: when Eliška began to write the manuscript, she was 12 years old. 🙂
@Josef Zlatoděj Prof.
The Cipher MS, (MS = manuscript)
This was the name Voynich gave the book, before Yale renamed it the Voynich MS after his death.
The MS is simply an MS, and not 1,2,3.
Sorry, but you just have too much imagination.
@ Peter. Are you sorry ?
Look at the letter, it’s Yale. Letter M. The sign of the letter M. ( Mark of M ).
The character is written from two parts. ( I + 2 ). 🙂
Number 2 is written in reverse. About 180 degrees.
You have to see it. Now learn. ( Jewish substitution ).
I = 1. 2 = 2. S = 3. = ( Jewish numerological system ).
1 = a, i, j, q, y.
2 = b, r, k.
3 = c, g, s, l.
The whole system is based on 8 numbers. And it will cover the entire alphabet.
( otherwise, I wrote it on the blog ). 🙂
What I write to you. Of course, Eliška writes in the manuscript. Eliška writes : The cipher is based on numbers. Sion numbers. And I write in Czech.
That’s how Eliška ( Elizabeth ) of Rosenberg writes.
@Josef Zlatoděj Prof.
I was on your website and have viewed it and read it.
1. The letter was written by Mrs Voynich after the death of Mr Voynich, with the note to open it only after her death.
2. The said M also occurs in other places in the letter. It is purely a feature of the handwriting. I read the letter.
3. The detail you describe about the M appears on the envelope. Who would be so stupid, if it were supposed to be written in the letter?
4. To make the word “concerning” mean “Czech cipher” is the summit of imagination.
Furthermore, I maintain that if one looks at the VMS more closely, one notices that most words (strings) are quite short.
Your theory, with a numerical decoding method you described,
1 = a, i, j, q, y.
2 = b, r, k.
3 = c, g, s, l.
would shorten the words by half. Most words would be 1-4 letters long. Very unpredictable.
This logic comes closest to a combination technique.
1 symbol = 1-3 characters.
I described this already in 2013 with Klaus Schmeh. I think you have misunderstood my 1-3 technique.
@ Peter.
1. On the cover of the letter, instructions for the translation are written. It is the same as on page 116.
2. Jewish numerological system : 1= a,i,j,q,y. 2 = b,r,k. 3 = c,g,s,l. 4 = d,m,t. 5 = e,h,n. 6 = u,v,w,x. 7= o,z. 8 = f,p.
3. Now watch :
Concerninq.
conce rninq.
czech kniha.
4. The whole letter is encrypted. Ethel studied Slavic languages.
5. I know Klaus says
6. I know what you write.
7. The German does not find a manuscript.
8 = 🙂
I wonder how far the new EVA program can help me at all, if I look at the historical sequence of events.
I will just take Tyrol / South Tyrol as an example.
Around the 12th century the area saw high immigration due to the copper and silver mining; the localities were bigger than today.
After the plague epidemic (1360), the majority of the newcomers came from what is today Slovakia (southern Habsburg lands).
At the same time, the Jews were expelled from Venice, Vienna, Basel, etc., and the Alps were a kind of refuge.
There is also still the original Alemannic, and Celtic-Latin (Vulgar Latin).
How far did the linguistic and literary cultures intermingle?
It is like a vegetable soup after the blender.
If I look at this more closely, I can well understand that many people recognize their own language.
I also do not believe that the cipher is the hard part, but rather the language.
A rare document: Slavic (Alpine Slavic)
https://en.wikipedia.org/wiki/Freising_manuscripts
A couple thoughts on this:
1) If you’re going to pursue an agenda involving a stroke-based transcription, I’d suggest that rather than “moving towards ‘EVA 2.0′”, one consider “moving towards Frogguy 2.0”. I have *never* (ever, ever, ever…) understood the rationale for prioritizing pronounceability in assigning ASCII equivalents in a transcription alphabet. I’m sure that with practice folk become proficient at working with EVA, but one could say the same thing about the QWERTY keyboard — that doesn’t make it a good design. Is there really a good case for representing Currier ‘4OFESCC9’ as ‘qoklcheey’ rather than as ‘4olpxctcc9’? Shouldn’t a transcription prioritize visual resemblance over pronounceability?
2) I want to vigorously agree with Nick’s use of Voyn-101 to illustrate the point that trying to aggressively capture variants/weirdos does not imply/entail a stroke-based transcription.
3) Nick says, “Yet some aspects of Voynichese writing slipped through the
holes in GC’s otherwise finely-meshed net, e.g. the scribal flourishes on
word-final EVA n shapes, a feature that I flagged in Curse back in 2006. And
I would be unsurprised if the same were to hold true for word-final -ir
shapes.” On the one hand, I get that transcription is time consuming and
labor intensive, and there are limits to our ability to judge what may or
may not be important a-priori. On the other hand, I think there is an almost
certain futility in chasing hypotheses that involve assigning functional importance to subtle variants/weirdos because of the accuracy requirements that imposes on the original scribe(s) as well as the transcribers.
4) To what extent should derivation of a transcription alphabet be informed by paleographic expertise?
Just my $0.02…
Karl
Hello Nick, I had promised to do a PDF file on this topic and I’m ready.
This is the link to my Google Drive:
https://drive.google.com/file/d/0B_T290o0O-dPSFJlRGtXdmxsZDg/view?usp=sharing
Best regards,
Milen
I wasn’t sure what post to make this a comment to, but this seems as good a choice as any…
I don’t think I’ve ever seen a mapping between Voynich 101 and other transcription alphabets (Rene’s current page http://www.voynich.nu/transcr.html shows the images that accompany the transcription, but doesn’t give an explicit mapping). For my own purposes I’ve been developing a mapping between Voyn 101 and Currier, and thought I’d share what I have to date in case others might find it useful.
Because my preferred tool set is Unix-based (grep, sed, awk, etc.), I had to introduce the non-canonical character ‘~’ into the Voyn 101 character set to replace non-ASCII characters in the transcription — these account for 0.215% of non-whitespace characters. This leaves 73 non-whitespace characters (in order of decreasing frequency; ‘N’ crosses into the 99th percentile):
o9ac1e8hyk4m2C7snpKgHAj3z5(~difuM*Z%NJ6xI?+WGYE!tSP&lF#LUbTQ|w$V\XrRqDv^B
I’ve mapped 53 of those into Currier:
* Single Voyn 101 characters to multiple Currier characters:
d -> CCC
l (lower case ‘L’) -> XC
r -> YC
v -> CF
C -> CC
I (capital ‘I’) -> III
* Multiple Voyn 101 characters to single Currier characters:
Ie -> H
Iy -> U
(again, capital I’s on the left-hand sides there)
* Single Voyn 101 character to single Currier character (I’ve included a non-canonical character ‘.’ in Currier to represent the half-space in Voyn 101):
Voyn101: abcefghijkmnopqstuwxyz~?.,-=918427KHA35(M*Z%NJ+WGY!
Currier: A2CEVBFIBPMNOJK22V#RRT**/.-#9S84Z8QXAZZ9M6UZDWZ*WRZ
* Unmapped ASCII Voyn 101 characters: 6ESP&F#LUTQ|$V\XRD^B
I approached this from two overlapping directions: 1) listing the Voyn 101 characters in order of decreasing frequency, and using an interlinear of the first few folios from the D’Imperio transcription with the Voyn 101 transcription to figure out matches until I was out into the 99th percentile, and 2) visual inspection of the key image for the lower case Voyn 101 characters. Apologies for errors due to typos in my notes, mistakes in the D’Imperio transcription, etc.
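For anyone who prefers to experiment outside sed, here is a rough Python equivalent of the conversion logic described above. It is a sketch, not Karl’s actual command file, and it uses only a tiny toy subset of the mapping tables given in this comment; the key point is that multi-character rules must be applied before single-character ones, exactly as with ordered sed rules:

```python
# Toy subset of the Voyn 101 -> Currier mapping described above (illustrative).
MULTI = {'Ie': 'H', 'Iy': 'U'}                                 # multiple v101 chars -> one Currier char
SINGLE = {'d': 'CCC', 'l': 'XC', 'o': 'O', '9': '9', 'a': 'A', 'm': 'M'}

def v101_to_currier(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        two = text[i:i + 2]
        if two in MULTI:                  # longest match first
            out.append(MULTI[two]); i += 2
        else:
            out.append(SINGLE.get(text[i], text[i])); i += 1   # pass unknowns through
    return ''.join(out)

print(v101_to_currier('odam'))   # 'OCCCAM' under this toy subset
```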
As I said above, I hope there are others out there who will find the above useful. Corrections and extensions for the unmapped characters welcomed and encouraged.
Karl
Hi Karl,
I have a vague recollection of setting up a complete table of all known alphabets on that page at my web site, but I seem to have changed my mind and now have it in separate tables.
GC stated on several occasions that he would never agree to anyone converting his transcription to Eva. Now it is clear that such a conversion / translation isn’t possible anyway, without changing it, since there is no 1-to-1 mapping.
For my own use I have a table that maps almost all v101 characters to what I would call ‘nearest Eva’. This is a lossy transformation, but it helps me to ‘read’ the v101 file.
I haven’t published this, but I could.
What could be done, and wouldn’t seem to violate GC’s wish, is to create a superset of v101 and extended Eva, of which the two alphabets are then subsets.
I’m about to publish my own old transcription of 1999. For this I needed a modified transcription file format, because I want the same format to be able to represent all important files. I’ve made very good progress there, but it needs a few more weeks.
Once this is available, the suggestion I made above could make sense. The two files could also be represented in a consistent alphabet, but of course they would still show differences.
There will be much more about this in the coming weeks/months.
Rene,
Additional transcriptions are always useful, so I look forward to seeing the release of your transcription.
Glen really, really, really had serious heartburn with stroke-based transcription schemes, didn’t he…(although unless he put some explicit GPL/copyright terms on the text of the transcription [or he has the ability to haunt you], I’m not sure how translation into EVA could be stopped).
To get an initial handle on how Voyn 101 compares to the D’Imperio, I ran my conversion script on f75 and f76 and manually created an interlinear file, which I then compared using an awk program I threw together. A couple caveats:
* Converting Voyn 101 to Currier is lossy (plume details on Currier Z’s; “cC”, “Cc”, and “ccc” all map to “CCC” — that potentially is a significant distinction in terms of Bio B “word” morphology; etc.).
* The Biological B “dialect” has much less weirdo glyph cruft than Herbal A
* I ignored white-space characters for the purpose of the comparison
* I didn’t do any domain-specific pattern matching in errors, so differences of opinion regarding the existence of a ligature generates multiple errors (if D’Imperio reads “CC” and Voyn 101 has “S”, that’s counted as two errors [a mismatch and an extra character]; similarly, a ligature with a gallows generates 3 errors)
* It’s marginally possible some errors may be due to my conversion script
Based on that two-folio sample, the rate of disagreement between the D’Imperio and Voynich 101 converted to Currier is ~4.3%. Converting the Takahashi transcription from EVA to Currier is in my queue — when I’ve got that done, I’ll throw that into the comparison as well.
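For readers who want to reproduce this kind of figure, here is a sketch of the counting logic (my own reconstruction, not Karl’s awk program) using Python’s standard alignment machinery; the two toy lines are invented, and the counting convention matches the caveats above, so a ligature read as one glyph in one text and two in the other produces multiple errors:

```python
import difflib

def disagreements(line_a: str, line_b: str) -> int:
    # Count substituted, inserted, and deleted characters between two
    # space-stripped, manually-aligned transcription lines.
    a = line_a.replace(' ', '')
    b = line_b.replace(' ', '')
    errors = 0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op != 'equal':
            errors += max(i2 - i1, j2 - j1)
    return errors

# Toy example of a disagreement rate over a pair of aligned texts:
dimperio = ['SOE OR ZC9', '4OFCC89']
v101_as_currier = ['SOE OR ZCC9', '4OFCC89']
err = sum(disagreements(a, b) for a, b in zip(dimperio, v101_as_currier))
chars = sum(len(l.replace(' ', '')) for l in dimperio)
print(f'{100 * err / chars:.1f}% disagreement')
```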
By and large, I’m inclined to assume most of that reflects errors in the D’Imperio given the limitations of the Copyflo images, although I’ve always been a little worried about using the Voyn 101 transcription given that Glen definitely Had A Theory. In some cases, that will have a neutral impact on transcription — if you think the plume variations on Currier Z’s are important, and want to catalog them and include that info in your transcription, more power to you. In some cases, that will be irrelevant to the transcription — Glen was an acute observer of the “partial space” phenomenon, but it wasn’t relevant to his theory, so there’s not likely to be bias in his perception of them. On the other hand, Glen cared very, very much how the text lined up with Strong’s supposed numerical shift sequence, which at least potentially could bias whether he saw “cc” or “C” in his alphabet (for instance) (unless he utterly sequestered his transcription effort so that there was no feedback from his deciphering efforts in the process).
Speaking of white-space, a question for Rene — back in ’06 you posted a message to the mailing list arguing the position “I would be so bold as to say, that the very unusual idea that the spaces are randomly or arbitrarily inserted, really only comes out of the frustration that the VMs has remained undeciphered for so long. One is tempted to consider unusual possibilities. Nobody has, however, made any progress when using the assumption that spaces could be meaningless, so there is no great argument for this. On the other hand, there is an excellent argument indicating that spaces are word separators, namely that the so-called labels (stand-alone words in the MS) also occur in the MS main text, surrounded by spaces.
“While this does not necessarily mean that every group delimited by spaces also signifies a ‘word’, it is strong evidence that the spaces are neither randomly inserted, nor deliberately depending on preceding ‘dy’ or following
‘qo’.
“Statistics further confirm this. The single character frequency table shown before is one.
“Another one is that the space character is the one that has the highest conditional entropy w.r.t preceding and following characters, both for Latin sample texts and the VMs. It really stands out w.r.t. all other characters. (This only disproves the hypothesis that spaces are ‘typically’ inserted before and after certain other characters).”
Do you still hold that view? I’m sure at some point I’ve posted data like the following (counts from the D’Imperio transcription of Biological B):
Count  3-gram (space-straddling)  Count  Bigram (word-medial)
110 9/R 14 9R
160 9/2 6 92
300 9/8 20 98
302 9/E 21 9E
575 9/O 7 9O
1365 9/4 10 94
That there are only 0.733% as many “word”-medial 94’s as there are space-straddling 9/4’s (and 1.2% as many “word”-medial 9O’s) requires explanation if spaces are *only* word separators…
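For anyone wanting to reproduce counts of this kind, here is a sketch of the tally (my own reconstruction, with an invented toy line of Currier text): “9/4” counts cases where ‘9’ ends one word and ‘4’ starts the next, while “94” counts the same pair word-medially.

```python
def straddling_and_medial(lines, a='9', b='4'):
    # Count the pair (a, b) across a word boundary vs inside a word.
    straddle = medial = 0
    for line in lines:
        words = line.split()
        for w in words:
            medial += sum(1 for i in range(len(w) - 1) if w[i] == a and w[i + 1] == b)
        for w1, w2 in zip(words, words[1:]):
            if w1.endswith(a) and w2.startswith(b):
                straddle += 1
    return straddle, medial

toy = ['SOE9 4OFCC9 4OFAM ZC94OE']   # invented example line
print(straddling_and_medial(toy))    # (2, 1)
```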
Karl
Hi Karl,
just a quick initial response…
Your work will be much simplified once I have the various transcriptions all in the same file format. While the conversion is done almost entirely by S/W tools, the new versions will have to be checked somehow, so it may all take some time.
Anyway, your figure of 4% disagreement between such wildly different transcriptions can make one a bit more hopeful about the general quality of historical transcription efforts.
On the meaning of spaces, this was recently double-checked by Marco Ponzi in the Voynich Ninja forum, confirming that the label words tend to appear in the text separated by spaces, to a very large percentage.
At the same time, I have reason to believe that the label words are actually words picked from the text. (Unfortunately, I am also aware of arguments against that).
Since the label words have a completely different Zipf graph (much flatter), this ‘picking’ was not arbitrary. The suggestion is that these are somehow key words.
Rene,
While I already mentioned this in an email, for the benefit of other readers — I also compared f1r – f24v in D’Imperio’s transcription vs Voyn 101 translated to Currier (using a slightly updated/fixed version of my sed command file). Surprisingly, that sample is only ~2x as many (space-less) characters as the 1st two folios in Bio B. Of the 15559 characters in the manually-aligned interlinear, the transcriptions disagree at 4.63% of character locations (alternate readings + added/deleted characters). Same caveats as the previous comment re: differences in ligatures counting as multiple mismatches. So that confirms a ~4.5% rate of disagreement between the two transcriptions on non-space characters.
On the spaces as word delimiters issue:
Two key high-level methodological issues relate to this:
1) If “spaces are word delimiters” is true, how does that constrain possible solutions to the nature of the text?, and
2) What testable predictions follow from that claim?
WRT issue #1 above, the claim that a given glyph sequence in one place in the text represents the same plaintext letter sequence as the same glyph sequence in another place _is not a theory-free claim_. Currier “word” “SOEOR” occurs 2x in f1-f24 and once in Bio B (and there are a bunch of other words containing “OEOR” in both sections) — *BUT* — in f1-f24 OE = 4.675% and OR = 3.872% of digrams (spaces not included); in Bio B: OE = 4.644% and OR = 1.044% — that doesn’t give me warm fuzzies that SOEOR in Herbal A has the same meaning in Bio B.
WRT issue #2: The spaces-as-word-delimiters school has one edge over the alternative in that Stolfi already showed that labels follow the same binomial(ish) length distribution (http://www.ic.unicamp.br/~stolfi/voynich/00-12-21-word-length-distr/). What (if any) other testable predictions does this theory make? Should the glyph frequency distribution in labels match (in a chi^2 sense) the distribution in running text?
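One way the glyph-frequency question could be checked — a sketch under my own assumptions, with made-up toy strings standing in for the label corpus and the running text, and using scipy for the chi-square test:

```python
from collections import Counter
from scipy.stats import chi2_contingency

labels_text = 'otaldy okaly otoldy oky'             # toy stand-in for the label corpus
running_text = 'qokeedy qokedy chedy shedy daiin'   # toy stand-in for running text

label_counts = Counter(labels_text.replace(' ', ''))
running_counts = Counter(running_text.replace(' ', ''))
glyphs = sorted(set(label_counts) | set(running_counts))

# Two-row contingency table: glyph counts in labels vs in running text.
table = [[label_counts.get(g, 0) for g in glyphs],
         [running_counts.get(g, 0) for g in glyphs]]
chi2, p, dof, expected = chi2_contingency(table)
print(f'chi2={chi2:.1f}, p={p:.3f}')   # a small p would suggest the distributions differ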
Fairness demands observing that the some-spaces-are-inserted school makes two testable predictions that I don’t think have been explored:
a) A verbose cipher with rule-based space insertion to conceal the lengthening of words will produce space-delimited word fragments with a similar binomial(ish) distribution, and
b) the rank vs. frequency of the space-delimited word fragments will show a similar fit to Zipf’s Law
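A rough simulation of these two predictions is easy to set up; the following is entirely a toy construction of my own (not a claim about how Voynichese was actually generated): verbosely encipher random plaintext, insert spaces by a simple rule, then look at the length distribution and rank/frequency of the resulting fragments.

```python
import random
from collections import Counter

random.seed(1)
# Toy verbose cipher: each plaintext letter becomes 2-3 cipher glyphs.
VERBOSE = {c: random.choice(['o' + c, c + 'y', 'q' + c + 'e']) for c in 'abcdefghij'}

def encipher(word):
    return ''.join(VERBOSE[c] for c in word)

def insert_spaces(cipher, max_len=6):
    # Toy rule: break after 'y', or whenever a fragment reaches max_len glyphs.
    frags, cur = [], ''
    for ch in cipher:
        cur += ch
        if ch == 'y' or len(cur) >= max_len:
            frags.append(cur)
            cur = ''
    if cur:
        frags.append(cur)
    return frags

plaintext = [''.join(random.choices('abcdefghij', k=random.randint(3, 8)))
             for _ in range(2000)]
fragments = [f for w in plaintext for f in insert_spaces(encipher(w))]

print(Counter(len(f) for f in fragments))     # (a) does this look binomial-ish?
print(Counter(fragments).most_common(10))     # (b) eyeball the rank/frequency curve
```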
Karl
Karl: while I have long argued that the most likely balancing mechanism (i.e. to mitigate the lengthening effect of verbose cipher) was contraction (-8-) and truncation (-9-), there must surely be some places where long hard-to-remember words (e.g. obscure names, e.g. appeal to authority) had to be enciphered. These would end up unnaturally long when verbosely enciphered, and would – I predict – have had to be split up by inserting a fake space.
I’d suggest the place to look for these tells would be Q13: two consecutive longish words, neither ending in -y or -dy, and not exhibiting ‘classic’ Voynich word structure…
One problem I perceive, and have decided to resolve, is the fact that the various existing transcription files all use different formats and conventions. I decided to create a new convention that can represent all of them.
I already wrote about that in the voynich.ninja forum.
To cut a long story short, the information is contained at my web site and, without including any links here, can be accessed by going to the home page and clicking the button “8. Transcription of the text”.
The new bits start at the section called “The state of affairs in 2017”. There are also references to more detailed explanations on another page, reached via the “Read more” buttons.
I would very much appreciate hearing back from “Job”, because there is a very clear next step that could be taken, namely linking the transcription items (loci) to their locations on the folios of the MS.
Rene: the release of your transcription, intermediate format, and a first draft of an extraction tool is very big news here! I will of course be posting about all this very shortly…
Thanks Nick.
Since I tend to be critical, I also have to be critical about my own transcription, which was made 18 years ago, without the benefit of the digital scans.
I am sure that it can be improved in many places, but it still is a record from the not too distant past.
And I don’t feel like repeating the exercise, that’s for certain.