02 Jun 2017

Voynichese Task #2: parsing Voynichese into tokens…

As I wrote before, I think we have four foundational challenges to tackle before we can get ourselves into a position where we can understand Voynichese properly, regardless of what Voynichese actually is:

* Task #1: Transcribing Voynichese into a reliable raw transcription e.g. EVA qokeedy
* Task #2: Parsing the raw transcription to determine the fundamental units (its tokens) e.g. [qo][k][ee][dy]
* Task #3: Clustering the pages / folios into groups that behave differently e.g. Currier A vs Currier B
* Task #4: Normalizing the clusters i.e. understanding how to map text in one cluster onto text in another cluster

This post relates to Task #2, parsing Voynichese.

Parsing Voynichese

Many recent Voynichese researchers seem to have forgotten (or, rather, perhaps never even knew) that the point of the EVA transcription alphabet wasn’t to define the actual / only / perfect alphabet for Voynichese. Rather, it was designed to break the deadlock that had occurred: circa 1995, just about every Voynich researcher had a different idea about how Voynichese should be parsed.

Twenty years on, and we still haven’t got any consensus (let alone proof) about even a single one of the many parsing issues:
* Is EVA qo two characters or one?
* Is EVA ee two characters or one?
* Is EVA ii two characters or one?
* Is EVA iin three characters or two or one?
* Is EVA aiin four characters or three or two or one?
…and so forth.

And so the big point of EVA was to try to provide a parse-neutral stroke transcription that everyone could work on and agree on even if they happened to disagree about just about everything else. (Which, as it happens, they tend to do.)
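To make the parsing questions above concrete, here is a minimal sketch of how the same EVA string splits differently depending on which multi-glyph groups are treated as tokens. The `tokenize` function and both inventories are hypothetical illustrations, not anybody's established parse of Voynichese.

```python
def tokenize(word, inventory):
    """Greedy longest-match split of an EVA word into tokens.

    Any glyph not covered by a multi-glyph entry in `inventory`
    falls back to being a single-glyph token.
    """
    max_len = max((len(t) for t in inventory), default=1)
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one glyph.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if size == 1 or chunk in inventory:
                tokens.append(chunk)
                i += size
                break
    return tokens


# Two hypothetical inventories, chosen only to illustrate the disagreement.
STROKE_LEVEL = set()                         # every EVA glyph is its own token
GROUPED = {"qo", "ee", "dy", "ii", "aiin"}   # one possible grouping, not a claim

print(tokenize("qokeedy", STROKE_LEVEL))
print(tokenize("qokeedy", GROUPED))
print(tokenize("daiin", GROUPED))
```

Greedy longest-match is only one possible policy: a different policy (or a different inventory) would give a different parse, which is exactly the disagreement that EVA was designed to sidestep.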

The Wrong Kind Of Success

What happened next was that as far as meeting the challenge of getting people to talk a common ‘research language’ together, EVA succeeded wildly. It even became the de facto standard when writing up papers on the subject: few technical Voynich Manuscript articles have been published since that don’t mention (for example) “daiin daiin” or “qotedy qotedy”.

However, the long-hoped-for debate about trying to settle the numerous parsing-related questions simply never happened, leaving Voynichese even more talked about than before but just as unresolved as ever. And so I think it is fair to say that EVA achieved quite the wrong kind of success.

By which I mean: the right kind of success would be where we could say anything definitive (however small) about the way that Voynichese works. And just about the smallest proof would be something tangible about what groups of letters constitute a functional token.

For example, it would be easy to assert that EVA ‘qo’ acts as a functional token, and that all the instances of (for example) ‘qa’ are very likely copying mistakes or transcription mistakes. (Admittedly, a good few o/a instances are ambiguous to the point that you just can’t reasonably decide based on the scans we have). To my eyes, this qo-is-a-token proposition seems extremely likely. But nobody has ever proved it: in fact, it almost seems that nobody has got round to trying to prove anything that ‘simple’ (or, rather, ‘simple-sounding’).

Proof And Puddings

What almost nobody seems to want to say is that it is extremely difficult to construct a really sound statistical argument for even something as basic as this. The old saying goes that “the proof of the pudding is in the eating” (though the word ‘proof’ here is actually a linguistic fossil, meaning ‘test’): but in statistics, the normal case is that most attempts at proof quickly make a right pudding out of it.

As a reasonably-sized community of often-vocal researchers, it is surely a sad admission that we haven’t yet put together a proper statistical testing framework for questions about parsing. Perhaps what we all need to do with Voynichese is to construct a template for statistical tests for testing basic – and when I say ‘basic’ I really do mean unbelievably basic – propositions. What would this look like?

For example: for the qo-is-a-token proposition, the null hypothesis could be that q and o are weakly dependent (and hence the differences are deliberate and not due to copying errors), while the alternative hypothesis could be that q and o are strongly dependent (and hence the differences are instead due to copying errors): but what is the p-value in this case? Incidentally:

* For A pages, the counts are: (qo 1063) (qk 14) (qe 7) (q 5) (qch 1) (qp 1) (qckh 1), i.e. 29/1092 = 2.66% non-qo cases.
* For B pages, the counts are: (qo 4049) (qe 55) (qckh 8) (qcth 8) (q 8) (qa 6) (qch 3) (qk 3) (qt 2) (qcph 2) (ql 1) (qp 1) (qf 1), i.e. 98/4147 = 2.36% non-qo cases.

But in order to calculate the p-value here, we would need to be able to estimate the Voynich Manuscript’s copying error rate…
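As a rough sketch of why the error rate matters here: if we assume (and it is only an assumption) that every non-qo instance is an independent copying error occurring at some fixed rate, then the number of exceptions follows a binomial distribution, and we can ask how surprising the observed counts would be at each candidate rate. The candidate rates below are hypothetical placeholders, not measurements.

```python
from math import exp, lgamma, log

# Counts quoted above: instances of q, split into (qo, everything else).
Q_COUNTS = {"A": (1063, 29), "B": (4049, 98)}


def log_binomial_pmf(k, n, rate):
    """log P(X = k) for X ~ Binomial(n, rate), via lgamma to avoid overflow."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(rate) + (n - k) * log(1 - rate)


def binomial_upper_tail(k, n, rate):
    """P(X >= k): the chance of at least k exceptions at the assumed rate."""
    return sum(exp(log_binomial_pmf(i, n, rate)) for i in range(k, n + 1))


# ASSUMED_RATES are hypothetical candidate error rates, not measured values.
ASSUMED_RATES = (0.01, 0.02, 0.03)

for pages, (qo, other) in Q_COUNTS.items():
    total = qo + other
    print(f"{pages} pages: {other}/{total} = {100 * other / total:.2f}% non-qo")
    for rate in ASSUMED_RATES:
        tail = binomial_upper_tail(other, total, rate)
        print(f"  assumed error rate {rate:.0%}: P(at least {other}) = {tail:.4f}")
```

The tail probability swings heavily depending on which rate is assumed, which is why the error rate has to be pinned down before any p-value for the qo-is-a-token proposition means much.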

Voynichese Copying Error Rate

In the past, I’ve estimated Voynichese error rates (whether in the original copying or in the transcription to EVA) at between 1% and 2% (i.e. a mistake every 50-100 glyphs). This was based on a number of different metrics, such as the qo-to-q[^o] ratio, the ain-to-oin ratio, the aiin-to-oiin ratio, the air-to-oir ratio, e.g.:

A pages:
* (aiin 1238) (oiin 110) i.e. 8.2% (I suspect that Takeshi Takahashi may have systematically over-reported these, but that’s a matter for another blog post).
* (ain 241) (oin 5) i.e. 2.0% error rate if o is incorrect there
* (air 114) (oir 3) i.e. 2.6% error rate

B pages:
* (aiin 2304) (oiin 69) i.e. 2.9% error rate
* (ain 1403) (oin 18) i.e. 1.2% error rate
* (air 376) (oir 6) i.e. 1.6% error rate
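The percentages above are simple shares of the rarer form, and several rest on very small counts. A minimal sketch that recomputes each share and attaches a 95% Wilson score interval (a standard interval for a proportion, chosen here by me rather than taken from the post) shows how much uncertainty the small counts carry.

```python
from math import sqrt

# (common form count, rare form count) as quoted in the lists above.
PAIRS = {
    "A": {"aiin/oiin": (1238, 110), "ain/oin": (241, 5), "air/oir": (114, 3)},
    "B": {"aiin/oiin": (2304, 69), "ain/oin": (1403, 18), "air/oir": (376, 6)},
}


def minority_share(common, rare):
    """Share of the rarer form among all instances of the pair."""
    return rare / (common + rare)


def wilson_interval(rare, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = rare / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


for pages, pairs in PAIRS.items():
    for name, (common, rare) in pairs.items():
        share = minority_share(common, rare)
        low, high = wilson_interval(rare, common + rare)
        print(f"{pages} {name}: {100 * share:.1f}% "
              f"(95% interval {100 * low:.1f}% to {100 * high:.1f}%)")
```

Note that the interval only describes sampling uncertainty in the share itself: it says nothing about whether the rarer forms are actually errors.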

It’s a fact of life that ciphertexts get miscopied (even printed ciphers suffer from this, as Tony Gaffney has reported in the past), so it seems unlikely that the Voynich Manuscript’s text would have a copying error rate as low as 0.1% (i.e. a mistake every 1000 glyphs). At the same time, an error rate as high as 5% (i.e. every 20 glyphs) would arguably seem too high. But if the answer is somewhere in the middle, where is it? And is it different for Hand 1 and Hand 2 etc?
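One way to start on the "is it different?" question is a pooled two-proportion z-test on the counts quoted above. This sketch uses the A and B page counts as a stand-in for the two hands, which is an assumption on my part, and it treats each instance as independent, which may well not hold for Voynichese.

```python
from math import erfc, sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    return z, erfc(abs(z) / sqrt(2))


# (rare count, total count) on A pages versus B pages, from the lists above.
COMPARISONS = {
    "oiin vs aiin": ((110, 1238 + 110), (69, 2304 + 69)),
    "oin vs ain": ((5, 241 + 5), (18, 1403 + 18)),
    "oir vs air": ((3, 114 + 3), (6, 376 + 6)),
}

for name, ((xa, na), (xb, nb)) in COMPARISONS.items():
    z, p = two_proportion_z(xa, na, xb, nb)
    print(f"{name}: z = {z:.2f}, two-sided p = {p:.4g}")
```

A small p-value here would only say that the A and B shares differ: it cannot distinguish between different scribes making errors at different rates, different transcription difficulties, and deliberate differences in the text.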

More generally, is there any better way for us to estimate Voynichese’s error rate? Why isn’t this something that researchers are actively debating? How can we make progress with this?

(Structure + Errors) or (Natural Variation)?

This is arguably the core of a big debate that nobody is (yet) having. Is it the case that (a) Voynichese is actually strongly structured but most of the deviations we see are copying and/or transcription errors, or that (b) Voynichese is weakly structured, with the bulk of the deviations arising from other, more natural and “language-like” processes? I think this cuts far deeper to the real issue than the typical is-it-a-language-or-a-cipher superficial bun-fight that normally passes for debate.

Incidentally, a big problem with entropy studies (and indeed with statistical studies in general) is that they tend to over-report the exceptions to the rule: for something like qo, it is easy to look at the instances of qa and conclude that these are ‘obviously’ strongly-meaningful alternatives to the linguistically-conventional qo. But from the strongly-structured point of view, they look well-nigh indistinguishable from copying errors. How can we test these two ideas?

Perhaps we might consider a statistical study that uses this kind of p-value analysis to assess the likeliest level of copying error? Or alternatively, we might consider whether linguistic hypotheses necessarily imply a lower practical bound for the error rate (and whether we can calculate this lower bound). Something to think about, anyway.
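A very rough sketch of what "assessing the likeliest level of copying error" might look like: treat every rare form quoted earlier as an error occurring at one shared rate (a strong assumption, and one the 8.2% oiin figure clearly strains), then scan a grid of candidate rates for the one that best explains all the counts together.

```python
from math import log

# (label, common form count, rare form count), as quoted earlier in the post.
OBSERVATIONS = [
    ("A qo / q-other", 1063, 29),
    ("B qo / q-other", 4049, 98),
    ("A aiin / oiin", 1238, 110),
    ("A ain / oin", 241, 5),
    ("A air / oir", 114, 3),
    ("B aiin / oiin", 2304, 69),
    ("B ain / oin", 1403, 18),
    ("B air / oir", 376, 6),
]


def log_likelihood(rate, observations):
    """Binomial log-likelihood of one shared error rate (constants dropped)."""
    return sum(rare * log(rate) + common * log(1 - rate)
               for _, common, rare in observations)


# Hypothetical grid: candidate rates from 0.5% to 10% in steps of 0.1%.
GRID = [i / 1000 for i in range(5, 101)]

best_rate = max(GRID, key=lambda r: log_likelihood(r, OBSERVATIONS))
print(f"best shared rate on the grid: {100 * best_rate:.1f}%")

# How much worse does each observation fit at the shared rate than at its own?
for label, common, rare in OBSERVATIONS:
    own = rare / (common + rare)
    penalty = (log_likelihood(own, [(label, common, rare)])
               - log_likelihood(best_rate, [(label, common, rare)]))
    print(f"{label}: own rate {100 * own:.1f}%, log-likelihood penalty {penalty:.1f}")
```

Large penalties flag the pairs that a single shared error rate explains badly, which is one crude way of separating "plausibly just errors" from "something else is going on here".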

All in all, EVA has been a huge support for us all, but I do suspect that more recently it may have closed some people’s eyes to the difficulties both with the process of transcription and with the nature of a document that (there is very strong evidence indeed) was itself copied. Alfred Korzybski famously wrote, “A map is not the territory it represents”: similarly, we must not let possession of a transcription give us false confidence that we fully understand the processes by which the original shapes ended up on the page.

Posted in: Voynich Manuscript

97 thoughts on “Voynichese Task #2: parsing Voynichese into tokens…”

Milo Gardner on June 3, 2017 at 12:21 pm said:

Thank you for the clear four part explanation. Beginning with transliterations is often done when attempting to translate ancient texts. Parsing patterns hidden within transliterations differs with respect to the actual text. Each text must be studied separately.

For example, Egyptian hieratic math texts fall into different classes. The 26-line EMLR is the simplest. The EMLR scaled the rational numbers 1/p and 1/n, sometimes in multiple ways: 1/8, for example, was scaled three times, twice with one LCM and once with two LCMs, as a beginning student was introduced to unit fraction math around 1900 BCE.

The 51-member 1650 BCE RMP 2/n table concisely scaled rational numbers by one LCM, in a manner that was exactly reported in the introduction to the Kahun Papyrus, an 1800 BCE text.

Hieratic weights and measures texts that discussed grain volume topics included two LCMs when scaling to quotients and remainders. Quotients were scaled to 1/64 units and remainders were scaled to 1/320 units. Scholars beginning in 1906 transliterated to cubit units of a 1900 BCE text, the Akhmim Wooden Tablet, housed in the Cairo museum. By 1923 Peet scaled this five problem text to a 1/320 unit and suggested he understood quotients and remainders, when only remainders were fairly parsed. By 2002 Vymazalova showed that two part quotient and remainder answers, obtained by multiplying an unknown initial value by 1/3, 1/7, 1/10, 1/11 and 1/13, were multiplied by 3, 7, 10, 11 and 13 and returned the same value 64/64, a unity value, incorrectly concluding that Peet’s 1923 1/320 paradigm was correct. Only in 2006 was it published that 64/64 was the initial and final term operated on in a manner that scholars had not correctly seen in a 100 year old decoding project.

Many linguistic scholars to this day improperly combine all hieratic unit fraction texts, throwing the baby out with the bath water. Linguistic scholars incorrectly conclude that Egyptian division was based in single false position, a math concept modified by medieval scribes to visually solve roots of second degree equations.

The actual division operation used by hieratic scribes from 2000 BCE to 1650 BCE was inverse to the multiplication operation, a number theory property hidden in the scribal shorthand notes, the same rule that modern math uses in modern base 10 arithmetic. As an aside, unit fraction arithmetic formally ended when rational numbers were encoded in 1585 AD by Stevin as approved by the Paris Academy, and simplified in ways that our school children memorize today, without being told that Egyptians 4000 years ago had developed the number theory based rules for our four arithmetic operations, in a closely related base 10 number system.

Conclusion. Extreme care must be taken when combining transliterated databases that encode language and mathematical information.

Best Regards,

Milo Gardner
Reading the past by allowing the ancient texts to speak for themselves.
Torsten Timm on June 3, 2017 at 5:26 pm said:

Dear Nick,

the text of the Voynich manuscript didn’t fit your expectations for an enciphered text. No problem, you can explain it with systematic copying errors. Unfortunately the copying error rate varies from 1.2 up to 8.2%? Again this is no problem, you can explain the 8.2% with systematic transcription errors. But why is the number of transcription errors for Currier A much higher than for Currier B?
nickpelling on June 3, 2017 at 9:55 pm said:

Torsten: I’ll come back to the 8% figure in a separate post, it’s not as if I’m trying to run away from it. But there is obviously also an issue to be tackled about what are scribal errors (missed letters, reversed pairs, miscopied letters, etc) and what are transcription errors.

My general point about statistical tests is that you can calculate whether any differences oppose your null hypothesis, but you need to do a certain amount of preparatory work first.

As for the difference between A and B, the obvious explanations would be that they were written by different people (Currier’s Hand 1 and Hand 2), and that the different writing styles presented different transcription challenges. But what is perhaps more interesting is that we should be able to cross-reference different transcriptions to try to isolate at least some of the differences.
Young Kim on June 3, 2017 at 11:28 pm said:

People who visit this venue, including myself, have been seeing lots of debate going on between others over everything about the Voynich Manuscript, and also hearing many claims that the mystery of the Voynich Manuscript has finally been solved. But it is really unfortunate that we don’t actually find anyone who has witnessed any evidence that supports those claims. So, here I have a proposition to make. Why don’t we set up a Voynich Manuscript Translation Challenge with a prize? Any individual or group of people who have teamed up can submit their translation results by a deadline. If I am not wrong, Nick once tried to crowdfund his own project, and I think he could set up a crowdfund to raise the prize money for the open challenge. I recommend Nick Pelling, Rene Zandbergen, and Job (sorry that I didn’t catch your name) to be the organizing members to start with. Both Nick and Rene have their reputations in this community, and it seems to me that Job could organize the web-site that might be needed. One or a few paragraphs out of the whole manuscript, selected by the organizers, can be used by participants. They can use any font or transcription that they please, but they should make them freely available to others. If I may, I would like to add one thing: the approach and method the participants take should address the Voynich text in its original form, not in encoded form. For example, the word ‘8aiin’ should be shown in the original Voynichese letters in their presentation, so that anyone who is not familiar with the manuscript can understand their explanation. Some may ask who could possibly have the authority to decide on the winner, but I think the translated outcome will speak for itself.

What do you think?
Torsten Timm on June 3, 2017 at 11:59 pm said:

Nick: Another possibility is that the scribe was purposefully writing ‘oi’ instead of ‘ai’.

See for instance page f8r. On this page it is possible to find at least four ‘s’-glyphs that were changed by an additional quill stroke from ‘e’ into ‘s’. See for instance the ‘s’ in ‘chsey’ in line f8r.P3.16. In my eyes the scribe was writing ‘cheey’ and later changed the first ‘e’ into ‘s’. The word ‘cheey’ occurs 174 times and the word ‘chsey’ occurs only twice. This correction suggests that the scribe was purposefully writing ‘chsey’ instead of ‘cheey’. Maybe the scribe was also purposefully writing ‘oiin’ instead of ‘aiin’.

Anyway, we can’t change the transcription to improve our statistics. We have to accept that the scribe wrote ‘chsey’ twice even if the word ‘cheey’ exists 174 times. In the same way we have to accept that the scribe sometimes wrote ‘oi’ even if ‘ai’ is twenty times more common than ‘oi’.

BTW: The existence of ‘chsey’ beside of ‘cheey’ suggests that it is not possible to parse ‘cheey’ as [ch][ee][y]. In the same way the existence of ‘qokesdy’ beside of ‘qokeedy’ suggests that it is not possible to parse ‘qokeedy’ as [qo][k][ee][dy].
Rene Zandbergen on June 4, 2017 at 8:16 am said:

Scribal errors and transcription errors both almost certainly exist.

The transcription errors we can fix now, and yes, Takeshi did introduce a number of consistent ‘features’.

As regards the scribal errors, we are stuck with them. We can at best guess about some possible cases, but we will only know if (ever) the text can be read.
nickpelling on June 4, 2017 at 8:37 am said:

Rene: if Voynichese is strongly structured (and I don’t believe we can easily eliminate this possibility) then in many cases we can suggest likely corrections. For example, qtoy could be qoty, qdain could be qodain, oiin would be aiin, etc.
Torsten Timm on June 4, 2017 at 9:59 am said:

Nick: We can for sure correct the transcription errors. For instance we can correct ‘schol sair’ into ‘schol saim’ in line f8r.T3.21. We can also mark glyphs hard to identify like ‘y’ in ‘chy taiin’ in line f8r.P1.5. But we can’t know what a scribal error is and what not.
If we were to replace all 335 instances of ‘oi’ with ‘ai’, we would not only be using our own interpretation of the text, we would also be wiping out some information. Let’s see what happens if we do it as an experiment. First, there are words like ‘qoiiin’. Would you replace ‘oiin’ with ‘aiin’ and therefore ‘qo’ with ‘qa’ in these cases? Secondly, besides the most frequent word ‘daiin’, the word ‘saiin’ also exists. The word ‘daiin’ exists 863 times and the word ‘saiin’ occurs 144 times. How will you handle the suggestion to correct all instances of ‘saiin’ into ‘daiin’?

BTW: ‘qtoy’ and ‘qdain’ don’t exist in the Voynich manuscript.
nickpelling on June 4, 2017 at 10:04 am said:

Torsten: the issue of chsey is very interesting indeed! It may be that what we are glimpsing here instead is not so much scribal error as one of the evolutionary steps in the formation of the writing system.
nickpelling on June 4, 2017 at 10:09 am said:

Young Kim: I think that EVA has given a sense of false confidence to a whole generation of Voynich researchers. We’re not even remotely close to the point where this kind of competition would be anything more than trollbait, sorry. 🙁
nickpelling on June 4, 2017 at 10:34 am said:

Torsten: I wasn’t making a specific point but a general point. Many of the things which ‘just look wrong’ to Voynich experts such as yourself do so because they seem to violate one or more of the strong structuring “rules” which appear to dominate Voynichese. Words such as ‘qoiin’ violate so many of the structuring rules simultaneously that suggesting corrections becomes extremely hard (for what it’s worth, this seems more likely to me to have been a copying omission error for ‘qodaiin’), but in the (I believe) majority of cases, the scribal copying errors aren’t quite as abstruse as that.

My point remains that if the nature of Voynichese is that it is strongly structured (but miscopied), then we already know more than enough in a very large number of cases not only to identify likely scribal errors but also to offer likely corrections. None of this would be easily apparent for someone arriving at Voynichese from cold.
Rene Zandbergen on June 4, 2017 at 11:02 am said:

Nick, your:
“For example, qtoy could be qoty, qdain could be qodain, oiin would be aiin, etc.”
is precisely what I meant with:
“We can at best guess about some possible cases”.

Young Kim: prize money is a bad idea and I will not have anything to do with it.
Torsten Timm on June 4, 2017 at 11:46 am said:

Nick: My point was about the impact of your corrections for the text in general. When you start correcting words you should define when it is allowed to correct a word. Otherwise you will end with a text full of similar or equal words. Even without any corrections sequences like ‘qokeedy qokeedy qokedy qokedy qokeedy’ in line f75r.P.38 are at least strange.

BTW: It is possible to find an exception for nearly every rule you are able to define for the manuscript. In my eyes this is a rule for the Voynich manuscript.
Mark Knowles on June 4, 2017 at 12:05 pm said:

Young Kim: A prize is something I have thought about. Please email me at [email protected] to discuss this further
nickpelling on June 4, 2017 at 1:54 pm said:

Torsten: I don’t believe that we can know for certain whether any corrections we can suggest are actually correct. However, what we can say in a very large number of cases is that “[X] appears malformed according to our best current understanding of how Voynichese appears to work, and the word as it was originally formed probably looked like [Y]”.

It may well be that a rectified transcription of this form – though necessarily incomplete and subject to guesswork – may prove to be a substantially better starting point for future analysis than an unrectified transcription.
nickpelling on June 4, 2017 at 1:59 pm said:

Young Kim: for the avoidance of any doubt, I think any such prize would be fool’s gold. The only current way to prove a Voynichese decryption would seem to be via the block paradigm, i.e identify a passage whose plaintext appears elsewhere and demonstrate how the 1-to-1 mapping works: and you don’t need a prize to do that.
Torsten Timm on June 4, 2017 at 2:02 pm said:

Nick: The Voynich manuscript only contains similar words. How can you say that one similar word is malformed and another not?
Mark Knowles on June 4, 2017 at 2:25 pm said:

Nick: So am I to understand that you are saying that if one can find no such passage elsewhere it would be impossible to prove a Voynichese decryption? (You use the word “current” and I am not sure what you mean by that here.)
nickpelling on June 4, 2017 at 2:32 pm said:

Mark: because there seems to me to be a high likelihood that some kind of abbreviation is going on, any decryption will very likely involve some kind of creative reconstruction of that-which-has-been-abbreviated. And the only way to prove that is correct would be to identify a parallel text with the same contents.
nickpelling on June 4, 2017 at 2:34 pm said:

Torsten: though it contains similar words, the statistics of their occurrences are far from flat.
Mark Knowles on June 4, 2017 at 2:38 pm said:

Nick: Surely there are other ways to prove, for all practical purposes, a decryption. The individual needs to describe their method of decryption very precisely and unambiguously (pretty much as an algorithm) and provide the text of a full decryption of the manuscript using this method. Then can someone not check the decryption by selecting a random passage, applying the method described and verifying that the text matches? Obviously the resultant text must be meaningful, intelligible and readable. One would also most likely expect the text to have some consistency with the images and overall consistency throughout the manuscript.

Sure, one could make theoretical objections to this, as one could with the block paradigm. Such as:

* It is conceivable that the author encrypted gibberish.
* It is conceivable that there is more than one consistent plausible solution to the manuscript. (I think the chance of this is vanishingly small.)
* The manuscript is a hoax.
Mark Knowles on June 4, 2017 at 2:43 pm said:

Nick: In reply to your last comment as I only saw that after I posted my previous comment. I am also of the opinion that abbreviation has occurred maybe even radical abbreviation. However when you mention creative reconstruction then we have to be aware of reasonable levels of imagination/creativity. I think abbreviation has to and would conform to some pattern of abbreviation and therefore could fit within that framework.
nickpelling on June 4, 2017 at 2:49 pm said:

Mark: the issue is one of proof, specifically how to prove that a given decryption is correct. For meta-theories (such as hoax or gibberish), different kinds of proof would be needed.
Mark Knowles on June 4, 2017 at 3:00 pm said:

Nick: Given my understanding of your block paradigm; correct me if I am wrong as I have not studied it in detail:

I would say it certainly makes sense to compare the Voynich with other texts to see if one can identify commonalities. And given that, your Block Paradigm approach seems eminently sensible if you can find such a parallel text.

However it seems to me that potentially proving a connection in the way you suggest could be fraught with difficulties. One could believe there is a clear parallel between a part of the Voynich and another text, however one could easily be wrong. For example Rene mentioned to me a drawing in John Bunyan’s Pilgrim’s Progress which had a similar appearance to the Top Right rosette on my “favourite” page. Now he mentioned it illustratively, but someone could make a literal parallel. There could be only a partial parallel between two texts.
Mark Knowles on June 4, 2017 at 3:16 pm said:

Nick: I think there is no way one could prove the hoax theory for certain as there could be just one line in the whole manuscript which is meaningful and the rest could be nonsense. Proving that there is no such line would be impossible I think. So I feel sorry for the hoax theorists as they have an uphill battle demonstrating, by statistical means or others, that it is extremely likely to be a hoax.
nickpelling on June 4, 2017 at 3:24 pm said:

Mark: the only way someone could prove the hoax theory is if they had historical evidence that it was a hoax, i.e. a 15th century letter describing making it or selling it. Which is possible but… somewhat unlikely.
nickpelling on June 4, 2017 at 3:27 pm said:

Mark: what I’m suggesting is that having a block match would mean that the nature of the mapping would become extremely clear within even a line of text, and almost beyond any doubt inside two or three lines. Whereas a carefully-manipulated decryption could be sustained for perhaps even pages without it becoming any clearer whether or not it was correct.
Mark Knowles on June 4, 2017 at 3:31 pm said:

Nick: For clarity I take your use of the word proof here to mean of extremely high likelihood i.e. virtually certain. (e.g. 99.99% probability or whatever) Ultimately one will have to exercise human judgement as whether a claimed “decrypted” text is nonsensical or meaningful.

To quote Rene in an email he sent me:

“While some time was spent to define some criteria, in the end, it will be simple. The correct solution will be recognised immediately.” (I hope that I have not misrepresented him with this quote, but I don’t think I have.)

I don’t have quite the level of confidence he expresses in this quote, but I am inclined to the view that it should not be too difficult to spot the correct solution. I felt it important to have criteria and a testing procedure to be fair and rigorous.
Mark Knowles on June 4, 2017 at 3:40 pm said:

Nick: I agree that having a beautiful neat block match would be wonderful. However it seems to me that we have to contend with the possibility that there is no viable block match out there. I don’t object to people looking for one, of course. If you or anyone else has an idea where to look for such an example I think that is a worthy cause. I would think it unwise to pin all one’s hopes for proof on that.

I must confess that I have very little familiarity with medieval herbal or astrological manuscripts and maybe there are many very closely parallel manuscripts; if so they should be invaluable in deciphering the manuscript.

I agree that Astrological charts look like a good place to start looking for parallels.
nickpelling on June 4, 2017 at 3:53 pm said:

Mark: while it is entirely possible that an extraordinarily clever researcher / analyst could propose the correct decryption (and for all the right reasons), I think they would (because of what I suspect will be an interpretative component of any decryption) still have a further mountain to climb to find a way to prove the correctness of their decryption.

In the end, just about the only way of doing this that I can currently see would be to find a parallel text and use that to demonstrate how the two halves mesh together. The big (and, I think, novel) point about the block paradigm is that it proposes we should instead use historical and textual analysis tricks to identify the block first, and only then work back from there.
Torsten Timm on June 4, 2017 at 3:54 pm said:

Nick: If I understand you right you argue that you want to use the frequency to decide if something should be replaced or not. There are 196 words for which you suggest to replace ‘oiin’ with ‘aiin’. There are only nine words containing a sequence ‘chse’. If you want to replace ‘oiin’ with ‘aiin’ I don’t see any good reason for not replacing ‘chse’ with ‘chee’.

BTW: Of course the statistics of their occurrences are not flat. Have you noticed that ‘aiin’ is more frequent than ‘ain’ and that ‘oiin’ is also more frequent than ‘oin’? This is not a coincidence. Therefore the frequencies must build a geometric series.
nickpelling on June 4, 2017 at 4:15 pm said:

Torsten: no, you didn’t understand me right. I want to use the nature of Voynich’s strong structuring in order to help predict corrections to mistakes that appear to have been in the text right from the start, rather than to just use instance counts or geometric series.

As far as chse and ches goes, I’m sure we could both easily find some cases where we might well be looking at an sh where the horizontal bar has been accidentally omitted: there are plenty of other non-obvious things going on on f8r as well that make me wonder whether this might be some kind of transitional page. But all of this is far beyond the scope of what I can reasonably debate in the small margins of a comment field. :-/
Mark Knowles on June 4, 2017 at 4:26 pm said:

Nick: The question of an interpretative component of any decryption is an important one. We can both speculate to what extent there will be one or not. When it comes to single word isolated labels I would think it much easier to get around questions of interpretation.

I agree that with enough interpretation a random jumble of words could be made to have meaning. So again in the end some human judgement will have to be involved.

I have no problem with the block approach especially if you can provide me with text identifications on that basis it would be great. In the meantime until a suitable block is found we must work on the basis that one might not exist.

In fact I believe what others have suggested is a small (3 word), and I think justifiable, “Block paradigm”. (The Europe, Africa, Asia bizarre circle divided into 3. A long time ago I thought this circle represented Venice, but now I accept this very peculiar medieval representation of the continents.)
Torsten Timm on June 4, 2017 at 5:01 pm said:

Nick: If I understand you right you decide by your understanding of the nature of Voynich structuring what a copying error is. Is this correct?

BTW: The word ‘chshy’ only occurs twice. In the same way it would be possible to argue that ‘chshy’ is a miscopied ‘chsey’. Therefore such an explanation explains nothing.

BTW: The idea behind the geometric series argument was that it is not possible to make a black or white decision based on numbers of a geometric series.
nickpelling on June 4, 2017 at 5:02 pm said:

Mark: I’ve posted about quite a few possible block paradigm matches:
http://ciphermysteries.com/2014/12/07/introducing-the-block-paradigm-for-voynich-manuscript-research-part-1
http://ciphermysteries.com/2014/12/24/voynich-block-2-the-recipe-section
http://ciphermysteries.com/2014/12/30/voynich-block-3-magic-circles
http://ciphermysteries.com/2016/02/10/the-voynich-zodiac-section-a-block-paradigm-match

Plus various posts trying to pursue at least one block source of Q20:
http://ciphermysteries.com/2016/01/24/the-book-hidden-inside-voynich-quire-20
http://ciphermysteries.com/2016/01/29/the-italian-colour-recipes-research-tree
http://ciphermysteries.com/2016/01/28/quire-20-this-is-how-the-book-is-found
nickpelling on June 4, 2017 at 5:18 pm said:

Torsten: the core idea is to build up a (completely optional) set of adjustments to the basic transcription that attempts to correct sequences that seem not to fit the core Voynichese word template, and which also seem to have a straightforward alternative reading.

It would not be hard to make up a decent-sized list (perhaps as many as a thousand? I don’t actually know) of these: and it would also be easy to build up a list of words that don’t “look right” but for which there is not an obvious alternative reading, such as the chshy/chesy example you give (though I doubt this latter list would be quite as large).
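
As a rough illustration of what such an optional adjustment layer might look like, here is a minimal Python sketch. The `Adjustment` record, the `apply_adjustments` helper and the locus labels are all hypothetical, and the single chshy / chsey entry is only there because it is the example raised in this thread, not because anyone has established it as a correction.

```python
from typing import NamedTuple, Optional

class Adjustment(NamedTuple):
    locus: str               # placeholder locus label, not a real line reference
    transcribed: str         # word as it stands in the raw transcription
    proposed: Optional[str]  # None: looks wrong, but no obvious alternative reading

# Hypothetical entry, taken from the chshy / chsey example in this thread.
ADJUSTMENTS = [
    Adjustment("locus-1", "chshy", "chsey"),
]

def apply_adjustments(words, adjustments, enabled=True):
    """Return (locus, word) pairs, corrected only if the layer is switched on."""
    if not enabled:
        return list(words)
    lookup = {(a.locus, a.transcribed): a.proposed
              for a in adjustments if a.proposed is not None}
    return [(locus, lookup.get((locus, word), word)) for locus, word in words]

raw = [("locus-1", "chshy"), ("locus-2", "chedy")]
print(apply_adjustments(raw, ADJUSTMENTS))                 # adjusted reading
print(apply_adjustments(raw, ADJUSTMENTS, enabled=False))  # raw transcription
```

Keeping the adjustments in a separate table, rather than editing the transcription itself, is what would keep the layer completely optional: anyone who distrusts a given correction can switch it off or drop that one entry.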

It’s perhaps important to say that for me this isn’t simply about word instance counts in the way that your Voynich study has followed: rather, I think we should be able to build up Markov diagrams built around parsed tokens, particularly for different clusters. But that is a substantial topic for another day.
Torsten Timm on June 4, 2017 at 6:36 pm said:

Nick: Besides the word type ‘chedy’, the Voynich manuscript also contains the word types ‘chey’, ‘cheedy’, ‘ched’ and ‘chsdy’. All these word types are typical for Currier B and rare or missing in Currier A. Someone with the core Voynichese word template for Currier A in mind would probably dismiss even an instance of ‘chedy’ (the third most frequent word type) as a misspelled version of ‘cheody’.

BTW: My word grid for the Voynich manuscript is only the simplest way to describe connections between similar word types. You can call a common glyph combination a token and build Markov diagrams for these tokens. But this doesn’t change the fact that you are still describing common glyph combinations and similar word types containing these glyph combinations.
Young Kim on June 4, 2017 at 8:23 pm said:

Nick Pelling, Rene Zandbergen, and Job: I am really sorry if I offended you in any possible way with my previous post and I sincerely apologize for that.

There are pros and cons to my proposition and, to be honest, I didn’t think much about the cons of an open challenge with a prize. I don’t know if people would get a different impression if it were called a Challenge Award rather than just a Challenge with a prize. The award doesn’t necessarily have to be in monetary form, though. It shouldn’t matter, I guess, since I know there are open challenges to the public offering a prize in various academic fields. Personally I don’t think the award money attached to the Nobel Prize tarnishes the spirit of the honor.

My intention was to encourage people to come forward with their works on the Voynich Manuscript in a tangible form. Since we have been hearing many self-claims, I think it is time to see them in front of our eyes. I thought the translation of a paragraph long Voynich text could be enough to show the proof of acceptance. Or a sentence long Voynich text would be acceptable if how Voynich sentences comprise a paragraph can be explained. Maybe it is a bad idea in the first place, or maybe we are not there yet, but I was hoping such an open challenge with or without a prize would motivate people in a constructive way.
nickpelling on June 4, 2017 at 8:25 pm said:

Torsten: I will be discussing clusters in Part #3.

The point about Markov chains is that – if you do them properly – they encapsulate a great deal of information in a very compact way, far more than just frequency counts and word grids.
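
To give a feel for what gets encapsulated, here is a minimal sketch that builds first-order transition counts over parsed tokens. It assumes the words have already been parsed (the sample parsings are hypothetical) and it treats each token as its own Markov state, which is a simplification rather than a claim about how the states ought to be defined.

```python
from collections import Counter, defaultdict

# Hypothetical pre-parsed words: each word is a list of tokens.
parsed_words = [
    ["qo", "k", "ee", "dy"],
    ["qo", "k", "e", "dy"],
    ["qo", "k", "ee", "dy"],
    ["d", "aiin"],
]

def transition_counts(words):
    """Count token-to-token transitions, with explicit word start/end states."""
    counts = defaultdict(Counter)
    for tokens in words:
        path = ["<start>"] + list(tokens) + ["<end>"]
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return counts

def transition_probabilities(counts):
    """Normalise each state's outgoing counts into probabilities."""
    probs = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        probs[state] = {tok: n / total for tok, n in followers.items()}
    return probs

counts = transition_counts(parsed_words)
probs = transition_probabilities(counts)
for state, followers in probs.items():
    print(state, followers)
```

Running the same code separately over the words of each cluster would give one diagram per cluster, which could then be compared edge by edge.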
Ken on June 4, 2017 at 8:57 pm said:

sequences like ‘qokeedy qokeedy qokedy qokedy qokeedy’ in line f75r.P.38

Obviously qokeedy means “buffalo” :-).
Torsten Timm on June 4, 2017 at 8:58 pm said:

Nick: The problem with Markov chains is that they are based solely on their present state. Therefore, for the Voynich manuscript, they are only suitable at glyph level, not at word level.
nickpelling on June 4, 2017 at 10:34 pm said:

Torsten: that’s not actually true. A full Markov chain would include every node in a network: you should only collapse together (or merge via hidden nodes) those nodes that share the same behaviour. The answer is much more complex than you think.
Torsten Timm on June 5, 2017 at 12:12 am said:

Nick: You misunderstood my argument. The problem is not the Markov chain; the problem is the missing word order in the Voynich manuscript.

BTW: If you think that the Voynich manuscript is too complex for the auto-copying method you should explain your point of view.
nickpelling on June 5, 2017 at 7:01 am said:

Torsten: as long as you understand that Markov states aren’t necessarily the same as tokens, we’re doing OK here. 🙂
Rene Zandbergen on June 5, 2017 at 8:26 am said:

Young Kim: no offense whatsoever 🙂
Mark Knowles on June 5, 2017 at 9:21 am said:

Young Kim: I have been exploring the idea of prize money.
Rene Zandbergen on June 5, 2017 at 2:55 pm said:

The Voynich MS was solved twice last week.
One solution has not yet been published.
The other is on a web site. I’ll ask the author if it’s OK to provide the link. It’s in Slovak….

Shall we just split the prize money between them?

Seriously though. It seems that both Nick and myself would be considered to be ‘judges’ in this. Now both of us have clearly stated that this is a bad idea, and I can say that we both know the ‘world of Voynich “research”‘ quite well.
This judgment is being completely ignored.

That kind of proves the impossibility of the idea.
J.K. Petersen on June 6, 2017 at 8:20 pm said:

Young Kim wrote:

” I thought the translation of a paragraph long Voynich text could be enough to show the proof of acceptance. Or a sentence long Voynich text would be acceptable if how Voynich sentences comprise a paragraph can be explained. ”

From my experience, a sentence is not enough. I’ve been able to extract whole phrases and even a few sentences in a small number of languages with cohesive systems that are documentable, which do not rely on anagrams, and which DO generalize to certain other parts of the manuscript. But I know they are not solutions. Besides not generalizing sufficiently, I know that the results are simply the consequence of matching languages to passages in the VMS that exhibit similar structure. There is enough text in the VMS that one can find inter-relatable patterns for almost any short phrase.

Many of the claimed solutions look plausible if the methodology is flexible enough to allow subjective interpretation during some part of the process. Anagram solutions that do not follow a specific pattern for unraveling the characters, for example, are highly suspect, along with those that have some help from Google Translate to wrestle them into grammatical form that doesn’t actually exist in the translation.

Take Strong’s solution as an example. It looks plausible and won him publication in academic journals, but if you look at his notes, you’ll find that half-way through, he abandoned defensible methodology, for what he claimed was a Trithemius cipher, and applied subjective selection to cherry-pick words that sounded good together and related to some of the drawings. Without careful analysis of his notes, one cannot see the flaws in his method.

So, every step of the methodology has to be scrutinized along with the translation (and not everyone claiming a solution seems capable of clearly describing the method).

It would probably be a monumental task, and possibly a colossal waste of time, for unpaid judges to go through thousands of ill-conceived or downright crackpot solutions that would be submitted if prize money were involved. Not to mention the difficulty of assessing “solutions” in dozens of different languages.
Davidsch on June 16, 2017 at 2:16 pm said:

This message is for the person(s) that will make a new transcript, if any.

Is it possible to include 3-dimensional information of the letter or word or line itself? That would increase the shelf-life of EVA 2.0

For example the word Fachys, is on location (x,y, folio) 20,60 on f1r.
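
A minimal sketch of how such positional information might be carried alongside each word. The `LocatedWord` record and the `words_in_region` helper are hypothetical names, and the coordinate units are left unspecified because the example above does not state them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LocatedWord:
    eva: str    # the word in EVA
    folio: str  # e.g. "f1r"
    x: float    # horizontal position on the folio (units are an open assumption)
    y: float    # vertical position on the folio (same caveat)

def words_in_region(words, folio, x_min, x_max, y_min, y_max):
    """Return the words on one folio that fall inside a rectangular region."""
    return [w for w in words
            if w.folio == folio
            and x_min <= w.x <= x_max
            and y_min <= w.y <= y_max]

# The example from the comment above: fachys at (20, 60) on f1r.
words = [LocatedWord("fachys", "f1r", 20, 60)]
print(words_in_region(words, "f1r", 0, 50, 50, 100))
```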
Davidsch on June 16, 2017 at 2:21 pm said:

Something else I forgot to mention. I am incapable of reading everything there is, so I do not know if this has been mentioned ever before:

The EVA P and EVA F seem to exist in two flavours:
one with a straight left finish, and one with a left curl that seems to be a ligature of eva [c] + long eva [q] + eva [L]

All together they make 4 characters to transcribe for the current F and P.
Mark Knowles on December 2, 2019 at 5:02 pm said:

Having found the EVA alphabet not directly suited to my needs (although it is certainly vital in so far as the transcriptions are written in it), it looks to me as though, of the alphabets that I have seen so far, the Currier alphabet is closest to what I am looking for, though I may produce my own version as I will probably parse the odd thing differently.

Currier says:

cth -> Q
cph -> W
ckh -> X
cfh -> Y

Which fits with my own working notion of these each being one symbol, rather than the 3 in EVA or the 4+ in stroke transcriptions.

However I don’t know where other people have documented their other parsing alphabets as Currier is quite old. It would be interesting to see which other parsing alphabets have been produced, so I can decide which I will adopt rather than having to invent my own or use a modified Currier alphabet.

Obviously, when I have decided on the alphabet I can do a find and replace on the transcription to put things in a form useful to me.

When I do frequency counts or the like I don’t think I will find it helpful to treat cth as three characters.
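
The find and replace step can be sketched in a few lines of Python. The four mappings are the ones quoted above; the function name is hypothetical, and the sketch assumes plain lowercase EVA input so that the capital letters cannot collide with anything already in the text.

```python
import re

# The four mappings quoted above (EVA sequence -> single symbol).
BENCHED_GALLOWS = {"cth": "Q", "cph": "W", "ckh": "X", "cfh": "Y"}

_PATTERN = re.compile("|".join(re.escape(k) for k in BENCHED_GALLOWS))

def collapse_benched_gallows(eva_text):
    """Replace each three-letter EVA sequence with its single-symbol equivalent."""
    return _PATTERN.sub(lambda m: BENCHED_GALLOWS[m.group(0)], eva_text)

print(collapse_benched_gallows("chckhy cthy"))
```

After this pass, a plain character count treats each of these four shapes as one symbol rather than three.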
Mark Knowles on December 2, 2019 at 6:31 pm said:

The idea of ‘p’ being one letter and ‘ch’ two letters just seems odd as the former is a significantly more complex symbol than the latter.
nickpelling on December 2, 2019 at 6:44 pm said:

Mark: for the squiddly-umpteenth time, EVA is a partially stroke-based transcription alphabet, designed so that you can make all those stroke-grouping choices yourself. Think ee or eee should be single glyphs? With EVA, you get to choose, that’s the point. It’s an unparsed transcription.
Mark Knowles on December 2, 2019 at 7:15 pm said:

Nick: Yes, I know it is. It’s just that one thinks that something like:

t -> qp
p -> qs
k -> lp
f -> ls

would be more stroke based. Being only “partially” stroke based just makes it feel like it is neither fish nor fowl.

I can invent my own parsing, but it is surprising that there seems little discussion of different parsing standards or parsed alphabets. I am not saying everyone needs to agree on how to parse EVA into some other system, but it would be nice if there were at least a few agreed on different frameworks i.e. “I parse using parsing standard/alphabet 4(A)”, so anyone could know how someone else is parsing Voynichese and so anyone could adopt a parsing they wish to.

I would ideally like not to have to invent my own way of parsing, but look at a few different suggestions and then decide which variant I am going to go with and possibly change the parsing I am using if I feel it is unsatisfactory.
nickpelling on December 2, 2019 at 7:27 pm said:

Mark: as I keep on saying, parsing was a hot topic twenty years ago, but almost nobody now seems to care about it. I’ll try to put together a post summarizing some of the debate, but it’s a bit of a blurry topic. 🙁
Mark Knowles on December 2, 2019 at 8:04 pm said:

Nick: It seems hard to understand how parsing could not be a hot topic unless there are generally agreed ideas about parsing already. I am now becoming suspicious that for some people EVA is their parsing i.e. their text is unparsed EVA.

I just have a subset of Voynichese that I want to apply tests to and in order to do that I need to parse the EVA for that text from the transcription into something else that I find suitable that I can then apply the tests to. I just don’t think that I will find the frequency count for EVA-h very helpful, because I think it unlikely EVA-h acts as an independent character.
nickpelling on December 2, 2019 at 8:34 pm said:

Mark: you’re absolutely right, lots of recent researchers don’t look any further than EVA, and we’re all worse off as a result.

But if you don’t want frequency counts of EVA h, don’t count EVA transcription values, do your parsing pass first. Simples.
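
A minimal sketch of that ordering, parse first and count afterwards. The token inventory below is hypothetical (nobody has established the right one), and greedy longest-match is just one simple way of applying it.

```python
from collections import Counter

# Hypothetical multi-glyph tokens; anything else falls back to single EVA glyphs.
TOKENS = ["aiin", "ch", "sh", "qo", "ee", "dy"]

def parse_word(word, tokens=TOKENS):
    """Greedy longest-match parse of one EVA word into tokens."""
    ordered = sorted(tokens, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for tok in ordered:
            if word.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
        else:
            # No multi-glyph token starts here: emit a single EVA glyph.
            out.append(word[i])
            i += 1
    return out

def token_frequencies(text, tokens=TOKENS):
    """Count tokens (not raw EVA letters) across whitespace-separated words."""
    freq = Counter()
    for word in text.split():
        freq.update(parse_word(word, tokens))
    return freq

print(parse_word("qokeedy"))
print(token_frequencies("chedy shedy qokeedy"))
```

With ch and sh in the inventory, EVA h never shows up in the counts as a character in its own right.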
Mark Knowles on December 2, 2019 at 9:41 pm said:

Nick: Out of curiosity, how do you, yourself, parse EVA?

I can invent my own parsing and have been doing so, but it is definitely of interest to me how others have chosen to do it, even if I don’t necessarily choose to do it the same way. It is also worth knowing why others have chosen a specific parsing i.e. what were their reasons for their specific parsing decisions. They may have knowledge or ideas that I am not aware of that affected their parsing choices. I think it is an area where we don’t all necessarily have to agree to have productive exchange of ideas as to different people’s choices and perspectives.
nickpelling on December 2, 2019 at 10:04 pm said:

Mark: I have a number of overlapping ideas about how Voynichese should be parsed, but no way of rigorously evaluating them.

For example:
* I suspect that ckh (etc) should be parsed as k then ch. But I might be wrong.
* I suspect that aiin (etc) should be parsed as a single token. But I might be wrong.
* I suspect that ol al or ar qo am should all be parsed as single tokens. But I might be wrong.
* I suspect that ok ot op of might all be separate tokens (as opposed to o+k etc). But I might be wrong.
* And so on.

Cheers, Nick
Byron Deveson on December 2, 2019 at 11:54 pm said:

At the risk of looking foolish (a probable certainty) I think that it may be feasible to computer-generate all the possible combinations of parsings of a page of VM and then throw these against possible blocks of text. Barbaric, yes, but I think it might work and I don’t see the reading of the VM as the end of the line. I feel that reading the VM could be the start of things rather than an end.
I am assuming that the number of possible parsings is manageable. Do you have any feel for the number of parsings that may be involved? I realise that the number of possible combinations is factorial but I feel that if, say, the combinations of the ten most likely tokens were used then that might be enough to yield a partial solution that could then be refined. I assume that each solution could be checked to see if it rated as a better fit to natural language and this could be used to “hill climb” to a better solution.
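
To get a feel for the numbers, here is a small sketch that enumerates every way of splitting a single word, given a list of candidate multi-glyph tokens. The candidate list is hypothetical. If a schema is taken to be simply a choice of which of n candidate tokens to merge, there are 2^n such choices (1024 for ten candidates), before considering how overlaps such as qo versus ok get resolved.

```python
from functools import lru_cache

# Hypothetical candidate tokens (all longer than one glyph).
CANDIDATES = ("qo", "ok", "ol", "al", "or", "ar", "aiin", "ch", "ee")

def all_parsings(word, candidates=CANDIDATES):
    """Return every split of word into single glyphs and candidate tokens."""

    @lru_cache(maxsize=None)
    def split_from(i):
        if i == len(word):
            return ((),)
        # At each position: take one glyph, or any candidate that starts here.
        options = [word[i]] + [t for t in candidates if word.startswith(t, i)]
        results = []
        for tok in options:
            for rest in split_from(i + len(tok)):
                results.append((tok,) + rest)
        return tuple(results)

    return [list(p) for p in split_from(0)]

for parsing in all_parsings("qokeedy"):
    print(parsing)
```

Even with this short list, qokeedy has several distinct parsings, and parsing every word instance independently would multiply those counts together across a page. Fixing one schema for the whole text keeps the search space far smaller.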
J.K. Petersen on December 3, 2019 at 5:37 am said:

Mark Knowles wrote: “I can invent my own parsing and have been doing so, but it is definitely of interest to me how others have chosen to do it, even if I don’t necessarily choose to do it the same way….”

Within minutes of first seeing them, I knew that existing transcription tools were not accurate enough or flexible enough for my needs, and I noticed right away that some of the transcripts made it difficult to search for patterns that break over lines or which are at the beginnings of blocks, etc., along with other problems related to analyzing the text. I took these into consideration when designing my own transcription systems (all of which I began in the hard-to-read B&W photocopy-version-of-the-VMS days). As of the second transcript, mine are also color-coded so that common patterns and repetitions are clearly visible when you glance at each page.

So I have my own transcripts and also my own fonts that I created sometime in 2008 and 2009 (from that time on I have been working on refinements and other variants). I tried as hard as possible to map the VMS glyphs to shape-mates. This is a difficult task, since I was ALSO trying to design it so Voynichese could be combined with normal typing (in English, French, German, Italian, and Spanish) without changing the keyboard map (very challenging since these languages use characters in the ALT register that I needed for the VMS chars and many ALT chars are used for common symbols).

I have been moderately successful and might release this font (which I call BEVA) because there’s a real need for something that can be used to write VMS glyphs without constantly copying/pasting or changing keyboard-maps. I don’t see this as a replacement for EVA. It’s simply something I wanted and that I’m willing to share.

.
But getting back to transcripts and parsing the text…

It is not a simple task to create an accurate transliteration of the VMS text. There are many subjective decision-making points. In existing transcripts, there are combinations of the main shapes and repetitive letter-patterns that are not even acknowledged. There are clear linguistic assumptions in some of them which I think are best avoided. How can you do accurate computational attacks if glyph-variations are not even recorded?

I know that subjective interpretations are inevitable. For example, are some shapes ligatures? Are the half-spaces meaningful? Is the shape that resembles “a” a letter or a combination of c + i? I think the answers to SOME of these questions is in the text but I don’t see consideration of these things in most transcripts. We are fortunate to have many pages, and there are clues in how it is written—I think these should be given more attention.

.
This subject is really too big for a blog comment, so I’ll leave it at that. If you don’t know what you are transcribing, certain details will be difficult to distinguish or record accurately. We don’t even know if the glyphs themselves are the informational portion of the message or if we should be “reading” some other aspect of the text, like the distances between tokens or the heights of the glyphs.
Mark Knowles on December 3, 2019 at 1:20 pm said:

When one produces a transcript there is inevitably a loss of information from the original text in the Voynich manuscript. One inevitably decides which details are relevant and which not. Do the slight colour variations of the text have a significance? Do the varying angles of the text have a significance? Should the variations in the cracking ink that Newbold studied be included in the transcript? We normally assume that these details are just a function of natural human variation and imprecision in writing and the natural limitations of the writing tool. I tend to concur with these assumptions.

As pointed out there are variations in how shapes are drawn and we have to determine if these variations are meaningful and so constitute distinct characters or whether the variations are natural variations. We have to decide what are spaces and what are not; an issue in my own work that I have been taking very seriously.

Then further down the line we make parsing decisions as discussed.

In every step of the process from the original Voynich text to the parsed text there is scope for making errors, so ultimately it is down to one’s own judgement. These are judgements that we have to make in order to perform statistical and other analyses of the text.

If one wants not to make any decisions and risk the loss of information then one has to work with the original text in the Voynich manuscript, but that really removes the possibility for so many of the kinds of analysis that we want to make.

That is why I proposed 3 forms of representation:

1) Stroke representation alphabet with the flexibility to include more of the detail and varied interpretation of the text.

2) EVA for straightforward communication

3) Parsed or Glyph or Character representation alphabets, which would be the text in the final parsed form. Different people would have different forms of this alphabet depending on how they choose to parse the text, with these alphabets kept the same or as similar as possible where the parsings are the same. So if, for example, two people parsed “ch” the same way then they might as well conform to the same character(s) for this parsing even if for other parsings they differ, where this is possible. So one could have something like PEVA1, PEVA2 etc. for the different ways of parsing. Then one person could say that I am using PEVA3 to parse and then everyone else would know what they are doing.

Where there is overlap then the stroke transcription, EVA and the PEVAs should be the same where possible. So it could make sense for the stroke transcription to use “o” for the same shape that EVA uses “o” and some PEVAs might do the same.

I just find EVA on its own pretty inadequate.
Mark Knowles on December 3, 2019 at 1:21 pm said:

Nick: Thanks for your parsing ideas.
Mark Knowles on December 3, 2019 at 1:54 pm said:

JKP: How do you, yourself, parse Voynichese?

I am trying to build up an idea of different ways of parsing the text to help me finalise how I intend to parse it before I can perform the tests that I want to.

Nick has kindly given his own ideas, but more other ideas I think would be valuable.
Mark Knowles on December 3, 2019 at 2:05 pm said:

One parsing paradigm that I like, though I can see in areas it could be problematic, is that every joined up entity is parsed as a single character, so “ch”, “qo” and “cPh” would be one character, but “as” would be two characters as they are unjoined, as far as I have seen. Now I can see that in the case where we have “iii” they appear to operate as a single entity, but do not appear to be joined up to one another. Despite this I feel that the joined up paradigm with some modification is one that I like, though I may change my opinion on this.
nickpelling on December 3, 2019 at 3:45 pm said:

Mark: the bit you’re missing is this – that despite the fact EVA is capturing ~90% of Voynichese, why hasn’t that been enough for us to make any significant progress with working out what it is?

Putting in an extraordinary amount of secondary effort to capture, say, half of the remainder into some kind of putative ‘SuperEVA+’ seems to me like a largely unproductive use of research time.

I’m sure we already have enough with the transcriptions we have, what is missing is a quite different set of work.
Mark Knowles on December 3, 2019 at 4:18 pm said:

Nick: I don’t think they are mutually exclusive.

I have my own working theory as to what is going on with Voynichese, though I still need to do more work on this; this is why I am now facing the problem of parsing. Without having a parsing that makes sense to me the statistical tests I want to carry out might not yield interesting results. Using my example, if I were to parse EVA-h as “h” then I doubt statistical tests with respect to it will help me find what I am looking for.

Having a satisfactory parsing is really a precursor to the next step of doing the work of figuring out what’s going on. Other people must have considered these parsing decisions and come to their own conclusions. I just think a way of collating the different popular groups of parsing that different people use would certainly make it easier for people like me in future to choose a way of parsing that they prefer and share results knowing that people are parsing the same way or not. In addition it means that parsing decisions can be discussed and justified. At the moment it seems down to individuals to enquire as to what each other person is doing. Making up one’s own way of producing a parsing approach feels like reinventing the wheel rather than referring to a standard set of different approaches. I probably will just ask around, so I make a more informed decision as to my parsing approach, but I don’t think that is the ideal way of going about things.
nickpelling on December 3, 2019 at 6:05 pm said:

Mark: for me, statistical tests should be devised to first work out which parsing is correct, because nobody yet knows. Making a best guess at a parsing schema so that you can then run statistical tests seems a bit back-to-front to me. 🙁
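
One way to make this concrete is to compute the same statistic under several candidate parsings and compare the results. The sketch below uses the conditional entropy of adjacent tokens purely as an example statistic: that choice is an assumption of the sketch, not a test proposed in this thread, and the sample words and the two toy parsings are hypothetical. Because different parsings have different inventory sizes, the raw numbers would need careful interpretation.

```python
import math
from collections import Counter

def conditional_entropy(sequences):
    """H(next token | current token) in bits, over adjacent pairs within words."""
    pair_counts = Counter()
    first_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    total = sum(pair_counts.values())
    h = 0.0
    for (a, b), n in pair_counts.items():
        # p(a, b) * log2 p(b | a), summed over every observed pair
        h -= (n / total) * math.log2(n / first_counts[a])
    return h

words = ["chedy", "shedy", "chey"]

# Parsing 1: unparsed EVA, one token per letter.
unparsed = [list(w) for w in words]
# Parsing 2: ch and sh merged (C and S are stand-in symbols for the merged tokens).
merged = [list(w.replace("ch", "C").replace("sh", "S")) for w in words]

print("unparsed:", conditional_entropy(unparsed))
print("merged:  ", conditional_entropy(merged))
```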
Mark Knowles on December 3, 2019 at 7:08 pm said:

Nick: That is a good and fair point. However, how can one start performing statistical tests without a parsing to work with? One can certainly start with one parsing and then as data from your tests comes in, change the parsing that one uses in response to the results from those tests. I don’t think one can realistically do tests with all possible parsings and determine which is the most appropriate on that basis, as that sounds like a mammoth task. I think one should start with what seems to oneself the most plausible parsing and then be open to modifying it if that makes sense.

Also I don’t yet know how other people are parsing in order to perform their tests. Are they using unparsed EVA, because that is in fact also a parsing of the raw text?

It could be argued that t, p, k, f are each two characters, but EVA parses them as one character, so whatever you use you are parsing. I think from what you said that you would parse “cph” as two characters whereas my inclination is to parse it as one, though I am open to changing my mind.

How would you start to carry out any tests without a parsing to work from?

It seems a tricky issue, but your point makes sense.
nickpelling on December 3, 2019 at 7:31 pm said:

Mark: you must admit that plausibility hasn’t got us very far, right?

So my overall point is that I think people are far too quick to draw all manner of wobbly conclusions from statistical tests based purely on some presumed or guessed parsing schema. Which is a bit pants.
Mark Knowles on December 3, 2019 at 9:04 pm said:

Nick: I agree, but I am not sure that we should start with the implausible, though one should be open to it, as sometimes what seems implausible can turn out to be true.

Nevertheless, I feel I have to start with what seems the most plausible to me, but on the basis of absorbing other people’s ideas. I can certainly try other variations of my parsing if I find my initial parsing lacking.

So are we saying that we need statistical tests to decide the parsing and a parsing to perform the statistical tests; which comes first chicken or egg? This is a situation which could lead to complete inertia.

I think if people declare explicitly I have performed these tests with this specific parsing then one can easily evaluate their results on that basis. It just feels like there isn’t much openness about the parsings used, which is the real problem. That’s why I suggested a variety of standard parsing people tend to use say: PEVA1, PEVA2, PEVA2A etc. (The terminology doesn’t matter) Then someone can say that this test was carried out using PEVA4B. I imagine in the end there will be a small number of distinct parsings people tend to use as some will drop out of usage. One would expect the parsings that become the most popular will be informed by the test results of the different parsings. Then someone like me can look at the standard parsings and decide which one I will adopt, without having to reinvent the wheel from scratch.
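
As a sketch of how named parsings could be declared and shared, the Python below keeps each scheme as a small record and applies it with a regular expression. The scheme names follow the PEVA1 / PEVA2 suggestion above, but their contents are purely hypothetical placeholders, not anyone's actual parsing.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsingScheme:
    name: str
    tokens: tuple   # multi-glyph units; everything else stays a single EVA glyph
    note: str = ""

    def parse(self, word):
        """Split one EVA word into tokens, preferring the longest listed unit."""
        ordered = sorted(self.tokens, key=len, reverse=True)
        alternatives = [re.escape(t) for t in ordered] + ["."]
        return re.findall("|".join(alternatives), word)

# Hypothetical registry: the names are from the thread, the contents are made up.
SCHEMES = {
    "PEVA1": ParsingScheme("PEVA1", (), "unparsed EVA, one token per letter"),
    "PEVA2": ParsingScheme("PEVA2", ("ch", "sh", "cth", "ckh", "cph", "cfh"),
                           "benches and benched gallows as single tokens"),
}

for name, scheme in SCHEMES.items():
    print(name, scheme.parse("chckhy"))
```

A test result could then be reported together with the scheme name, and anyone else could rerun it under a different scheme by changing a single key.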
Rene Zandbergen on December 3, 2019 at 9:24 pm said:

There is a lot more to Eva than just the capability to use it in effective communication. I don’t mean that it is much easier to learn and makes it much easier to remember Voynich words just from their transcription.

One of the main advantages over the previous systems is that it allows one to represent much more of the MS than e.g. the Currier or FSG systems.

One may argue whether one represents ckh as one, two, three or four characters, as is the case with Currier, FSG, Eva and Frogguy respectively. However, only Frogguy and Eva can represent the less frequent combinations like ck or ckhh.
As soon as one combines such forms into single characters, one ends up with a very long list of different characters. Currier is already at 36 and does not handle these cases. GC’s v101 does, and he is around 200, IIRC.

Experimenting with different parsings is useful, even necessary I would say, and it is inevitably an iterative procedure. I have done this. For statistical tests I prefer to use “Cuva” which is explained somewhere at my web site, and things I called Reva and Reva2 which are not there. When doing this, one necessarily loses a lot of the smaller distinctions, including the rarer characters.

However, the word structure that is so typical for the Voynich text remains. It is as big an obstacle for making sense of the MS as before.
nickpelling on December 3, 2019 at 10:01 pm said:

Rene: from my perspective, EVA has been a huge success in just about every practical sense. The only bit that hasn’t happened is the part where people put aside their differences to work together to determine the actual parsing schema for Voynichese.

Unfortunately that, I think, is the necessary step that needs to be taken before you can be sure that any other statistical test you carry out is producing signal rather than noise.

All of which, I suspect, forms a sizeable part of the reason why things like Voynichese’s word structure remain a frustrating mystery. 🙁
Peter M. on December 3, 2019 at 11:32 pm said:

What I need to work sensibly with.
Word =( single character capture VM ) = ( implementation EVA for the PC ) = ( view normal ABC ).
If I have to think in EVA, I am faster with paper and pencil.
I don’t need to re-evaluate the whole manuscript for EVA, I think some pages are enough.
2-3 Botany, Astrology, Nymphs, Prescription part. about 12-15 pages.
Why should I count single characters ? I don’t really know what they mean.
But what I know is that some characters only appear in the back. This is not a normal behaviour.
Why not count the endings and compare them with normal applications ? I did it, and got a little smarter.
Example EVA ch / sh
Is sh another character ch, or is there a combination to ch = sch ?
I take (ch = en) and set an ending ( iin = tis ) en-tis.
sh = ven – tis = ventis, 8a – iin = to (a) tis = totis.
ch – 89, entum, entus ? ventum ventus ? totum, totus ?

And if they did not die, then they still work with EVA today. 🙂
Mark Knowles on December 4, 2019 at 2:04 pm said:

Rene: I definitely think EVA has the advantage that it is almost a stroke transcription (excluding t,p,k,f) which gives it a degree of flexibility in representation, as it has the capacity to represent the constituent parts of a wide variety of shapes. I would prefer a fuller commitment to the stroke transcription, but I can see that this would impair the pronounceability of words and so have one negative impact.

Nevertheless one cannot doubt its success at bringing about a unifying transcription rather than there being a variety of disparate competing ones.

However I think the problem is the assumption that one must have either EVA or something else rather than EVA and something else. I think there is a place for a low level stroke transcription and a high level glyph transcription(s) sitting side by side and serving different purposes. I think deciding between whether EVA or Currier is the alphabet to use is a false choice: why not EVA and a high level Currier style alphabet? I could liken it very loosely to programming languages where one doesn’t argue that there should be only 1 programming language that is used in all situations, but rather the idea that one can have more than 1 that serve different purposes. I am not, of course, arguing that we need as many forms of representation as there are programming languages, but I think more than 1 type of Voynichese representation would not go amiss.

As should be clear from my “diplomatic cipher” perspective there being more characters is not a concern of mine. So I don’t see that having a form of representation where there is a very long list of characters as a problem provided it is not the only form of representation.

I will look at your “Cuva” and “Reva1” and “Reva2” if you choose to make these 2 available.

It might be nice to compile a list of the parsings that different people choose to use.

I think otherwise the parsing process that people use can be very opaque, one from another. It would also help for people to agree on a small number of commonly used parsings and develop a dialogue as to the arguments for each given parsing.

I want to start with my preferred parsing and I think looking at other people’s parsings would be a good place to start. I have my own ideas, of course, but I don’t want to have to completely reinvent the wheel if I don’t have to. In order to do the analysis that I want to I have to start with a preferred parsing and I am pretty sure EVA in its totality isn’t that for my requirements.
J.K. Petersen on December 4, 2019 at 6:58 pm said:

Mark Knowles wrote: “It might be nice to compile a list of the parsings that different people choose to use.”

Do you mean it would be nice to compile a list of the transcription alphabets that people choose to use (e.g., EVA)?

Or do you mean it would be nice to compile a list of the transcriptions that people choose to use (e.g., Takahashi’s transcription)?

When you say “parsings” what that usually means is how individual “syllables” within Voynich strings are broken up, which is quite a different subject. Parsing is usually about the analysis of strings and sometimes creating breakpoints between the different components of each string.
Mark Knowles on December 4, 2019 at 9:18 pm said:

JKP: As I have said before on occasions I don’t want to get into a long discussion about semantics.

You say: “Do you mean it would be nice to compile a list of the transcription alphabets that people choose to use (e.g., EVA)?”

No, not explicitly, though certainly that is part of it. Some people may parse the text without putting it into a specific public alphabet. Currier has produced a specific alphabet, but, for example, Nick has suggested a variety of options for how the text may be parsed without specifying an alphabet into which they will be parsed. Ultimately, one necessarily has to parse into an alphabet of one’s choice in order to carry out tests. However, I think people doing that often don’t make those alphabets clear or public.

No, I am not talking about different transcriptions (ideally one should have one stroke transcription) but about different parsings, though there could be loss of information through the parsing process in certain instances, e.g. if someone decides to parse “a” and “o” as the same character.
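To make the point about information loss concrete, here is a minimal sketch. The merged table is purely hypothetical (it just encodes the “a”/“o” example above), and the word list reuses the pairs “ol al or ar” mentioned elsewhere in this thread; the function names are my own.

```python
# Hypothetical parsing choice: "a" and "o" are treated as the same character.
MERGED = {"a": "A", "o": "A"}


def apply_table(word, table):
    """Map each character through the table, leaving unmapped characters as-is."""
    return "".join(table.get(ch, ch) for ch in word)


def collisions(words, table):
    """Group distinct input words that become identical after mapping."""
    groups = {}
    for word in words:
        groups.setdefault(apply_table(word, table), set()).add(word)
    return {out: ins for out, ins in groups.items() if len(ins) > 1}


if __name__ == "__main__":
    # Any group printed here can no longer be told apart after parsing.
    print(collisions(["ol", "al", "or", "ar", "qo"], MERGED))
```

Any group the function reports is a set of strings that cannot be recovered from the parsed text, which is why keeping the stroke transcription alongside the parsed one seems sensible.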

I think the way I use the verb is consistent with the definition, but if you have an alternative verb that you think I should use instead then feel free to suggest it. Anyway, I think I have made it pretty clear on this page and on Voynich Ninja what I am talking about. I am a bit concerned about getting dragged into a long conversation about terminology and losing sight of the original subject I am trying to address.
nickpelling on December 4, 2019 at 9:23 pm said:

Mark: if you want to decompose EVA gallows into pairs of smaller characters, you can. (A number of people have discussed radically stroke-based ways of looking at and interpreting Voynichese, so this is not a completely new angle.)
Mark Knowles on December 4, 2019 at 9:36 pm said:

Nick: I think there is a case for a strongly stroke-based alphabet in addition to others. I am not saying that I believe individual strokes are the constituent parts of the underlying language, rather that there is a case for that form of representation. It does seem incongruous to me that “p” is 1 character and “ch” 2. I wasn’t suggesting this was a new idea, merely that I am not familiar with a popular, purely stroke-based Voynich transcription alphabet. I think EVA is quite good as a form of stroke transcription, but I think it loses its way a bit on the gallows characters.
Mark Knowles on December 4, 2019 at 9:41 pm said:

Nick: I think one thing that I like about decomposing EVA gallows into 2 characters is my conviction that the “q” in “qo” is the left side of “t” and “p”, i.e. in both instances we have geometrically a ‘4’ shape, which really means a right-angled triangle. I think it can be proven empirically that they share the same shape.
Mark Knowles on December 4, 2019 at 9:51 pm said:

Nick: You say:

“* I suspect that aiin (etc) should be parsed as a single token. But I might be wrong.”

Yeah, that is one I am very unsure on, though I haven’t studied it much. What to do with the multiple “i” is definitely an area where other people’s ideas are very interesting.

“* I suspect that ol al or ar qo am should all be parsed as single tokens. But I might be wrong.”

I am pretty sure “qo” should be and I have some sympathy with your perspective on the others.

“* I suspect that ok ot op of might all be separate tokens (as opposed to o+k etc). But I might be wrong.”

I personally, am not at this stage remotely close to considering them as anything other than separate.

When it comes to the likes of “cph”, despite its size I am inclined to view it as one.

As I said, I am open to changing my ideas on all this if I find my tests point in that direction.
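For anyone wanting to see how such choices interact, here is a rough sketch that runs one EVA word through two alternative token inventories using a greedy longest-match, left-to-right tokenizer. The inventories and their labels are hypothetical and are not meant to represent any particular researcher’s full parsing; the greedy strategy is also only one assumption among several possible ones.

```python
def tokenize(word, inventory):
    """Greedy longest-match tokenizer; unmatched characters become single tokens."""
    ordered = sorted(inventory, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        tok = next((t for t in ordered if word.startswith(t, i)), word[i])
        tokens.append(tok)
        i += len(tok)
    return tokens


# Hypothetical inventories, labelled only by the choice they illustrate.
PARSINGS = {
    "qo_as_token": {"qo", "cph"},
    "ok_ot_as_tokens": {"ok", "ot", "op", "of", "cph"},
}


def compare(word, parsings=PARSINGS):
    """Return the token sequence each inventory produces for the same word."""
    return {name: tokenize(word, inv) for name, inv in parsings.items()}


if __name__ == "__main__":
    for name, tokens in compare("qokeedy").items():
        print(name, tokens)
```

With a word like “qokeedy” the two inventories pull the “o” in different directions, so a decision about “qo” is not independent of a decision about “ok”.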
nickpelling on December 4, 2019 at 10:06 pm said:

Mark: run your tests, and see where they all lead. The one thing you won’t find poking its head over the parapet – as the Friedmans had worked out half a century ago – is a straightforward natural language, so please try to devise a better success metric than “looks like a natural language”. :-/
Mark Knowles on December 4, 2019 at 10:40 pm said:

Nick: Well, given that I am working with a very specific small subset of Voynichese, I am open to starting with the simplest explanation, namely a one-to-one correspondence, i.e. “a natural language”, but until I do these tests I cannot say anything about what I expect to find. Given my notion that a large portion of the text of the Voynich is meaningless null filler text, it is hard to say what kinds of patterns I would hope to find with the rest of the text. This hypothesis could certainly be flawed, though it doesn’t appear to be one that anyone else has explored in any depth.

Unfortunately, I need to make a few more parsing decisions before I can start, so I will have to get a better idea of what other people are doing so that I can have more confidence in these. Whilst I very much like many of your parsing suggestions, I want to see what others are doing, so that I can at least reach an overall opinion, where possible, based on other people’s experiences. I will look into Rene’s CUVA and see if I can get any specific thoughts from the Ninja people. I think that if I start with poor parsing decisions it could inhibit my progress, so whilst I can’t be sure I have the correct parsing, I can do my best.
Mark Knowles on December 4, 2019 at 10:51 pm said:

Nick: I have looked at Rene’s CUVA, but I am not sure it is right for me. Looking again at Currier it fits with my instincts, though I will need to track down the full alphabet to see what I think.
J.K. Petersen on December 5, 2019 at 5:06 am said:

Mark, no I am not “arguing semantics”, I am trying to understand what the heck you are talking about, but after reading all your posts again, I think I finally understand.

You don’t mean “parsing EVA” (EVA is already parsed by those who invented it), you mean “parsing Voynichese glyphs” [into plaintext].

Voynichese glyphs and EVA are two different things. If you do your own breakdown for the glyph components, then you can give it your own name because it has nothing to do with EVA.
nickpelling on December 5, 2019 at 8:43 am said:

JKP: parsing turns a stream of signals into a stream of semantic tokens. So it is entirely valid to talk about parsing both Voynichese and EVA into tokens.
J.K. Petersen on December 5, 2019 at 10:02 am said:

Yes, I’m aware of that, Nick, but his specific question (on the forum, where he started a thread on this) was “How are people parsing EVA?”…

…except none of his follow-up posts were about parsing EVA (transcriptions), they were all about the breakdown of individual VMS glyphs into one, two, three or more pieces, which means he wasn’t actually asking about parsing EVA. He was asking how people formulate a transcription alphabet.

After reading every post on the forum and here twice, I finally realized he was asking about “parsing” Voynichese glyphs (breaking up the glyphs, mapping the glyphs) using his own system, which means he would be creating an alternative to EVA, not “parsing EVA” (EVA has already been mapped to Voynichese, it’s not something we do, it’s been done).

If I am reading all the posts correctly, what he is asking is, “How do people parse Voynichese glyphs?” (presumably into a transcription system/alphabet).
Mark Knowles on December 5, 2019 at 10:54 am said:

JKP: Generally I am open to whatever vocabulary that the general consensus is, as I am really interested in addressing the underlying question not whether one uses this term or that term. I am happy to use the term slicing, mapping, projecting, breaking up, assigning etc. as long as everyone understands what I am talking about. I have little interest in terminology, but rather have an interest in the underlying question expressed in the terminology. I hate long semantic discussions, because it distracts from the topic that I want to explore.

I used the term “parsing” as I noticed Nick using it and it seemed an eminently sensible usage given my experience of how people use that term in computer programming, and I couldn’t/can’t think of a less ambiguous term. Any term is going to have a degree of ambiguity as there is no very specific vocabulary in this instance designed for usage with Voynichese.

I have tried to absorb the vocabulary that Voynicheros tend to use, so as to be as clear as possible, though I am sure that there is room for improvement.
nickpelling on December 5, 2019 at 3:05 pm said:

JKP: ah, I understand (I don’t spend a lot of time chez ninja), thanks.

I’d agree that Mark seems to be confusing transcription issues (which definitely aren’t parsing) with parsing issues over there. Parsing is what you do with a transcription, not what you do to create a transcription.
Mark Knowles on December 5, 2019 at 5:04 pm said:

Nick: I think you misunderstand what I am saying. I think one starts with an initial transcription produced by a manual process of something like image recognition on the base scans of the Voynich into text in an alphabet that one has decided is the most appropriate to transcribe into; an alphabet which should be chosen to well represent the kind of shapes that one sees on the page.

Then one can parse that transcription from the initial alphabet (such as EVA) into another alphabet (such as Currier). This means that we now have a transcription written in the Currier alphabet. In theory one could parse that again to produce another transcription in another alphabet, or even parse it back into EVA.

So taking an example part of an EVA transcription:

“qochcphd al”

Parsing into Currier we have:

“4OSW8 AE”

This continues part of a transcription on the Voynich is the Currier alphabet.

So I would say that parsing can be what you do to create a transcription.
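As a sketch of the EVA-to-Currier step in the example above: the table below contains only the seven correspondences implied by that one example (“qochcphd al” to “4OSW8 AE”) and is not the full Currier alphabet. The greedy longest-match strategy and the function names are assumptions made for illustration.

```python
# Correspondences inferred solely from the single example in this comment.
# This is NOT the complete Currier alphabet.
EVA_TO_CURRIER = {
    "cph": "W",
    "ch": "S",
    "q": "4",
    "o": "O",
    "d": "8",
    "a": "A",
    "l": "E",
}


def eva_to_currier(word, table=EVA_TO_CURRIER):
    """Convert one EVA word, always trying the longest EVA sequence first."""
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # Refuse to guess at anything the table does not cover.
            raise ValueError(f"no mapping for {word[i:]!r}")
    return "".join(out)


def convert_line(line, table=EVA_TO_CURRIER):
    """Convert a space-separated line of EVA words."""
    return " ".join(eva_to_currier(word, table) for word in line.split())


if __name__ == "__main__":
    print(convert_line("qochcphd al"))
```

Trying longer sequences first is what lets “cph” and “ch” come out as single characters rather than being split into “c”, “p” and “h”.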
Mark Knowles on December 5, 2019 at 5:09 pm said:

CORRECTION

*This continues part of a transcription on the Voynich is the Currier alphabet.

->

This is then part of a transcription in the Currier alphabet.

One could have bypassed EVA and produced a transcription in Currier directly from the images of the Voynich, if one preferred. However given the main transcription is in EVA it makes sense to parse that transcription into a transcription in one’s preferred alphabet.
Mark Knowles on December 5, 2019 at 5:24 pm said:

The need to parse an EVA transcript into a transcript in another alphabet arises because I want to do simple tests like frequency tests. However, frequency tests on EVA-i and EVA-h seem unlikely to be useful, as I doubt that they correspond to underlying components of the language, so I want to parse the EVA transcription into an alphabet which I think better corresponds with the underlying components of the language (I think some might call these glyphs). Now I might modify my alphabet over time as my thoughts about the structure of the language develop, or as I experiment with different structures which I hope may better correspond with the true underlying components of the language (I believe Nick calls these components tokens, but any term will do). I just want to start in a good place with my best guess at what the underlying structure may be, based on my own thoughts and the thoughts of others.
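A small sketch of the difference this makes to a frequency test: the same text is counted once per EVA letter and once per token. The token inventory is hypothetical (it borrows the [qo][k][ee][dy] example from the post plus “aiin”), the sample words are ones quoted in the post, and the greedy tokenizer is an assumption.

```python
from collections import Counter

# Hypothetical token inventory, for illustration only.
TOKENS = ["aiin", "cph", "qo", "ch", "ee", "dy"]


def tokenize(word, tokens):
    """Greedy longest-match tokenizer; unmatched characters become single tokens."""
    ordered = sorted(tokens, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        tok = next((t for t in ordered if word.startswith(t, i)), word[i])
        out.append(tok)
        i += len(tok)
    return out


def frequencies(text, tokens=None):
    """Count raw EVA letters when tokens is None, otherwise count parsed tokens."""
    counts = Counter()
    for word in text.split():
        counts.update(list(word) if tokens is None else tokenize(word, tokens))
    return counts


if __name__ == "__main__":
    sample = "qokeedy daiin qotedy daiin"
    print(frequencies(sample).most_common())
    print(frequencies(sample, TOKENS).most_common())
```

In the letter-level count “i” shows up as a symbol in its own right, whereas in the token-level count it disappears into “aiin”, which is the kind of shift in the statistics that a change of alphabet brings about.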
Mark Knowles on December 5, 2019 at 5:38 pm said:

My notion is that one starts with a low-level transcription produced directly from the Voynich manuscript in a stroke-based alphabet. This could be EVA or something even more stroke-based. Then one parses that into a higher-level transcription in an alphabet which better represents the true units/components/tokens/characters/symbols/glyphs of the language (or whichever other term people want to use). Then one applies tests to the resultant transcription.

I just need to finalise my idea of the higher-level transcription alphabet that I plan to start with. I like Currier though I think “qo” should be one character/token not two. I like Nick’s suggestions, but I think “cph” should be one character not two. I haven’t yet found a copy of the full Currier alphabet and I need to enquire about other parsings, so that I can feel a bit more confident about my way of parsing. This is what I am trying to do and why I am making a fuss about this.
J.K. Petersen on December 5, 2019 at 10:04 pm said:

Nick Pelling wrote: “Parsing is what you do with a transcription, not what you do to create a transcription.”

Nick, you are better than I at succinct and cogent summations.
nickpelling on December 5, 2019 at 11:37 pm said:

Mark: the point of EVA is to enable people to generate as many experimental personal transcriptions as they like. But like Spotify playlists, please don’t expect anyone else to be hugely interested in yours. :-/
Mark Knowles on December 6, 2019 at 12:13 pm said:

Nick: Well, I don’t have a Spotify playlist. I don’t expect other people to be interested in how I parse Voynichese, but rather, as I have made clear, I am much more interested in how other people are parsing Voynichese, so as to help me better decide what my parsing decisions should be. It should be evident that I am not trying to persuade other people that they should parse Voynichese the way I choose to. For my research I just want to get a clearer idea of which parsing decisions are better and which are worse, so I am keen to compare different people’s parsings. For example, I am still not at all confident as to what to do with the “i”s and associated letters, but I am sure they need to be parsed in some way.
J.K. Petersen on December 7, 2019 at 12:46 am said:

The main difference between the various transcription alphabets (and there are relatively few) is how much detail they go into in terms of mapping individual parts of glyphs that might be interpreted as ligatures or abbreviations.

I don’t know if studying the alphabets themselves is going to help you that much (other than familiarizing yourself with the general concepts), because researchers haven’t reached any agreement on how to break down the glyphs.

Some transcription systems interpret complex glyphs as one glyph, others as more than one (I have several transcripts for this very reason). This is an individual choice based on study of the overall text.