USC’s irrepressible Kevin Knight and Dartmouth College Neukom Fellow Sravana Reddy will be giving a talk at Stanford on 13th March 2013 entitled “What We Know About the Voynich Manuscript“. Errm… which does sound uncannily like the (2010/2011) paper by the same two people called, errrm, let me see now, ah yes, “What We Know About the Voynich Manuscript“.
Obviously, it’s a title they like. 🙂
As I said to Klaus Schmeh at the Voynich pub meet (more on that another time), what really annoys me when statisticians apply their box of analytical tricks to the Voynich is that they almost always assume that whatever transcription they have to hand will be good enough. However, I strongly believe that the biggest problem we face precedes cryptanalysis – in short, we can’t yet parse what we’re seeing well enough to run genuinely useful statistical tests. That is, not only am I doubtful of the transcriptions themselves, I’m also very doubtful about how people sequentially step through them, assuming that the order they see in the transcription of the ciphertext is precisely the same order used in the plaintext.
So, it’s not even as if I’m particularly critical of the fact that Knight and Reddy are relying on an unbelievably outdated and clunky transcription (which they certainly were in 2010/2011), because my point would still stand regardless of whichever transcription they were using.
In fact, I’d say that the single biggest wall of naivety I run into when trying to discuss Voynichese with people who really should know better, is that hardly anyone grasps that the presence of steganography in the cipher system mix would throw a spanner (if not a whole box of spanners) in pretty much any neatly-constructed analytical machinery. Mis-parsing the text, whether in the transcription (of the shapes) and/or in the serialization (of the order of the instances), is a mistake you may well not be able to subsequently undo, however smart you are. You’re kind of folding outer noise into the inner signal, irrevocably mixing the covertext into the ciphertext.
Doubtless plenty of clever people are reading this and thinking that they’re far too smart to fall into such a simple trap, and that the devious stats genies they’ve relied on their whole professional lives will be able to fix up any such problem. Well, perhaps if I listed a whole load of places where I’m pretty sure I can see this happening, you’ll see the extent of the challenge you face when trying to parse Voynichese. Here goes…
(1) Space transposition cipher
Knight and Reddy are far from the first people to try to analyze Voynichese word lengths. However, this assumes that all spaces are genuine – that we’re looking at what modern cryptogram solvers call an “aristocrat” cipher (i.e. with genuine word divisions) rather than a “patristocrat” (with no useful word divisions). But what if some spaces are genuine and some are not? I’ve presented a fair amount of evidence in the past that at least some Voynichese spaces are fake, and so I doubt the universal validity and usefulness of just about every aggregate word-size statistical test performed to date.
Moreover, even if most of them are genuine, how wide does a ciphertext space have to be to constitute a plaintext space? And how should you parse multiple-i blocks or multiple-e blocks, vis-a-vis word lengths? It’s a really contentious area; and so ‘just assuming’ that the transcription you have to hand will be good enough for your purposes is actually far too hopeful. Really, you need to be rather more skeptical about what you’re dealing with if you are to end up with valid results.
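To make the parsing problem concrete, here's a minimal Python sketch (using an invented toy line, not real Voynichese) of how the word-length distribution of the very same character stream shifts depending on which spaces you decide are genuine:

```python
from collections import Counter

# A toy transcription line (invented for illustration, not real Voynichese).
# '.' marks a confident space, ',' marks an uncertain "half-space".
line = "qokedy.qokedy,dal.qokedy,qokedy"

def word_lengths(text, space_chars):
    """Split on the given space characters and tally word lengths."""
    for ch in space_chars:
        text = text.replace(ch, ".")
    words = [w for w in text.split(".") if w]
    return Counter(len(w) for w in words)

# Treating only '.' as a word break, vs. treating ',' as one too,
# yields two quite different word-length distributions for the SAME glyphs.
print(word_lengths(line, "."))    # half-spaces ignored
print(word_lengths(line, ".,"))   # half-spaces treated as real spaces
```

Run any aggregate word-size test on top of the wrong choice here, and the statistics silently inherit the parsing error.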
(2) Deceptive first letters / vertical Neal keys
At the Voynich pub meet, Philip Neal announced an extremely neat result that I hadn’t previously noticed or heard of: that Voynichese words where the second letter is EVA ‘y’ (i.e. ‘9’) predominantly appear as the first word of a line. EVA ‘y’ occurs very often word-final, reasonably often word-initial (most notably in labels), but only rarely in the middle of a word, which makes this a troublesome result to account for in terms of straightforward ciphers.
And yet it sits extremely comfortably with the idea that the first letter of a line may be serving some other purpose – perhaps a null character, or (as both Philip and I have speculated, though admittedly he remains far less convinced than I am) a ‘vertical key’, i.e. a set of letters transposed from elsewhere in the line, paragraph or page, and moved there to remove “tells” from inside the main flow of the text.
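Philip's observation is easy to check against whatever transcription you trust. A hedged Python sketch (the lines below are invented placeholders, not real transcription data):

```python
from collections import Counter

# Invented placeholder lines in EVA-style lettering, purely to show the test;
# substitute lines from a real transcription to check Philip Neal's observation.
lines = [
    "tydaiin shol qokedy daiin",
    "oydar chol daiin qokeedy",
    "daiin qokedy shol tydaiin",
]

position_counts = Counter()
for line in lines:
    for pos, word in enumerate(line.split()):
        if len(word) >= 2 and word[1] == "y":
            # tally whether a second-letter-'y' word is line-initial or not
            position_counts["first" if pos == 0 else "later"] += 1

print(position_counts)
```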
(3) Horizontal Neal keys
Another very hard-to-explain observation that Philip Neal made some years ago is that many paragraphs contain a pair of matching gallows (typically single-leg gallows) about 2/3rds of the way across their topmost line: and that the Voynichese text between the pair often presents unusual patterns / characteristics. In fact, I’d suggest that “long” (stretched-out) single-leg gallows or “split” (extended) double-leg gallows could well be “cipher fossils”, other ways to delimit blocks of characters that were tried out in an early stage of the enciphering process, before the encipherer settled on the (far less visually obvious) trick of using pairs of single-leg gallows instead.
Incidentally, my strong suspicion remains that both horizontal and vertical Neal keys are the first “bundling-up” half of an on-page transposition cipher mechanism, and that the other “unbundling” half is formed by the double-leg gallows (EVA ‘t’ and ‘k’). That is to say, that tell-tale letters get moved from the text into horizontal and vertical key sequences, and replaced by EVA ‘t’ (probably horizontal key) or EVA ‘k’ (probably vertical key). I don’t claim to understand it 100%, but that would seem to be a pretty good stab at explaining at least some of the systematic oddness (such as “qokedy qokedy dal qokedy qokedy” etc) we do see.
Regardless of whether or not my hunch about this is right, transposition ciphers of precisely this kind of trickiness were loosely described by Alberti in his 1465 book (as part of his overall “literature review”), and I would argue that these ‘key’ sequences so closely resemble some kind of non-obvious transposition that you ignore them at your peril. Particularly if you’re running stats tests.
(4) Numbers hidden in aiv / aiiv / aiiiv scribal flourishes
This is a neat bit of Herbal-A steganography I noted in my 2006 book, which would require better scans to test properly (one day, one day). But if I’m right (and the actual value encoded in an ai[i][i]v group is entirely held in the scribal flourish of the ‘v’ (EVA ‘n’) at the end), then all the real content has been discarded during the transcription, and no amount of statistical processing will ever get that back, sorry. 🙁
(5) Continuation punctuation at end of line
As I noted last year, the use of the double-hyphen as a continuation punctuation character at the end of a line predated Gutenberg, and in fact was in use in the 13th century in France and much earlier in Hebrew manuscripts. And so there would seem to be ample reason to at least suspect that the EVA ‘am’ group we see at line-ends may well encipher such a double-hyphen. Yet even so, people continue to feed these line-ending curios into their stats, as if they were just the same as any other character. Maybe they are, but… maybe they aren’t.
Incidentally, if you analyze the average length of words in both Voynichese and printed works relative to their position on the line, you'll find (as Elmar Vogt did) that the first word in a line is often slightly longer than the others. There is a simple explanation for this in printed books: short words can often be squeezed onto the end of the preceding line.
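That comparison is easy to run yourself on any transcription; a minimal sketch (the lines below are invented placeholders, not real Voynichese):

```python
from collections import defaultdict

# Invented placeholder lines; substitute lines from a real transcription.
lines = [
    "tchedy okaiin shey qokal",
    "pchedar otedy dal shedy ol",
    "tshedy qokeedy chedy dar",
]

# position-in-line -> [total characters, word count]
totals = defaultdict(lambda: [0, 0])
for line in lines:
    for pos, word in enumerate(line.split()):
        totals[pos][0] += len(word)
        totals[pos][1] += 1

for pos in sorted(totals):
    chars, count = totals[pos]
    print(pos, chars / count)   # average word length at each line position
```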
(6) Shorthand tokens – abbreviation, truncation
Personally, I’ve long suspected that several Voynichese glyphs encipher the equivalent of scribal shorthand marks: in particular, that mid-word ‘8’ enciphers contraction (‘contractio’) and word-final ‘9’ enciphers truncation (‘truncatio’) [though ‘8’ and ‘9’ in other positions very likely have other meanings]. I think it’s extraordinarily hard to account for the way that mid-word ‘8’ and word-final ‘9’ work in terms of normal letters: and so I believe the presence of shorthand to be a very pragmatic hypothesis to help explain what’s going on with these glyphs.
But if I’m even slightly right, this would be an entirely different category of plaintext from that which researchers such as Knight and Reddy have focused upon most… hence many of their working assumptions (as evidenced by the discussion in the 2010/2011 paper) would be just wrong.
(7) Verbose cipher
I’ve also long believed that many pairs of Voynichese letters (al / ol / ar / or / ee / eee / ch, plus also o+gallows and y+gallows pairs) encipher a single plaintext letter. This is a cipher hack that recurs in many 15th century ciphers I’ve seen (and so is completely in accord with the radiocarbon dating), but which would throw a very large spanner both in vowel-consonant search algorithms and in Hidden Markov Models (HMMs), both of which almost always rely on a flat (and ‘stateful’) input text to produce meaningful results. If these kinds of assumptions fail to be true, the usefulness of many such clever analytical tools falls painfully close to zero.
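To see why this matters for statistics, here's a small Python sketch (with an invented group inventory, emphatically not a claimed solution): re-tokenizing under a verbose-cipher hypothesis changes both the token count and the symbol frequencies, which is exactly what entropy calculations and HMMs are sensitive to.

```python
import re

# An invented verbose-cipher group inventory, for illustration only --
# not a claimed solution. Longer groups are listed first so the regex
# alternation matches them before their shorter prefixes.
groups = ["eee", "ee", "al", "ol", "ar", "or", "ch", "qo"]
pattern = re.compile("|".join(groups) + "|.")

def tokenize(text):
    """Match verbose groups first, falling back to single characters."""
    return pattern.findall(text)

word = "qokolchedy"
print(tokenize(word))
print(len(word), len(tokenize(word)))   # 10 characters, but fewer tokens
```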
(8) Word-initial ‘4o’
Since writing my book, I’ve become reasonably convinced that the common ‘4o’ [EVA ‘qo’] pair may well be nothing more complex than a steganographic way of writing ‘lo’ (i.e. ‘the’ in Italian), and then concealing its (often cryptologically tell-tale) presence by eliding it with the start of the following word. Hence ‘qokedy’ would actually be an elided version of “qo kedy”.
Moreover, I’m pretty sure that the shape “4o” was used as a shorthand sign for “quaestio” in 14th century Italian legal documents, before being appropriated by a fair few 15th century northern Italian ciphers (a category into which I happen to believe the Voynich falls). If even some of this is right, then we’re facing not just substitution ciphers, but also a mix of steganography and space transposition ciphers, all of which serves to make modern pure statistical analysis far less fruitful a toolbox than it would otherwise be for straightforward ciphers.
* * * * * * *
Personally, when I give talks, I always genuinely like to get interesting questions from the audience (rather than “hey dude, do you, like, think aliens wrote the Voynich?”, yet again, *sigh*). So if anyone reading this is going along to Knight & Reddy’s talk at Stanford and feels the urge to heckle (sorry, to ask interesting questions that get to the heart of what they’ve been doing), you might consider asking them things along the general lines of:
* what transcription they are using, and how reliable they think it is?
* whether they consider spaces to be consistently reliable, and/or if they worry about how to parse half-spaces?
* whether they’ve tested different hypotheses for irregularities with the first word on each line?
* whether they believe there is any evidence for or against the presence of transposition within a page or a paragraph?
* whether they have compared it not just with abjad and vowel-less texts, but also with Quattrocento scribally abbreviated texts?
* whether they have looked for steganography, and have tried to adapt their tests around different steganographic hypotheses?
* whether they have tried to model common letter pairs as composite tokens?
I wonder how Knight and Reddy would respond if they were asked any of the above? Maybe we’ll get to find out… 😉
Or you could just ask them if aliens wrote it, I’m sure they’ve got a good answer prepared for that by now. 🙂
Hi Nick
I’ve been looking for stats on two things – if you or anyone here can offer a link, I’d be glad.
(i) maximum line-length for the text and also if that’s possible, any evidence of standard margin-widths.
(ii) details of variation in spaces between lines. I’m wondering if space left between each 5th and 6th line (or 6th and 7th) might be consistently wider or narrower than the rest.
Diane
TL;DR version of your very thorough post: garbage in, garbage out.
Ivan: actually, the post was more about the opposite side of the coin – that if you don’t want garbage out, be very choosy about the garbage you bring in. 🙂
Diane: maximum line length very much depends on how you join the EVA strokes together to make actual letters, so it’s a fairly subjective measure. And I don’t think anyone has measured things in a way that will give you the second datum you’re looking for – Philip Neal mentioned having added some additional metadata to the copy of the transcription he was working with, but I don’t recall those metadata including physical measurements.
It would be easier if a scale appeared in the Beinecke zoom, wouldn’t it?
Yes, when I said line-length, I did mean in mm. or inches.
Suppose I could try a letter to the Beinecke..
Thanks.
D.
I would go, but I am on the other side of the country, in New York. There are a couple of Voynicheros in California, though, so I’ll mention it elsewhere. Represent.
Nick, I wish I could attend Knight and Reddy’s event at Stanford (I live not so far away). Howsomever, because of my dependency on reading lips (and more often than not ending up at the back of the auditorium) I’m somewhat useless at live performances/lectures. I do have an acquaintance at UC Santa Cruz who might be interested in their talk. I’ll try to contact her by phone after I finish this note.
As far as the use of the various-sized “4” and “9” figures as cipher goes, here is my translation of at least those figures, as well as of the “&”, as I use them when translating whole discussions:
8 = aes
& = aes
9 = g or k
tiny 9 whose loop crosses to the back of the down-stroke = ex
c = c
smaller c (which sometimes has a bar attached) = e
The elaborate large curlicued “P” represents “B” or “P”, capital B or P which can also imply full words such as “Pliny”, “Botticelli”, “Prescription”. One can also “tack onto” that large initial P various other curlicues which gives you such words as Especially, or “Especies”.
The difference in the alphabet letters “R” and “S” is very small: If the letter looks like a question mark without the dot, it is the Volsci/Cyrillic sibilant capital “C”. If the letter looks like a backward capital S, it is the letter “r”.
The latin letters “U” and “V” are represented by “eo” or “oe” in context to what word is being expressed:
Ennyway, what I have discovered lately, is that Athanasius Kircher may have been somewhat “Latin-impaired” when he was identifying various elements of the Alban Hills and Lakes.
Still fun. I hope you will “dwell upon” my alpha-linguistic-cipher notes for a little while at least — and compare them with your EVA and D’Imperio’s nonsense. Oh yes, there’s a word for you: “dwell”. Several weeks ago, I tried to demonstrate how one could stretch that “ell” glyph (you called it a “bracket”) and insert it into another word in that sentence.
The “loopy double l” phoneme has manifold uses, in that other syllables can be tacked onto either side without having to add any other vowels. Same thing goes for the “tl”.
Example: xmpl (“ex” is represented by that tiny “9”, “m” is represented by what looks like a fish-hook with two points).
n e oea ave fun decoding y’all !
🙂
In response to Diane’s query re line lengths and spacing:
What I’ve found when it comes to the Water Lily folio and the crocus folio, so far, is that the discussion for each of the botanicals seems to first identify each specimen’s historical uses. Later discussion further down the “stem” appears to be brief references to historical or philosopher/mythologists works/legends.
Folio 35r Crocus (saffron, and legend of Crocus & Smilax)
Folio 11v : Possibly a mulberry fruit, which I am focusing on today because there are only 6 lines of discussion!
Later!
bd…..
Rich
Most people interested in this manuscript read Nick’s blog, I expect, and will share details here if they can and wish.
Otherwise, I have friends close to the library who might spare some time if all else fails.
D
I’ve been drafting an email to send to Kevin (we know each other from grad school), and figured I’d share the current draft:
Kevin,
After having it in my pile of things to read, I finally had a chance to read through “What we know about the Voynich Manuscript” in detail, and thought I’d offer some friendly feedback/suggestions (basically, a cross between what I would have written as a reviewer and some info you may find of interest). Hope you find this useful. Any new findings in the talk you’re giving this week?
• Choice of transcription alphabet. While I’m partial to Currier myself, decisions regarding whether Currier ‘X’ is a single character or some combination of ‘S’ and ‘F’, or that ‘G’ is a single character rather than ‘IE’ impact entropy calculations, word length distributions, HMM results, etc. While Currier justifies his choices in the paper you cite (http://www.voynich.nu/extra/curr_main.html), this is a point worth mentioning.
• Transcriptions. While I’m not a huge fan of the European Voynich Alphabet (EVA) (http://www.voynich.nu/extra/eva.html), the interlinear transcription at http://www.ic.unicamp.br/~stolfi/voynich/Notes/062/L16+H-eva/text16e7.evt is worth looking at for several reasons: a) it integrates multiple transcriptions including (but not limited to) those behind currier.now, b) translation from EVA to Currier is straightforward using ‘sed’ or a similar tool, and c) the comment blocks contain accumulated data on suggested plant IDs, etc. from various sources. Another transcription worth looking at is Glen Claston’s (http://notakrian.pbworks.com/f/voyn_101.zip). When it comes to recording variants while transcribing an unknown script there is a tension between “splitters” and “groupers”, and Glen is a “splitter” (partly because he was pursuing Leonell Strong’s polyalphabetic theory, which involves changing alphabets as well as a cyclical shift sequence, and Glen believed that some of the variants signaled alphabet shifts). Also, IIRC, Glen’s was the first transcription done from Yale’s hi-res color scans.
• With regard to “word” length distribution, a number of people (myself included) have suggested that some spaces are inserted according to some rule(s), at least partially explaining the correlation between word endings and beginnings pointed out by Currier. Slightly more than a fifth of the spaces in the Biological B pages in D’Imperio’s transcription, for instance, are a ‘9’ followed by a ‘4’, and there are a very small number of ’94’ digraphs without spaces (possibly scribal errors). Whether inserting spaces between some specified set of letter pairs in a plaintext in this way produces something more like a binomial distribution, as seen in the Mss, is a testable hypothesis, but I haven’t tried it yet.
• With regard to HMM models of the Voynich Biological B text, you should take a look at Mary D’Imperio’s paper “An Application of PTAH to the Voynich Manuscript” (http://www.nsa.gov/public_info/_files/tech_journals/Application_of_PTAH.pdf).
• You cite Gabriel Landini’s paper in your references, but I didn’t see a discussion in Section 5.3 or elsewhere of his findings re: long-range correlations in the Mss. Also of interest in that regard is Mark Perakh’s paper applying the Letter Serial Correlation test to the Mss. (http://www.talkreason.org/Mark%27s%20sites/Mark%27s%20perakm%20site/members.cox.net/marperak/Texts/voynich2.htm).
• An analytic tool I tried but didn’t write up (although I don’t know that it produced any deep insights) was applying algorithms for learning certain classes of regular languages from only positive instances to the vocabulary of Biological B “words”. In particular, I tried one of Angluin’s algorithms for learning a k-reversible regular language from positive examples, and Garcia & Vidal’s algorithm for learning the k-Testable Language in the Strict Sense family of regular languages (http://users.dsic.upv.es/grupos/tlcc/papers/fullpapers/GVO90.pdf). I’m not sure if better algorithms for learning REs from only positive examples have been developed since.
• You say in Section 5.1, “Notably, the text has very few repeated word bigrams or trigrams, which is surprising given that the unigram word entropy is comparable to other languages.” There is some possibility that multiple letters in Currier’s alphabet are actually the same character. For instance, if there is a “word” in the Biological B vocabulary containing an ‘S’, with high probability the same word occurs with ‘Z’ substituted for the ‘S’, suggesting they might actually be the same letter. It has been suggested that some of the “gallows” characters may be interchangeable (f57v, for instance has both ‘B’ and ‘V’ in the same position in the repetitions of the key-like sequence). Failing to combine such interchangeable letters (if they are in fact interchangeable) will obviously reduce the number of longer repeats.
• A cipher hypothesis you don’t mention is what folks on the Voynich mailing list generally refer to as a “verbose (monoalphabetic) cipher”, in which some plaintext letters correspond to combinations of multiple Voynich characters. For instance, if u = ‘S’, r = ‘C89’, o = ‘4OF’, b = ‘CC89’, and s = ‘CC9’, then “ZC89/4OFCC89/4OFC89/4OFCC9” on line 13 of f77r would correspond to the word “uroboros” (I’m not suggesting that as a serious crib, but as an illustration of the idea). This type of cipher would explain the low h1 and h2 values. If this is the type of cipher involved, the trick is figuring out what the breakdown into combinations is — the “words” 4OEAM, OEAM, and EAM all occur: are they (4O-E-AM, OE-AM, and E-AM) or (4OE-AM, O-E-AM, and E-AM), etc? An attempt has been made to apply genetic search to this problem (http://voynichattacks.wordpress.com/tag/genetic-algorithm/).
• Nick Pelling offers some observations on your paper at http://ciphermysteries.com/2013/03/09/this-week-a-talk-at-stanford-on-the-voynich-manuscript
• His blog also mentions a new paper on arXiv.org called “Probing the statistical properties of unknown texts: application to the Voynich Manuscript” (which I haven’t had a chance to read yet).
• Have you accumulated a list of on-line machine readable corpora of 15th cent. texts in various likely languages? I’ve found http://quod.lib.umich.edu/c/cme/ for Middle English, but haven’t found a similar resource for e.g., early 15th century Latin herbal or alchemical mss.
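The space-insertion hypothesis in the “word”-length bullet above (spaces inserted according to some rule, e.g. between a ‘9’ and a following ‘4’) is straightforward to sketch. The transform below runs on an invented Currier-style string, purely for illustration:

```python
from collections import Counter

def insert_spaces(text, pairs):
    """Insert a space between each occurrence of the given letter pairs."""
    out = []
    for a, b in zip(text, text[1:]):
        out.append(a)
        if (a, b) in pairs:
            out.append(" ")
    out.append(text[-1])
    return "".join(out)

# Invented ciphertext-like string, for illustration only.
stream = "4OFCC894OEAM9CC894OE"
respaced = insert_spaces(stream, {("9", "4")})
print(respaced)
print(Counter(len(w) for w in respaced.split()))
```

The testable question is then whether some rule set applied to a real plaintext reproduces the binomial-like “word”-length distribution seen in the Mss.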
Hope this is useful feedback/info — don’t decipher the Voynich before I do :->
Karl
It is somewhat simpler:
Some weeks ago I published statistical reports, based on EVA. For this I used my new N-gram software “ngraman”, where N can represent an arbitrarily high ordinal number.
If this is applied to EVA (N=25), with spaces and other special characters ignored, you will get absolutely all the lexical items (words, phrases, particles):
http://goo.gl/iyqiY
BTW. The differences between A and B Currier languages were determined similarly
(N = 14):
goo.gl/cUXq1
And anyone who distrusts my tool should have a look at its application to modern literature (Joyce’s Ulysses):
http://goo.gl/sWCkS
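For anyone wanting to try this kind of analysis without Joachim’s tool, here is a minimal Python sketch of an n-gram count with spaces ignored (run on an invented EVA-style sample, not the ngraman software itself):

```python
from collections import Counter

def ngrams(text, n):
    """Count all character n-grams, with spaces ignored."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

# Invented EVA-style sample, for illustration only.
sample = "shedy qokeedy shedy qokedy"
print(ngrams(sample, 2).most_common(3))
```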
Joachim: EVA was explicitly designed as a stroke-based (or, more accurately, component-based) transcription, which means that (for example) “ch” and “sh” are each transcribed as two separate partial characters, despite their plainly being a single character when actually written.
EVA’s authors did this not to encourage people to run statistical tests on the EVA corpus, but to encourage researchers to try out different hypotheses about what the correct alphabet should be (i.e. once you’d combined groups of strokes such as ‘ch’ into a single token).
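To put that concretely, here’s a minimal Python sketch of the kind of re-tokenization I mean: collapse one hypothesized set of stroke groups into single glyph tokens before counting anything (the grouping below is just one hypothesis among many, which is precisely the point):

```python
import re
from collections import Counter

# ONE possible glyph grouping of EVA strokes -- a hypothesis to be tested,
# not an established alphabet (that alphabet is exactly the open question).
# Longer groups come first so the regex alternation prefers them.
glyphs = ["cfh", "cth", "ckh", "cph", "ch", "sh", "qo", "ee", "ii"]
pattern = re.compile("|".join(glyphs) + "|.")

eva = "shedy qokeedy chedy"
tokens = pattern.findall(eva.replace(" ", ""))
print(tokens)
print(Counter(tokens))
```

Only once strokes have been combined into tokens like this do frequency counts, entropy figures and the like start to mean anything; run them on the raw stroke transcription and you are largely measuring the transcription scheme, not the text.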
Addendum:
When many lines of commentary end with the “8” and “9” characters (especially in the botanicals) you are looking at the sentence ending (or nomenclature ending) aes ceus. Some instances of the same phrase are represented by only the tiny “9” = “ex” or “excuse” or “exeus”.
Another botanical I’m getting ready to solve/interpret is what even the Beinecke tentatively identifies as an “artichoke”. Nope, never seen an artichoke with dried stem top. B-u-u-t — maybe a tree fruit – mulberry? Morus Alba? “Silkworm food” (the leaves, anyway)? We’ll see. I’ll keep you posted.
One item of interest, which I didn’t find in the botanicals or pharmaceutical sections, is the fruit of the mandragore. See Beinecke 408, folio 83v. The script which appears beneath the globes tells us that the watered-down fruit juice can ease pain.
So, Nick, my reason for posting all of this is so that maybe you or a “Voynichero” may be able to observe the “Knight-Reddy” presentation at Stanford with a “newer point of view”. Too bad I can’t participate (I wouldn’t be able to get close enough to the podium to read the speaker’s lips).
Nick, does that mean that you despise any statistical analysis based on EVA, not only mine?
If one glyph is expressed by several roman letters, it means for my N-gram tool only that the N is to be incremented. (The internal comparison processes always remain the same). And the determined frequencies of di- and tri-grams may indicate then one single letter to choose from.
Joachim: “despise” is the wrong word, “despair” is far closer.
It’s just that when I saw your list of n-grams headed up by “ch” (with “sh” not far behind), it did make me feel as though I hadn’t really managed to get my central point across in a 2000-word post. Which is this: that the whole point of statistically analysing an EVA stroke transcription is to work out what the best non-stroke transcription is.
The practical problem is that almost all Voynich researchers seem to have lost sight of this. 🙁
Karl: thanks for cc’ing that here, hopefully someone will attend the lecture and let us know KK’s & SR’s responses…
Nick & friends,
I’m hoping that “someone” will “consideringly” read my responses on this particular discussion page, and take a “laundry-list” of questions to be directed at Knight and Reddy (if K & R will even open a Q and A dialogue).
I THINK I understand that you would like to keep the “Voynich Manuscript” a mystery for at least a few more years. So, my next adventure will be a visit to the historical museum in San Jose California. As part of my duties as Senior Records Clerk in the City Clerk’s Office, I indexed every item of the public records. I was also responsible for overseeing the yearly microfilming of those records (for safe storage). It is one reason why UC Berkeley has microfilm copies of the earliest missionary correspondence between “headquarters” in Spain and Portugal (and maybe Rome/Frascati).
What is most aggravating to me, is that I never set eyes on either the manuscripts themselves, nor got to read the microfilm contents! Long story! But I am now considering contacting San Jose’s Historical Museum in order to schedule a viewing of the documents which are now in “climate controlled” preservation cases for scholarly review.
I’ll do my best to see if there is any resemblance to Beinecke MS 408. (In the 1970s the City of San Jose contracted with a professional translator, who was unable to do a meaningful translation. His reason was that the San Jose documents were written in a “clerical hand” used by the clerks of the Royal Courts.)
It may be months before I can write. Maybe this will give you “breathing room”?
Cheers ! 🙂
Nick: I can partly understand you. You do not like ‘ch’ and ‘sh’ as an eye-catcher. Therefore I have an N-gram analysis for you, starting with the longest possible terms (including word breaks):
http://goo.gl/CqSyi
For codebreakers a problem remains: where do we see the text-scrambling that results from encryption, when the lexical items are ordered so much like natural language?
Joachim: EVA is a stroke-based interim transcription, a precursor to a glyph-based transcription. So when I see frequent strings starting “h…” (i.e. partway through ch / sh / cfh / cth / ckh, cph), I feel extremely uncomfortable making any sort of inference from them.
In my opinion, the two questions that need considering most in order to solve the Voynich Manuscript are:-
(1) How did the original author group EVA strokes together to form complete letters? (And what evidence do we have to support whatever conclusion we draw about this in preference to other possible final transcriptions?)
(2) What is the correct order / sequence for parsing letters in a word / line / paragraph / page? (And what evidence do we have to support our conclusions on this?)
bdid1dr:
A question to K&R:
What is their opinion (if any) about:
“Discussion and conjectures” by Prof. J. Stolfi
(which agrees completely with my own investigations):
http://www.dcc.unicamp.br/~stolfi/voynich/00-06-07-word-grammar/#s.disc
“..severe constraints on cryptological explanations”
“Semitic languages such as Arabic, Hebrew, or Ethiopian could perhaps be transliterated into Voynichese, but not by any straightforward mapping.”
Not straightforward, I agree, as I found different meanings for EVA-k, single letters for “ii”, “ch”, indefinite word separation etc.
I can certainly prove that the Voynich is not encrypted, by the observed N-gram distribution of the lexical items.
Joachim: if you think the text isn’t written straightforwardly, then surely you are saying that it *has* been encrypted (i.e. in the sense of “hidden” or “concealed”)?
Your n-gram test results indicate (correctly) that a lot of structure is present, which is precisely the kind of thing more modern cipher systems (such as polyalphabetic substitution) disrupt and destroy.
However, there are still a thousand other ways to hide text that predate all that kind of “data-flattening” mathematical trickery, so I can see no obvious reason to eliminate “old-fashioned” encipherment just yet. 🙂
Nick: Yes, that’s right, and in my documentation (about Arabic) I do myself speak of an encryption stage. And I’m still somewhat away from deciphering the VMs completely. The content seems kind of more cryptic than the language itself 🙂
BTW. Did you see my pdf on language A and B (significant) differences?
Joachim: no, I’m sorry to say that I haven’t seen your PDF on A/B language differences. Can you pass on a link to it, please?
Nick: The link to a pdf where A – B differences
are shown:
http://goo.gl/cUXq1
(only the 14-grams, for A complete).
Here’s a downloadable mpeg4 version of the Stanford lecture video if you don’t want to mess around with the streaming:
http://zodiackillerciphers.com/2013-03-13-Stanford_Voynich_Lecture.mp4
(addendum: video file size is 233MB)
Nick: after reading your article more carefully, I see an approach to clarify some things by statistical analysis.
I speak of n-grams: a language-independent analysis, identifying lexical items by their observed frequencies. So I can now submit an N-gram EVA evaluation that assesses all particles within an expression, even at all lower N orders. Important to know: when reading the EVA, all spaces and special characters are ignored.
The result here:
http://goo.gl/807jN
Nick and Friends,
While you’ve all been focusing on Knight and Reddy, I’ve been cruising “Carmina Brigiensia” and “Carmina Burana”. I have found a perfect match for that “berry” which appears on Beinecke 408’s folio 11v: Wikipedia has a very good discussion of:
Carmina Burana : “The Forest” – is an elaborate illustration, of which one feature stands out from all the rest. See for yourself, and you may be able to understand my translation of why that “mulberry” signifies so prominently, yet just as obscurely, in our VM-ystery folio 11V. It is all about the leaves of the white mulberry tree, which were “pabulumox”/fodder/food for the silkworm larvae until they began to spin their cocoons. In this case I guess I can pun that the proof is in the “pabulumox” (pudding?) 🙂
Happy St. Patrick’s Day! Though I am part Scots-Irish, I can still do the Irish Step Dances and reels (Google has it for their logo today).
Ennyway, my take on Knight-Reddy’s presentation, as far as I was able to view it online: pretty bad! (I read lips and body posture to get at least some clues to what is being discussed.)
So y’all will just have to limp along without me — ahem!
Proceed!
bdid1dr 🙂
If one supposed that the text’s line length = paragraph length, then occurrences of that
” pair of matching gallows (typically single-leg gallows) about 2/3rds of the way across their topmost line”
might serve the same purpose as the cartouche does in hieratic.
Whether it enclosed the name of a deity, person, animal, thing, or something having aspects of several (as stars do) might be an additional complication, of course.
diane
Diane: the cartouche shape means “name” whereas I suspect Neal keys are more like a general-form enciphered medieval bracket pair. But apart from that, we’re singing from the same (antiphonal) songsheet. 🙂
Dear Nick,
Thank you, I’m yet to be blessed with grandchildren.
Diane
Nick
I’ve just done what I should have done *before* offering the ‘cartouche’ comment.
i.e. googled “Voynich” AND “cartouche” …
so enervating.
I should put this into the forum, but replies there are rare.
Can someone explain why curious groups such as dain, dain, qokedy dain (etc.) are treated as a function of encoding, rather than an indication of original language?
I mean, why does discussion not focus on languages which regularly show a pattern of a, a, b, a, a?
Similarly with char-groups that appear only in certain positions?
On the other side of the coin, why is ‘4o’ taken as a feature of the language rather than a result of encipherment (is encipherment a word?), while the top-line gallows are taken the reverse way round (I’m sure ‘reversely’ is not a word!)?
Illustration: suppose the ‘4o’ translates ‘and’. Sometimes it would appear within a word, ‘candy’, but its appearance as initial ‘And…’ would reflect a usage natural to some languages and not others. (In English it is found in that position regularly only in translation, and mostly of Biblical texts or renderings from Aramaic.)
What if dain, dain, qokedy daiin was equivalent ~ in one or another language ~ to e.g. ‘Verily, verily, I-say-to-thee, with VerilyPlus… [that shall such and such occur]’?
So why does no-one seem to spend time on matching those patterns with various grammatical forms?
This reminds me of an interesting paper written in 1995 by Clive Holes, ‘The Structure and Function of Parallelism and Repetition in Spoken Arabic: a sociolinguistic study’, Journal of Semitic Studies, Spring issue, pp. 57-82 in my p/copy, though the bibliography looks a bit short.
On p.79 he said,
But there are cases of interaction – I’ll cite my usual example of Majid’s treatise on navigational astronomy and method.
Diane
An interesting paper about internal assonance and rhythm in spoken colloquial Arabic; and a substitution cipher from Spain which some thought might prove to be Voynichese:
http://languagelog.ldc.upenn.edu/nll/?p=4337
Not my area – so no comment except that this paper by William Pourquet may interest people working on Voynichese
http://gtalug.org/wiki/Meetings:2013-04
(On that site, the paper is available in pdf)
I wonder about the possibility of the author not being able to read his own manuscript. Personally I doubt this is the case; nevertheless I thought I might raise it.
It could be that, in his fervour to produce a watertight cipher, he overdid it and produced something so complex and intricate that he couldn’t read it back himself. It could be that in the process of writing it he piled new cipher rule upon new cipher rule, drowning in rules, confusing himself, and failing to follow his own rules correctly. If he made it so complicated, and possibly so confused and muddled, that he couldn’t read back his own manuscript, it would help explain why we find reading it so difficult.
This is just a thought, and whilst it is probably not the case, it is certainly plausible, I think. Naturally I hope it is not the case.
While this is an oldish post to be commenting on, there are some methodological issues it raises that have been on my mind recently and it seemed a more apropos post than some of Nick’s more recent methodology-related items.
* Re: Nick’s “staging point” model for solving the Voynich Manuscript, EVA 2.0, etc. — working from both ends is a useful way to build a bridge (or a tunnel, although not a tower :->), and there are some “other end” issues that it’s critical to think about if you’re interested in solving the Voynich. For instance…
* Nick asked in this post, “whether they have compared it not just with abjad and vowel-less texts, but also with Quattrocento scribally abbreviated texts?” That’s a really good question — and to answer it, you need a machine-readable corpus of 15th Cent. scribally abbreviated texts. In general, any research program with a serious shot at solving the Voynich will need machine readable corpora of various 15th century languages. Accumulating a list of resources for that should be a priority. U. Michigan has a good site with Middle English texts (URL in an earlier comment of mine above); several years back Rich SantaColoma mentioned http://titus.uni-frankfurt.de/, although that doesn’t have much for 15th Century.
* Which leads into what I think is a key point re: methodology — if you have a cryptanalytic hypothesis regarding the Voynich Mss., *actually trying to decipher the Mss. should be the very last step in your research program*. There are two reasons for this: 1) Whatever tools you’ve developed to assist in the decryption, you need to know the distribution of results you get when you apply the tools to texts where you know the answer, otherwise you don’t know whether the result you get using the tools on the Voynich actually “rings the bell” as it were; and 2) it helps prune dead ends up front — if you think the Voynich is some kind of polyalphabetic cipher at the glyph level, then you should probably think about how to get such a cipher to generate a ciphertext with 2nd order entropy lower than natural language plaintexts before you invest an enormous amount of time in trying to crack the Mss.
* Let me toss out what I suspect Nick will consider an extremely questionable proposition for discussion: It’s rational to assume (at least initially) that “Neal keys”, “Grove words”, etc. are not critical to understanding the nature of the Voynich cipher (if cipher it is) — not because that is necessarily true, but because if it isn’t then the odds of figuring out how to decipher the Mss are slim to zero. Yes, there is an element of the old joke about the drunk looking for the keys he lost in the alley under a streetlight “because the light is so much better”. On the other hand there is probably value in testing and eliminating simpler hypotheses first (and doing so may shed light on what’s going on in the cipher if it is more complicated).
* Nick writes, “I’ve also long believed that many pairs of Voynichese letters (al / ol / ar / or / ee / eee / ch, plus also o+gallows and y-gallows pairs) encipher a single plaintext letter.” — I share Nick’s views on this; examining the digram stats for Biological B suggests that the tall pole in pursuing this line of thought is figuring out how to move beyond the obvious A[ERMN] and O[ER] combos (sorry, I speak Currier) to figuring out how to break apart [FP][SZ]C*89 words.
* In Nick’s “Voynichese Task #1: moving towards ‘EVA 2.0’…” post, he raises the problems associated with the legacy EVA interlinear transcription. He’s absolutely right that many of the transcriptions there were converted to EVA from older transcription alphabets and were based on lower quality data (the copyflo, etc) than later transcriptions, *but*…reproducibility is a key issue with transcription, and interlinears are a useful tool for estimating sections of text that may need a closer look. (An advantage more recent transcribers have over the FSG, etc., is the ability to compare a transcription rendered in a Voynichese font with the text in the images — Glen Claston did that when developing his transcription.)
Enough semi-random musings for now…
Karl
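Karl’s point that a glyph-level cipher hypothesis should first be shown to reproduce Voynichese’s famously low second-order entropy can be checked cheaply before any decryption attempt. A minimal sketch of the h2 (conditional entropy) statistic, purely my own illustration and not code from any of the papers discussed:

```python
import math
from collections import Counter

def second_order_entropy(text):
    """Conditional entropy H(X_n | X_{n-1}) in bits per character,
    the 'h2' statistic usually quoted when people say Voynichese
    has unusually low second-order entropy."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    total = sum(pairs.values())
    h = 0.0
    for (a, _b), count in pairs.items():
        p_pair = count / total      # P(a, b)
        p_cond = count / firsts[a]  # P(b | a)
        h -= p_pair * math.log2(p_cond)
    return h

# A fully predictable sequence has h2 = 0; varied text scores higher.
print(second_order_entropy("abababab"))      # 0.0
print(second_order_entropy("abcdabdcacbd"))  # > 0
```

Running a candidate cipher over known plaintexts and comparing the resulting h2 against the Voynichese figure is exactly the kind of “know the distribution before you ring the bell” test Karl describes.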
Karl: I indeed also don’t believe that a decryption assault need answer all the questions before pressing the big red Start button. But almost all of the most basic cryptanalytical tools – word length, frequency counts, contact tables, HMMs, etc – rely directly and irrevocably upon a preceding assumption about how that text should be parsed. And I therefore strongly believe that most analyses carried out to date are close to worthless.
Transcriptions are both a blessing and a curse, because too many people use EVA ‘raw’, ie without pausing to think about parsing. In many ways, we still haven’t properly begun to analyse Voynichese. 🙁
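The difference between using EVA ‘raw’ and parsing it first can be made concrete with a toy tokenizer. The candidate verbose groups below come from Nick’s list (ol / al / ar / or / ee / ch, etc.); the greedy longest-match order is my own arbitrary choice, not an agreed rule:

```python
import re

# Candidate verbose groups; longest first so 'eee' wins over 'ee'.
GROUPS = ["eee", "ee", "ol", "al", "ar", "or", "ch"]

def parse_verbose(word):
    """Split an EVA word into tokens, treating the candidate verbose
    groups as single units; anything else stays a single glyph."""
    pattern = re.compile("|".join(GROUPS) + "|.")
    return pattern.findall(word)

print(parse_verbose("chol"))    # ['ch', 'ol']: two tokens, not four glyphs
print(parse_verbose("qokedy"))  # ['q', 'o', 'k', 'e', 'd', 'y']
```

Every downstream statistic (word lengths, frequency counts, contact tables) changes depending on which parsing is chosen, which is the point being made above.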
The importance of the hidden Markov results is grossly underestimated.
It is yet another tool that may help in finding the right way of parsing the text.
Note that the results that have been described by D’Imperio and by Reddy & Knight were not based on EVA but on the far less analytical Currier alphabet.
Doing some iterative study with this tool can lead to two possible outcomes.
– Either it is possible to come up with a parsing method that generates a text that shows vowels and consonants.
– Or it isn’t.
The way forward for both cases is quite different.
Rene: I am certain that the algorithms that try to construct HMMs are easily confused by verbose cipher. It may look obvious to us that (for example) there is a high chance that EVA ol or EVA al should each function as a (parsed) token, but this is not something that would emerge naturally inside an HMM.
The technical difficulty is that there is no distinction between verbose ciphertexts’ (unparsed) covertext properties and the same ciphertexts’ (parsed) syntactic and semantic properties. It’s a mess. 🙁
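The way a verbose cipher fakes linguistic structure at the glyph level can be shown with a toy example. The letter-to-pair table below is entirely invented, a stand-in for the ol/al idea rather than any proposed Voynichese mapping:

```python
from collections import defaultdict

# Hypothetical verbose cipher: each plaintext letter becomes a fixed
# pair of cipher glyphs.
TABLE = {"a": "ol", "b": "al", "c": "or", "d": "ar", "e": "ee"}

def verbose_encipher(plaintext):
    return "".join(TABLE[ch] for ch in plaintext)

ct = verbose_encipher("abcdeabcdeaabbccddee")

# Glyph-level contact table: which glyphs immediately precede each glyph?
preceders = defaultdict(set)
for x, y in zip(ct, ct[1:]):
    preceders[y].add(x)

# 'l' and 'r' are only ever preceded by 'o' or 'a': structure that a
# glyph-level HMM would happily read as phonotactics, even though it
# is pure covertext with no counterpart in the plaintext.
print({k: sorted(v) for k, v in preceders.items()})
```

The regularity an HMM would “discover” here belongs entirely to the enciphering table, not to the underlying language, which is exactly the confusion described above.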
Nick: I thought I should comment on the relative frequencies of the “4” character and the “4o” group. If the Voynich were like a standard diplomatic cipher, which we know it isn’t, then the difference in frequency would probably be due merely to the fact that “4” maps to an uncommon letter whereas “4o” maps to a common one. Whilst there is more going on in the Voynich, something along these lines may still be the case: that is, there may be no connection between the “4” and the “4o” except possibly that they constitute neighbouring letters of the alphabet. And even though the Voynich is not like a diplomatic cipher, there may likewise be no connection between the “4” and the “4o” other than their similar appearance. The same kind of situation could apply to other similar-looking characters in the Voynich.
As so often, if I can think of the right search term, there’s usually something here to relieve the intense irritation of having a question I can’t express well enough to have the tech’y types understand, let alone answer.
I’ve been bothered for weeks about the limits of statistical analyses determining language and for determining .. grammar or morphology etc.
I keep asking people what does it take before the analytical model gets broke? Trouble is, that most of those I’ve asked have for some reason all confused my various f’rinstances for theories. I say ‘would it break if the person were transcribing a language some of whose sounds they couldn’t hear – the way Europeans often cannot ‘hear’ tonal variations, or the difference between Japanese ‘d’ and ‘r’ and so on. Then the person replies something like ‘Oh, you think Voynichese is Japanese do you?’.. and that ends the conversation.
Would anyone here care to add to Nick’s explanation of what makes the analytical methods/programs break?
Because it seems to me that in the space between normal routines and expected outcomes, on the one hand, and Voynichese on the other, we have a case of things-that-break-the-program. So how long is the list of possibilities?
And extra thanks to Nick for outlining some in this post. Water in the research desert.
Diane: morphology plays a role in European languages, but not here. Ketiv degrades morphology: you have to complete the flexion in your head, passive and active forms don’t produce different endings, etc. One can see instantly that there is little space for morphology here. You have to put more attention on taxonomy, and primarily apply your statistical methods to target languages and not so much to Voynichese. In fact, you don’t even need most of the findings about Voynichese, like the putative Currier A/B difference…
Everything works fine with statistics, no breaks. Statistics will bring you to the target. Why bother for weeks? Read what you haven’t read from my docs, or read a good book (my recommendation: Nietzsche, The Gay Science, e.g. aphorism 355, “The Origin of our Conception of ‘Knowledge’”). The futile search in the VMs is explained there.
Dear Ms. Diane. Don’t worry at all. The manuscript is written in the Czech language, as is written on its many pages. It is not, of course, current modern Czech, but the medieval Czech language.
What’s important :
1. read the letters correctly.
2. to see well.
3. know the Jewish numerological system.
4. know the Czech language.
If a scientist does not know these four points that I have written here, he will never translate the manuscript. It won’t crack.
The author drew herself on a large folio, and as a fish; Eliška specified it, and she wrote = goldfish (of course in the Czech language). At the same time, she writes there that she waves her hand at everyone. The movement of the hand is clearly visible there. The fish has two blue eyes, which are also easy to see. That it is a goldfish is also easy to see.
When a scientist can understand why Eliška drew herself as a goldfish, then he has a chance to understand the style in which the text of the manuscript itself is written.
I’m just showing you things which you can understand, if you look at that picture.
Eliška writes in Czech. But the text is written with a Jewish substitution.
This means, for example, you see the letter = o. But since it is a substitution, you can read the letter = O. Or =Z.
( Jewish substitution number 7 = O,Z ).
Or you see the letter =a in the manuscript text.
( Jewish substitution number 1 = a,i,j,q,y ).
Or you see the letter = c in the text.
( Jewish substitution number 3 = c,g,s,l ).
That it is a Jewish substitution is shown to everyone (or rather, to those who know more) at the beginning of the manuscript, on page 2r.
The root is drawn there: C, G, S, L.
When a scientist cannot understand this, the manuscript is just a big waste of time for him.
I’ve been writing and showing you this for ten years. Maybe even longer. And you still don’t understand? This is all kind of weird. Can you see well?
I wish you much success in your endeavours. It takes more effort. Then it will surely work.
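The number-to-letters classes described above (number 1 = a,i,j,q,y; number 3 = c,g,s,l; number 7 = o,z) can be mechanised to show how many readings such a many-to-one substitution admits. A quick sketch, keyed here by one representative letter per class, purely for illustration:

```python
from itertools import product

# The claimed substitution classes, keyed by a representative letter.
CLASSES = {"a": "aijqy", "o": "oz", "c": "cgsl"}

def readings(ciphertext):
    """Every plaintext string compatible with a many-to-one
    substitution; the candidate count multiplies with each
    ambiguous glyph."""
    choices = [CLASSES.get(ch, ch) for ch in ciphertext]
    return ["".join(r) for r in product(*choices)]

print(len(readings("caco")))  # 4 * 5 * 4 * 2 = 160 candidate readings
```

Even a four-glyph word yields 160 candidate readings, so some independent constraint would be needed to pick one of them out.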
Prof: you know, of course, that Jaroslav Hašek died 100 years ago, just a few days back. The second-best Czech storyteller…
Darius –
“Statistics will bring you to the target” – except they haven’t, not since the mid-1950s when such tests were first applied to Voynichese
What I’d like to know isn’t so much about Voynichese, or the written text in the Voynich manuscript, as what factors might be responsible for modern analytical-statistical tests failing to render it less opaque.
In the post above, Nick mentions some of those factors.
Diane: I think the core of the problem is that we observe many different kinds of problems (LAAFU, BAAFU, etc) on all kinds of levels, all at the same time. Even though a typical Internet-style theory may partially account for one of these aspects, it is pretty much guaranteed to fail abysmally with the rest.
What I described in Curse 2006 was a kind of “architected” cipher system, made up of different types of secret writing, all neatly dovetailed to fit together. And I still suspect that this is what Voynichese will turn out to be – like a kind of ‘cryptic intarsia’ where everything locks together so tightly you can barely conceive that it was made from separate pieces of wood.
Nick,
I bow to your knowledge of ciphers and decryption, but must admit that such a system seems to me to fit ill with the manuscript’s appearance: that of a pocket-sized handy manual or reference, which shows signs of having been meant to be carried about, of regular use, and in some cases of hard use.
I have difficulty imagining that, if it were meant to be used by someone on the move or, so to speak, as a field guide, the person would be in a position (or inclined) to stand or sit trying to work out which cipher system was being used in a given section of a given page and then, even if they’d memorised all the cipher-types, working out what the passage meant. Hell, it would be easier just to commit the whole thing to memory and use the pictures alone as a memory-prompt.
If the manuscript were one made in the seventeenth century, or even late in the sixteenth (which the images do not support and neither, I understand, does the palaeography), then we might call it a lab notebook or something. But I can’t quite see it for the VMs, somehow, unless the key was linked to a basic series such as… I don’t know… say… whatever glyph is first on the page, or cycled through some series well known to the maker, like CISiOJANUS.
I’ve always thought well of a system that might use a version of the Guidonian hand. After all, if you could tap out and memorise an entire contrapuntal score that way, ringing those myriad possible changes, then it would be easy enough to use the same to do the equivalent with sounds-as-glyphs. But that’s no more than seeing dragons in clouds. I’ve never so much as looked into whether there’s historical evidence for it, though I admit it’s one reason that number-based cipher systems intrigue me.
Enough from me; I’m just very glad it’s your field and not mine. 🙂
H2LL4 D3An2
h2Ll4 d3An2
(h+2L/L4)=d3*a-n2
Hello Diane
This is cryptology. It all looks different, but it’s the same system.
Only the vowels have been replaced by numbers. A=1, E=2, I=3, O=4, U=5.
In the first case it is quite clear: everything is written in capital letters. In case 2 there are a few small letters, and in case 3 I just added a few mathematical signs. Here the illusion is that you have to calculate to understand it, which is of course nonsense.
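The vowel-for-number scheme stated above (A=1, E=2, I=3, O=4, U=5) is simple enough to mechanise. A quick sketch, my own illustration of the mapping as given, nothing more:

```python
VOWELS = {"a": "1", "e": "2", "i": "3", "o": "4", "u": "5"}
DIGITS = {v: k for k, v in VOWELS.items()}

def encode(text):
    """Replace vowels by their numbers; everything else passes through."""
    return "".join(VOWELS.get(ch.lower(), ch) for ch in text)

def decode(text):
    """Replace the digits 1-5 by their vowels again."""
    return "".join(DIGITS.get(ch, ch) for ch in text)

print(decode("H2LL4 D3An2"))  # 'HeLLo DiAne'
print(encode("audio"))        # '15d34'
```

Note the decoding always restores lower-case vowels, and the maths-sign variant in case 3 is indeed pure decoration: stripping the signs first reduces it to the same substitution.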
As far as the system is concerned, I have the same opinion as Nick.
I’ll give you an example. Manuscript page 78r.
Scientists see pipes here. Why do they see pipes? Because they can’t read handwriting.
When the scientist Darius, or another scientist, is able to read the text of the manuscript, he will read the word there = Ž.I.L.A. This word is Czech.
The magic of this word is that you only need to change the diacritical mark and you will read the word = Ž.í.l.a.
The author (Eliška of Rožmberk) writes to you on this page of the manuscript that 6 women (daughters) + 1 mother live in the castle.
(the wife of John II of Rožmberk gave birth to a total of 6 women, daughters).
So Eliška drew you = ž.í.l.u. ( English = vein )
She divided that vein into 6 parts. That means 6 daughters. + one part on the right side means mother.
All this is written in the text of that folio. When a scientist cannot understand this, he will never be able to find out what is written in the text of the manuscript.
So they are not pipes somewhere in Rome. It’s just an ordinary vein ( ž.í.l.a ). And it is divided into 6 parts (daughters) + 1 mother.
Everything is written there in the Czech language. But hidden by magic= Jewish substitutions.
Darius :-))). You write nonsense about the manuscript. You try, but it’s not enough. Work more. Fingers crossed for you, student.
And you can take a picture with Hašek at the Wailing Wall. 🙂
Otherwise, Darius: if you do not see the letters C, G, S, L on page 2r, well, that’s a shame. The author shows you this at the beginning of the manuscript. It shows the Jewish substitution!
I already wrote about it. The base of the plant is = the root!
The base of the word is = the letter!
A plant grows (emerges) from the root!
The word is formed = from letters!
Is it hard to understand? A scientist of your caliber should figure this out pretty quickly. Or not ?
Eliška therefore shows it at the beginning of the manuscript, for the scientist to understand quickly. Of course, there are different scientists: scientist A, scientist B, scientist C, etc. It follows that there are also scientists who have been trying to do this for, say, ten years or more, and are still at the beginning, with the goal out of sight for them. I just hope that’s not the case for you.
A neolithic cipher finally cracked.
https://www.bbc.com/news/uk-england-london-64162799
It just needed a fresh set of eyes.
Josef Zlatoděj, prof., to the Castle!
“He spent numerous hours on the internet and in the British Library consulting pictures of cave paintings and “amassed as much data as possible and began looking for repeating patterns”.
This is more likely to be the factor behind his discovery rather than simply being “a fresh set of eyes”.
So Eliška lived there with six daughters,
but had time to busy herself with scripts?
@Darius. It can be seen that you did not understand anything.
Eliška did not live with 6 sisters; Eliška had 5 sisters, so she lived in the castle with five sisters. Eliška’s mother Anna Hlohovská gave birth to 6 daughters. (This is shown on the page, and also written.)
Byron Deveson
Don’t tell Gordon Cramer. He’ll claim it’s micro writing. Someone was spying for the Neanderthals.
@Darius. You could perhaps read a word in the handwriting, that is, letters and words: (am aMaR), which means = (am aDaR = am Adar).
The letter a has the value of a number = 1.
The letter m has the value of the number = 4.
That means 14 Adar.
That’s the day and month. When was Eliška of Rosenberg born.
Otherwise, Eliška’s full date of birth. It is written and (drawn) at the beginning of the manuscript. It is also hidden in a symbolic plant.
Where Eliška writes: There are 14 green leaves. And there are six and six golden leaves.
If you look closely at the green leaves. So there you will see two letter characters. (J and T). The numerical value of these characters is 1 and 4.
So the year 1466 is hidden in the leaves.
Eliška’s full date of birth. 14 Adar 1466. ( 14 February 1466 ).
Is it hard for scientists to understand?
We in Switzerland are glad that the Neanderthal was spied out. Emmental cheese is protected under trademark law.
That would have legal consequences. You’re lucky!
Prof: you know, the word ‘amar’ in Aramaic means to say/speak (Voynichese ‘8a’). You had better learn the language of the Lord to understand the message.
You are right that the authors of the original material were Jews, but not your Czech Jews; they wouldn’t write for goys anyway. Only the involvement of Kafka would be somehow imaginable, but in that case… yes, on 85v must be the Kafka castle with merlons, the place where the Voynich committee sits and issues permissions for overnight stays, which so far nobody has been able to obtain.