I’ve spent some time this week working on Alexander d’Agapeyeff’s 1939 challenge cipher, which seems (to me) to be a tricky combination of substitution cipher and transposition cipher. It’s also not clear to me what language the plaintext will turn out to be in: English, Russian (d’Agapeyeff was born in Russia), French, or perhaps even a Saharan language (d’Agapeyeff spent some time working as a cartographer in the Lake Chad region).

In practice, this meant I needed a metric reflecting how well a given piece of (transposed) text resembles language (but without knowing the language, alphabet or the substitutions): its generic “languageness”, for want of a better term.

I started by using second-order entropy (H2) measures, but quickly became dissatisfied with the results. What I really wanted was a metric that was much ‘spikier’, i.e. one that would spike up sharply the more language-like a given transposition was. But what could I use?

Calculating Suffixity

I devised the idea of using (what I call) “suffixity” as a metric, i.e.:

  • Build a suffix array (a sorted table of the strings starting at each position in the text)
  • Sum up the matching prefix length of every pair of adjacent entries in the sorted table
  • Finally, normalise this total against the length of the ciphertext.

So for the (industry-standard) text string BANANA* (where * is the EOF character), start by listing the strings starting at each position in the text, wrapping around cyclically:

  • BANANA*
  • ANANA*B
  • NANA*BA
  • ANA*BAN
  • NA*BANA
  • A*BANAN

Then sort these strings alphabetically:

  • A*BANAN
  • ANA*BAN
  • ANANA*B
  • BANANA*
  • NA*BANA
  • NANA*BA

Then sum up the matching lengths of adjacent entries in your ordered array:

  • strmatch(A*BANAN, ANA*BAN) = 1 —> total = 1
  • strmatch(ANA*BAN, ANANA*B) = 3 —> total = 4
  • strmatch(ANANA*B, BANANA*) = 0 —> total = 4
  • strmatch(BANANA*, NA*BANA) = 0 —> total = 4
  • strmatch(NA*BANA, NANA*BA) = 2 —> total = 6

Finally, normalise this raw total (6) against the length of the text (6) to get a normalised suffixity of 6/6 = 1.0.

The higher the suffixity value, the more language-like the text.

C Implementation of Suffixity

Because suffix arrays are used in the Burrows-Wheeler Transform (as a data compression guy, I count this among my favourite compression algorithms), clever computer scientists have worked out various ways to generate suffix arrays in essentially O(n) time. And so I used Yuta Mori’s sais-lite library, which implements the induced sorting (SA-IS) algorithm (there are many others, but I was specifically looking for a simple C implementation).

My C function to calculate the matching length looks like this (no big surprises here):

static inline int strmatch(const uint8_t * a, const uint8_t * b)
{
	const uint8_t * base = &a[1];

	/* Walk both strings until the first mismatch: the unique EOF marker
	 * guarantees that two distinct suffixes always differ before the end
	 * of the buffer, so no explicit length check is needed. */
	while (*a++ == *b++)
		;

	return (int) (a - base);
}

My C function to calculate suffixity has one extra trick (which you may or may not agree with), in that it tries to avoid rewarding tripled letters (e.g. SEPIA AARDVARK). This is not a final implementation, but I think it’s reasonably close to capturing the core idea:

#include "sais.h"

int calc_suffixity(const uint8_t * pau8Text, int * suffix_array, int len)
{
	sais(pau8Text, suffix_array, len); // Construct a suffix array

	int suffixity = 0;
	for (int i = 0; i < len - 1; i++)
	{
		const uint8_t *a = &pau8Text[suffix_array[i+0]];
		const uint8_t *b = &pau8Text[suffix_array[i+1]];
		int s = strmatch(a, b);
		if ((s >= 3) && (a[0] == a[1]) && (a[0] == a[2]))
		{
			s = 0;	// Don't reward triple letters!
		}
		suffixity += s;
	}

	return suffixity;
}

You then normalise the returned suffixity value by dividing it by the length of the text. (It could be argued that you should divide it by (length – 1), but that’s an issue for another day.)

In summary, I believe that suffixity should be an effective metric to use with hill-climbing algorithms when tackling transpositions of unknown substitutions; but, as always, there may well be many other cryptological applications for suffixity that I haven’t yet thought of. 😉

Is Suffixity Already Known By Another Name?

Of course, it’s entirely possible that suffixity is nothing new. Given that my crypto bookshelves are focused more on cryptography (code-making) than on cryptology (code-breaking), I would be very happy to hear from crypto people who know the literature well.

All the same, O(n) algorithms for building suffix arrays are fairly new, so it would not surprise me if suffix arrays had played no real part in practical code-breaking to date. We shall see!

Future posts will try applying suffixity to various codebreaking challenges. I’m already able to say much more about d’Agapeyeff’s challenge cipher than I was before, so I’m expecting that that will be a post (or two) all on its own.

9 thoughts on “Introducing the “suffixity” metric (perhaps)…”

  1. milongal on April 18, 2021 at 10:12 pm said:

    My first reading of it asked “Does this tie into suffix trees in any useful way?” – although I soon realised that the only similarity is the word “suffix” – and that the preservation of all the characters in the string (that is, sticking the prefix on the suffix rather than just dropping the prefix) is what makes this interesting.

    I assume the whole idea here is that over a large text you would expect a lot of repetition – provided a relatively simple encryption was used or a very large ciphertext was intercepted. If I understand what you’re proposing correctly, I’m thinking it wouldn’t be very effective against a fairly “plain” enigma machine (e.g. 3 wheels, no reverser, no plugboard) unless you had a pretty significant chunk of ciphertext – or have I totally missed the point?

    I know you said it isn’t complete, but the triple-letter check only looks at the beginning of the string in the current iteration. I wonder whether you’d be better off modifying the strmatch method to baulk at the triple letter (it would complicate the logic a little in that method, but there’s part of me that thinks such logic more obviously belongs there, because the triple letter stops it being a match).

    I guess on the face of it I’m not really seeing whether this is much more useful than existing statistical analyses… but maybe when I’ve re-read it a few times it will dawn on me.

    A long time ago you posted us some challenges (from memory only 1 or 2 were ever solved). Have you tried measuring the suffixity of them to see what sort of results might occur?

    My 2c – random thoughts….

  2. milongal: to be precise, suffixity is designed for the scenario where you have an unknown transposition of an unknown alphabet and an unknown language. So not Enigma. 🙂

  3. milongal on April 19, 2021 at 8:29 pm said:

    I probably should make sure I read and re-read all the words. oops 😉

    I read the whole thing thinking you were proposing a generic analysis tool for all situations. my bad.

  4. milongal: no worries, it’s a pretty specialist post. 🙂

  5. Hi Nick,

    I did not read all (sorry), but on the point related to: “dividing by length” vs. “dividing by length-1”, what I have done in entropy calculations is to ‘circularise the text’. This means: pretending that, after the end, it restarts at the beginning.
    As a consequence, if L is the length of the text, then you also have L bigrams, L trigrams, etc. and you can always just divide by L.

  6. Interesting metric, Nick – I think it would be a good statistic to add to the crypto toolbox.

    A problem I can think of from a hillclimbing perspective is that the search will try to maximize repetitions. Let’s say we have a plaintext such as “WE ARE UNDER ATTACK PLEASE SEND HELP IMMEDIATELY”, and it’s simply transposed via columnar transposition. Our generalized solver might decide to un-transpose it to “AAAAACDDDEEEEEEEEEHIIKLLLMMNNPPRRSSTTTUWY” to maximize adjacent repetitions which (I assume) maximizes the suffixity metric. A worse case would be if substitution is also permitted during the search. Then to maximize the metric it could simply find the key that produces the plaintext “AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA”.

    I haven’t actually verified that these situations would maximize the metric; maybe I’m misunderstanding the math behind it.

  7. Hi Rene,

    Because this process involves a unique endmarker (EOF), it actually prevents any matches from going past the end of the text. I was actually thinking more about statisticians and their involved technical discussions about degrees of freedom etc. 🙂

    Cheers, Nick

  8. David Oranchak: that is exactly why I modified my suffixity metric to try to avoid matching the kind of linear repetitions you highlight. When I tried using suffixity without this modification, it would often reward tripled (or longer) letter sequences, just as you predicted. 😉

  9. Stefano Guidoni on June 24, 2021 at 8:42 am said:

    I think this needs some work. When you have to use some hack (the triple sequence special case), usually it means that the metric is ill-defined.

    There is one doubt I have: do you count circularity? E.g. given “nana”, you get:
    a 0
    ana 1
    na 0
    nana 2
    or:
    a*nan 0
    ana*n 4
    na*na 0
    nana 4

    Besides this, I also have bad news 🙂 for you. A guy asked a solution to a problem like this 10 years ago: https://stackoverflow.com/questions/8546441/suffix-arrays-in-perl
