Readable IDs

When checking a public key for authenticity, one usually has to compare its fingerprint. These fingerprints are essentially random and are not always easy to remember or to compare quickly. One approach is to append a nonce to the key and vary the nonce until the base32-encoded ID becomes readable. However, this is hard to apply to existing IDs, so I wrote a little tool that converts existing IDs into something "readable". It is available here: idfy.c (new, 10.Feb.2014) (old version: analyze.c)

To run it, you pipe it some English text (any language works, but it should consist mostly of US-alphabet characters). The program counts how often each pair of letters occurs, sorts the pairs and builds tables from the most popular ones. With gcc, compile it like this:

#> gcc -Wall -Wextra -o idfy idfy.c

On Linux, run it like this:

#> cat test1 | ./idfy

Here test1 is a file containing text in a single language (around 30k characters have proven sufficient; online newspapers are a good source). The ID that is encoded can be changed in the source code. Of course this is not as space-efficient as trying to get a "nice" ID directly, but it works for existing IDs - two people just have to agree on the text that is piped to the program (or rather on the tables that are generated from it, see the source).
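For illustration, a minimal sketch of this counting step could look like the following C program. The names and the table layout here are assumptions for the example, not the actual code from idfy.c; it only shows the idea of counting how often each pair of letters occurs in the piped-in text.

/* Sketch of the table-building step: count how often each pair of
 * letters occurs in the text piped in on stdin. Illustrative only,
 * not the actual code from idfy.c. */
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    static unsigned long pair_count[26][26];
    int c, prev = -1;

    while ((c = getchar()) != EOF) {
        if (!isalpha(c)) { prev = -1; continue; }
        c = tolower(c) - 'a';
        if (prev >= 0)
            pair_count[prev][c]++;   /* one more occurrence of the pair (prev, c) */
        prev = c;
    }

    /* Just print the counts here; idfy would sort them and keep the
     * most popular pairs for its tables. */
    for (int a = 0; a < 26; a++)
        for (int b = 0; b < 26; b++)
            if (pair_count[a][b])
                printf("%c%c %lu\n", 'a' + a, 'a' + b, pair_count[a][b]);
    return 0;
}

It can be compiled and fed text the same way as idfy, e.g. cat test1 | ./pairs.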

An example of a 32 byte (256 bit) ID with the old implementation is

raelou - rbusou - foofte - fruito - ahdaub - zigtei - feinie - nsoupr - ebvibs - ozzian - riakia - griekl - eantiz - ea - 3

The new version (idfy) does not just string more-or-less random pairs of characters together; it selects the next character depending on the character selected last and on the input data. Letters that are more likely to follow the last character are more likely to be chosen as the next character, and letters that never follow each other in the text will never follow each other in the resulting ID. A few examples are below:

eammmpa - veikue - mbueax - enosas - ujoqun - olisis - ommbuc - ifsuav - ivorep - lireus - idrspr - tobo - 0 ("french")

eckuefz - ahmtaa - getzzt - taluec - oknegi - reiebr - dasihh - mpebsu - grogri - ndltag - gerlkt - 0 ("german")

excrgal - lktndl - medgdn - gicevo - kaviof - atersn - iotuoa - uiepoc - elmior - eftotn - geymea - 2 ("english")

All of these encode the same ID, using texts in different languages as input. You might get similar results with the same ID and a different text in the same language, since the frequencies of character pairs are similar across texts of one language. The ID used can be found in the source code; it is

{77, 123, 89, 150, 111, 22, 208, 119, 251, 66, 25, 173, 241, 226, 133, 64, 65, 37, 210, 167, 189, 118, 77, 140, 45, 165, 4, 51, 21, 120, 14, 97}

This is admittedly not very short. The number at the end is the count of extra bits used to create the final syllable (when the data ends, zeros are appended).
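To illustrate how the ID's bits drive the output, here is a toy sketch in C. The alphabets, bit widths and the simple vowel/consonant alternation are made up for this example; idfy instead looks the next character up in a table that was derived from the piped-in text and that depends on the last emitted character. The sketch does show the zero padding at the end and the trailing bit count.

/* Toy sketch of the decoding idea: walk through the ID bit by bit and
 * let the previous character decide which table the next bits are
 * looked up in. Here a consonant simply follows every vowel and vice
 * versa; idfy uses per-character tables built from the input text. */
#include <stdio.h>

static const unsigned char id[32] = {
    77, 123, 89, 150, 111, 22, 208, 119, 251, 66, 25, 173, 241, 226,
    133, 64, 65, 37, 210, 167, 189, 118, 77, 140, 45, 165, 4, 51, 21,
    120, 14, 97
};

/* n-th bit of the ID, most significant bit first; returns 0 once the
 * data has run out (this zero padding is what the trailing number counts). */
static int id_bit(unsigned n)
{
    if (n >= 8 * sizeof id)
        return 0;
    return (id[n / 8] >> (7 - n % 8)) & 1;
}

int main(void)
{
    const char *vowels = "aeiou";             /* 5 choices, 3 bits each */
    const char *consonants = "bdfgklmnprst";  /* 12 choices, 4 bits each */
    unsigned bitpos = 0;
    int want_vowel = 0;

    while (bitpos < 8 * sizeof id) {
        unsigned bits = want_vowel ? 3 : 4, v = 0;
        for (unsigned i = 0; i < bits; i++)
            v = (v << 1) | id_bit(bitpos++);
        putchar(want_vowel ? vowels[v % 5] : consonants[v % 12]);
        want_vowel = !want_vowel;
    }
    /* Bits taken from the zero padding become the number at the end. */
    printf(" - %u\n", bitpos - 8 * (unsigned)sizeof id);
    return 0;
}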

Background

The idea is, of course, to pretend that the ID is actually a compressed piece of human-readable text; what the program does is "uncompress" it. Since we do not have a decoding table, we have to create one. Plain Huffman coding of pseudo-syllables was not chosen for the first version because it does not respect language rules: with Huffman coding there may be many consonants or vowels in a row, as can be seen in the examples for the second version, which uses a Huffman table that depends on the last selected character.

By doing statistics on texts in different languages, fixed tables can be derived. These tables could then be swapped depending on the user's language preference, producing words that are more pronounceable in their native language, perhaps even containing special characters. However, two users still always have to agree on which language to use when comparing fingerprints.

Feel free to comment on/correct me by sending me an email.

Last edited on Feb. 10 2014

© 2010-2021 Stefan Birgmeier
sbirgmeier@21er.org