The Unicode Standard

The Unicode Standard, Version 5.0, 2006. ISBN 0321480910. This review is of the Version 3.0 edition.

This is not a book that anyone will want to read.

Unicode is a project to assign code numbers to every character in every language in the world. In the early days of PCs there weren’t many code numbers, so every region used the numbers in its own particular way. You could type “wędzony łosoś” (smoked salmon) in Poland and see wędzony łosoś on your screen; but if someone looked at it on an English computer he’d see those same codes displayed as wêdzony ³oso¦ – not the same thing at all. This made it impossible to type anything that had certain mixtures of languages in it, and it made life very difficult for web browsers, which need to be able to handle anything anyone anywhere has written.
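For the curious, the mix-up is easy to reproduce. What follows is only a minimal sketch, assuming the Polish PC stored its text in ISO 8859-2 and the English PC read the same bytes as ISO 8859-1; the machines in the story may well have used other code pages, in which case individual garbled characters (the last one in particular) can come out differently from the example above.

```python
# Illustrative sketch: the same bytes read under two legacy code pages.
original = "wędzony łosoś"

# Assumption: the Polish machine stores text as ISO 8859-2 (Latin-2).
raw = original.encode("iso8859_2")

# Assumption: the English machine reads those bytes as ISO 8859-1 (Latin-1).
garbled = raw.decode("latin_1")
print(garbled)

# Nothing is lost, only misread: reversing the two steps restores the text.
restored = garbled.encode("latin_1").decode("iso8859_2")
print(restored)

# Unicode's remedy: one number per character, the same in every country.
code_points = [f"U+{ord(c):04X}" for c in "ęłś"]
print(code_points)
```

The point of the last line is that ę is U+0119 whether you are in Warsaw or Wolverhampton, so there is no longer a code page to guess.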

Unicode started with room for about 65,000 character codes, soon increased to just over a million. It sounds like the world’s most boring project, and the book of the project must therefore qualify as one of the world’s most boring books.

In many ways, it is.

I can’t imagine anyone reading through this book from beginning to end, even for a bet; and unlike most boring books it’s not even any use as an insomnia cure because it’s too heavy to hold up in bed.

And yet…

Looking at the book from a safe distance (as I had to, to make sure that Cardbox was supporting Unicode properly), its charm grows on you. Just as we can appreciate the story of someone overhearing one Oxford don saying to another, “and ninthly…” while at the same time being grateful that we didn’t have to listen all the way from “firstly” up to “eighthly”, so we can relish a paragraph that starts “In good Latvian typography…” while being grateful that we ourselves aren’t called upon to distinguish good Latvian typography from bad in any great detail. [It’s careful if ģ (small g-cedilla) has a comma on top of it; it’s careless if it has a cedilla underneath that gets tangled up in the bottom of the g.]

Indeed, all human life is there.

  • There is philosophy: Unicode encodes characters, not typefaces, so if a roman h and an italic h are just similar shapes of the same letter, what about Planck’s constant ℎ? Or a roman R, an italic R, and the mathematical ℜ, ℛ, ℝ?
  • There is politics: if Korean, Chinese and Japanese all use the same ideographs, how do we decide which country’s codes to use?
  • There is economics: if a language has never been typed or printed before, and Adobe spend a lot of money developing a computer character set for it, is it fair that no-one can print books in that language without paying them royalties?
  • There is religion: the Deseret script used for the Book of Mormon; and various pious Arabic ejaculations such as ﷺ (sallallahou alayhe wasallam).
  • There is culture: if George Bernard Shaw’s Shavian alphabet is already in Unicode, what about Tolkien’s Elvish script, and in that case what about Klingon? (the answers were No and No, respectively, but the floor of the committee room ran grey with blood).
  • There is high computing technology: in Sanskrit (and some modern Indian languages do much the same) you can only write two consecutive consonants by merging them into one elaborate filigree: very, very beautiful, but you can’t tell a computer “be beautiful”. Emerging from this with a little sanity left, the programmer discovers that in South Asian languages some vowels are written on both sides of the consonants they come after…
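Several of these puzzles can be poked at from a safe distance with Python's standard unicodedata module. This is only a rough sketch: the particular characters (a Devanagari conjunct and a Tamil vowel sign) are my own choice of examples, not ones taken from the book.

```python
import unicodedata

# Philosophy: these letter-like symbols have code points of their own, yet
# compatibility normalization (NFKC) folds each back to the plain letter.
folded = {}
for ch in "\u210e\u211c\u211b\u211d":  # the symbols shown as ℎ ℜ ℛ ℝ
    folded[ch] = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", folded[ch])

# High technology, part one: a Sanskrit conjunct is stored as three code
# points (consonant, virama, consonant); merging them into one filigree
# shape is left to the font and the rendering engine.
conjunct = "\u0915\u094d\u0937"
print(conjunct, [unicodedata.name(c) for c in conjunct])

# Part two: TAMIL VOWEL SIGN O is one code point, but it decomposes (NFD)
# into a piece drawn to the left of its consonant and a piece to the right.
two_part = unicodedata.normalize("NFD", "\u0bca")
print([unicodedata.name(c) for c in two_part])
```

Note that the standard only says which code points are present; whether the result looks beautiful is still entirely up to the font.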

This kind of beauty is best seen from a distance. It is a geological, sedimentary kind of beauty. Each grain of sand is about as uninteresting as each other grain of sand, and we pity the people who have to count them; but over the millennia they accumulate as rock, each one fixed forever where it fell, and in time we can see the lovely bands of the different layers in the sandstone.

Now that the focus has moved from living languages to long dead ones, I can imagine long-forgotten professors being unwrapped from their mummies in the British Museum and being asked to pronounce, once and for all, finally and definitively, on the transcription and representation of Akkadian cuneiform. Mediaeval Latin, Norse runes, Maya glyphs, one by one they are pondered, debated, included. The knowledge thus encoded is not forgotten but will remain for all time, the work of our generation and no other.

Here, to conclude, are some edited highlights:

The loveliest character sets are the South Asian ones. The book says that they get their flowing curves from being written with needles on palm leaves, but this must be wrong. One look shows that they must have been written with silver wire laid on burnished gold:


They also work quite well as cartoon characters. Here are some other characters to tease your scholarly friends with. You’ll have to have the right fonts installed on your computer, or you won’t see anything but blobs or question marks:

goat butting a fleeing downhill skier (Tamil)

scorpion balancing a fish on its back (Sinhala)

couple feeling comfortable together (Myanmar)

now she sits down, he stands on guard (Myanmar)

But then, I suppose our characters look funny to them as well.