On Unicode

Some interesting Unicode-related issues cropped up recently in the project I am working on, which led me to do a little research into what the fuss around Unicode was all about. While I had some understanding of what Unicode was, there were a few things I managed to learn anew. So, if you didn't know, here's the lowdown on the Unicode standard.

First, some basic facts

  • There are two parallel efforts aimed at standardizing the use of characters in computer programs! One is the ISO 10646 project, called the Universal Character Set (UCS), and the other is, of course, Unicode. Around 1991, however, participants from both projects fortunately realized that it would probably not be a good idea to have two competing standards for solving the same problem, and decided to keep their specifications compatible with each other.

  • The primary goal of the Unicode standard is the definition of a universal character set (!), i.e., a single character set to replace all the other character sets, one that can accommodate characters from all the languages spoken/written in the world.

  • It achieves this by assigning a unique number - called a code point - to each character. The Kannada letter "ka", for example, has been assigned the code point 3221 (0x0C95 in hex). What this means is that 3221 is forever the code for the Kannada letter "ka" all over the planet! Numbers like this are assigned to all characters in all languages (there's a tiny snippet after this list that prints this very code point).

  • Code points are always assigned from the range 0x000000 to 0x10FFFF, so you'd need at most 21 bits to represent one. Around 5% of this space (which works out to about 50,000 characters) is currently in use, another 5% is in preparation, about 13% is reserved for private use and about 2% is just reserved and not to be used for representing characters. The remaining 75% (around 835,000 code points) is open for future use!

  • Interestingly, effort is underway to assign code points to characters from imaginary languages as well! JRR Tolkien invented a whole slew of languages for his epic trilogy "The Lord of the Rings", each with its own grammar and script - languages spoken and written by elves, dwarves, hobbits and ents (large walking/talking trees!), including a language called "Black Speech" used by orcs and other such dark residents of Mordor!
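Since a code point is really just a number, no special machinery is needed to play with one. Here's a minimal C sketch (the variable name is mine, purely for illustration) that treats the Kannada "ka" from above as a plain integer:

    #include <stdio.h>

    int main(void)
    {
        /* The Kannada letter "ka" - code point 3221, or 0x0C95 in hex */
        unsigned int kannada_ka = 0x0C95;

        /* The Unicode convention is to write code points as U+XXXX */
        printf("U+%04X = %u\n", kannada_ka, kannada_ka); /* prints: U+0C95 = 3221 */
        return 0;
    }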

Some caveats

  • You might have heard that Unicode characters can be represented by 2-byte unsigned integers. Well, this is not entirely true. It is possible to represent all the Unicode characters in use today - which covers the most frequently used set of characters - using 2-byte unsigned integers, given that only around 50,000 characters have been assigned so far and an unsigned short can hold a maximum value of 65,535. However, code points can (and do) get created whose values are greater than the maximum that can be accommodated in an unsigned short. The most commonly used characters have been assigned numbers within the range 0x0000 to 0xFFFF - this range is called the Basic Multilingual Plane, or BMP.

  • The closest data type in C/C++ that can represent all possible Unicode code points is a 4-byte integer. But this also means that 11 bits get wasted for every character, given that any code point can be represented with just 21 bits. The size of the C/C++ wchar_t data type is compiler dependent and the standard says nothing about how big it must be (the sketch after this list illustrates both points).

  • Even with 2 bytes per character you'll immediately notice the waste when you're mostly dealing with characters belonging, for example, to the ASCII character set (all the ASCII code assignments have been retained in Unicode to ensure backward compatibility, by the way), since the second byte would always be zero for every such character.
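A quick sketch of the caveats above - this is just illustrative C, and the sizes printed will vary from compiler to compiler, which is exactly the point about wchar_t:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* The largest BMP code point fits in an unsigned short... */
        unsigned short bmp_max = 0xFFFF;

        /* ...but the largest Unicode code point does not, so a 4-byte
           integer is the closest fit: 21 bits used, 11 bits wasted. */
        unsigned int unicode_max = 0x10FFFF;

        /* Compiler dependent: commonly 2 bytes on Windows compilers
           and 4 bytes with gcc on Linux. */
        printf("sizeof(wchar_t)      = %u bytes\n", (unsigned)sizeof(wchar_t));
        printf("sizeof(unsigned int) = %u bytes\n", (unsigned)sizeof(unsigned int));
        printf("BMP max: U+%04X, Unicode max: U+%06X\n", bmp_max, unicode_max);
        return 0;
    }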

Encodings

  • To get around this problem some clever folks invented "encoding" schemes such as UTF-8 and UTF-16 that lay out how any Unicode code point from the entire spectrum can be represented as a sequence of bytes, with the most common characters taking the fewest bytes. UTF-8 in particular is quite popular as it automatically ensures backward compatibility with older documents: all existing ASCII documents are already valid UTF-8 files. Here's a nifty little table that specifies how Unicode code points are represented in the UTF-8 encoding scheme

    Code point range (hex)   UTF-8 byte sequence (binary)
    00000000 - 0000007F      0xxxxxxx
    00000080 - 000007FF      110xxxxx 10xxxxxx
    00000800 - 0000FFFF      1110xxxx 10xxxxxx 10xxxxxx
    00010000 - 001FFFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    00200000 - 03FFFFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    04000000 - 7FFFFFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    The first column specifies the range of code points and the second column is a bitwise representation of how they are encoded under UTF-8 (the x's get filled in with the bits of the code point). Code points up to 0x7F (the ASCII character set), for instance, are represented using a single byte. For code points greater than 0x7F at least 2 bytes are needed, and the number of contiguous bits set to 1 in the first byte, before a zero is encountered, indicates the number of bytes used to represent that code point. For example, 3 bytes are required for representing code points in the range 0x00000800 - 0x0000FFFF, and this is indicated by the fact that the 3 most significant bits of the first byte are set to 1, followed by a zero bit. (There's a small encoder sketch after this list that turns the table into code.)

  • UTF-8 is a "variable length encoding" scheme where each character in the document can correspond to a varying number of bytes. Finding the size of such a document post encoding - or the number of characters given the bytes - can be somewhat tricky, since you can no longer just multiply or divide by a fixed character size.
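To make the table concrete, here's a minimal sketch of a UTF-8 encoder in C - my own throwaway code, restricted to the real-world range 0x000000 - 0x10FFFF rather than the full 31-bit table above:

    #include <stdio.h>

    /* Encode a single code point into buf (at least 4 bytes big).
       Returns the number of bytes written, or 0 if the code point
       lies outside the 0x000000 - 0x10FFFF range. */
    int utf8_encode(unsigned int cp, unsigned char *buf)
    {
        if (cp <= 0x7F) {                 /* 0xxxxxxx */
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp <= 0x7FF) {         /* 110xxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp <= 0xFFFF) {        /* 1110xxxx 10xxxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp <= 0x10FFFF) {      /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;
    }

    int main(void)
    {
        unsigned char buf[4];
        int i, n;

        /* The Kannada "ka" (U+0C95) falls in the 3-byte row of the table */
        n = utf8_encode(0x0C95, buf);
        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);      /* prints: E0 B2 95 */
        printf("\n");
        return 0;
    }

Note how the character count and the byte count part ways here: one character, three bytes. To count characters in a UTF-8 buffer you'd count only the bytes that don't match the 10xxxxxx pattern (the continuation bytes).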

Most of this information has been taken from the following great resources on this topic.

http://www.cl.cam.ac.uk/~mgk25/unicode.html
This is an FAQ on what it takes to support Unicode on Linux and has a lot of information on Unicode and UCS in general.

http://www-128.ibm.com/developerworks/library/codepages.html
Talks about various character sets. Good introduction to Unicode.

http://icu.sourceforge.net/docs/papers/unicode_wchar_t.html
Talks about issues relating to size of the C/C++ wchar_t data-type.
