This post is a brief technical overview of Unicode, a widely used standard for multilingual character representation, and the family of UTF-x encoding algorithms. First, a short introduction to Unicode:
Unicode is intended to address the need for a workable, reliable world text encoding.
Unicode could be roughly described as “wide-body ASCII” that has been stretched to 16 bits to encompass the characters of all the world’s living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.
Character Representation: Code Points and Planes
The number that identifies a specific character is called a code-point. ASCII, for example, uses 7 bits per character, which allows for 2^7 = 128 different characters (code-points); 8-bit extensions such as Latin-1 allow for 2^8 = 256.
Unicode divides its code space into 17 planes of 2^16 = 65,536 code-points each: the low 16 bits (2 bytes) of a code-point give its position within a plane, and the remaining high bits give the plane number. This yields a maximum of 2^16 * 17 = 1,114,112 unique code-points, ranging from U+0000 to U+10FFFF.
Currently only 6 of the 17 available planes are used:
| Plane | Code-point range | Description |
| --- | --- | --- |
| 0 | U+0000 … U+FFFF | Basic Multilingual Plane |
| 1 | U+10000 … U+1FFFF | Supplementary Multilingual Plane |
| 2 | U+20000 … U+2FFFF | Supplementary Ideographic Plane |
| 14 | U+E0000 … U+EFFFF | Supplementary Special-purpose Plane |
| 15-16 | U+F0000 … U+10FFFF | Private Use Area |
Code-points of the first plane fit into two bytes; those of all other planes need a third byte for the plane number, which appears as the hex digits in front of the last four in the table above (for example the E in U+E0000).
Code-points U+0000 to U+00FF (0-255) are identical to the Latin-1 values, so converting between them simply requires converting code-points to byte values. In fact, any document containing only the 128 code-points of the ASCII character set (0-127) is a perfectly valid UTF-8 encoded Unicode document.
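As a rough illustration of how a code-point relates to its plane, here is a small sketch in Python 3 (the interpreter sessions elsewhere in this post are Python 2). The function name `plane_of` is made up for this example; it simply drops the low 16 bits of the code-point.

```python
def plane_of(code_point):
    """Return the plane number (0..16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    # Each plane holds 2**16 code points, so the plane number is
    # whatever remains after dropping the low 16 bits.
    return code_point >> 16


for ch in ("a", "\u20ac", "\U0001F010"):
    print("U+%04X is on plane %d" % (ord(ch), plane_of(ord(ch))))
# U+0061 is on plane 0
# U+20AC is on plane 0
# U+1F010 is on plane 1
```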
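The compatibility claims above can be checked with a few lines of Python 3 (again, not the Python 2 used in the sessions of this post). This is only a quick sanity check using the standard codecs; the variable names are arbitrary.

```python
# Code points 0..255 map one-to-one onto Latin-1 byte values.
latin1_bytes = bytes(range(256))
latin1_text = latin1_bytes.decode("latin-1")
latin1_matches = [ord(ch) for ch in latin1_text] == list(range(256))

# Pure ASCII bytes (0..127) decode to the same text as ASCII and as UTF-8.
ascii_bytes = bytes(range(128))
ascii_matches = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Above 127 the two encodings differ: Latin-1 stays at one byte,
# UTF-8 switches to a multi-byte sequence.
e_acute = "\u00e9"
e_latin1 = e_acute.encode("latin-1")  # b'\xe9'
e_utf8 = e_acute.encode("utf-8")      # b'\xc3\xa9'

print(latin1_matches, ascii_matches)  # True True
```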
Character Encoding: UTF-8, 16 and 32
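A code-point is only an abstract number; before it can be stored as bytes, an encoding has to be chosen:
>>> u = u"€"
>>> u
u'\u20ac'
>>> bytearray(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unicode argument without an encoding
This is where Unicode Transformation Formats (UTF) come into play. UTF-8, UTF-16 and UTF-32 each turn a sequence of Unicode code-points into bytes: UTF-8 uses a variable number of 8-bit units, UTF-16 one or two 16-bit units, and UTF-32 exactly one 32-bit unit per code-point.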
UTF-8
UTF-8 is a variable-width encoding, with each Unicode character represented by one to four bytes. A main advantage of UTF-8 is backward compatibility with the ASCII charset, allowing us to use the same decoding function for both ASCII text and UTF-8 encoded Unicode text.
If the character is encoded into just one byte, the high-order bit is 0 and the other bits represent the code point (in the range 0..127). If the character is encoded into a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern. The remaining bits in the byte sequence are concatenated to form the Unicode code point value.
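The bit layout described above can be sketched as a hand-written encoder. This is a minimal Python 3 illustration, not how any real codec is implemented; the name `utf8_encode` is invented for this example, and unlike Python's built-in codec it does not reject the surrogate range U+D800 to U+DFFF.

```python
def utf8_encode(code_point):
    """Encode one code point into UTF-8 bytes by hand (illustrative only)."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


# The euro sign U+20AC needs three bytes, matching the built-in codec.
print(utf8_encode(0x20AC) == "\u20ac".encode("utf-8"))  # True
```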
UTF-16
UTF-16 uses two bytes (one 16-bit unit) for each code-point of the “Basic Multilingual Plane” (U+0000 to U+FFFF). Code-points of the other planes do not fit into 16 bits, so UTF-16 converts each of them into two 16-bit units, called a surrogate pair. The older UCS-2 encoding has no surrogate pairs and is therefore limited to the Basic Multilingual Plane.
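How a code-point outside the Basic Multilingual Plane becomes a surrogate pair can be sketched in a few lines of Python 3. The helper name `utf16_code_units` is hypothetical; the arithmetic (subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits) is the standard UTF-16 rule.

```python
def utf16_code_units(code_point):
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]            # BMP: a single 16-bit unit
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 | (offset >> 10)     # top 10 bits    -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


print([hex(u) for u in utf16_code_units(0x20AC)])   # ['0x20ac']
print([hex(u) for u in utf16_code_units(0x1F010)])  # ['0xd83c', '0xdc10']
```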
UTF-32
UTF-32 always uses exactly four bytes for encoding each Unicode code-point. If no endianness is specified, a byte order mark (BOM) is prepended to the output, as the examples below show.
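A short Python 3 check of the fixed width, using the standard `utf-32` codecs. The sample string is arbitrary and chosen to mix code-points from plane 0 and plane 1.

```python
text = "a\u20ac\U0001F010"  # three code points: planes 0, 0 and 1

be = text.encode("utf-32-be")     # explicit byte order: no BOM
with_bom = text.encode("utf-32")  # no byte order given: a BOM is prepended

print(len(be))        # 12, four bytes per code point
print(len(with_bom))  # 16, four extra bytes for the BOM

# Each 4-byte group is simply the code point as a big-endian integer.
code_points = [int.from_bytes(be[i:i + 4], "big") for i in range(0, len(be), 4)]
print([hex(cp) for cp in code_points])  # ['0x61', '0x20ac', '0x1f010']
```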
Summary
- UTF-8 can encode any code-point of any plane, and compresses lower code-points into fewer bytes (e.g. the ASCII charset into 1 byte). UTF-8 furthermore encodes the 128 code-points of the ASCII character set exactly as ASCII does. Recommended for everything related to text.
- UTF-16 stores 16-bit units without compression. A code-point of a higher plane than 0 does not fit into a single unit, so UTF-16 needs two 16-bit units (a surrogate pair) to represent it. The euro sign € in the examples below is on plane 0 and therefore takes a single unit.
- UTF-32 encodes all Unicode code-points, but always stores 32-bit units with no compression.
Examples
>>> u = u"a"
>>> u
u'a'
>>> repr(u.encode("utf-8"))
"'a'"
>>> repr(u.encode("utf-16")) # no endianness specified
"'\\xff\\xfea\\x00'"
>>> repr(u.encode("utf-16-le")) # little endian byte order
"'a\\x00'"
>>> repr(u.encode("utf-16-be")) # big endian byte order
"'\\x00a'"
>>> repr(u.encode("utf-32"))
"'\\xff\\xfe\\x00\\x00a\\x00\\x00\\x00'"
>>> repr(u.encode("utf-32-le"))
"'a\\x00\\x00\\x00'"
>>> repr(u.encode("utf-32-be"))
"'\\x00\\x00\\x00a'"
>>> u = u"€"
>>> u
u'\u20ac'
>>> repr(u.encode("utf-8"))
"'\\xe2\\x82\\xac'"
>>> repr(u.encode("utf-16"))
"'\\xff\\xfe\\xac '"
>>> repr(u.encode("utf-16-le"))
"'\\xac '"
>>> repr(u.encode("utf-16-be"))
"' \\xac'"
>>> repr(u.encode("utf-32"))
"'\\xff\\xfe\\x00\\x00\\xac \\x00\\x00'"
>>> repr(u.encode("utf-32-le"))
"'\\xac \\x00\\x00'"
>>> repr(u.encode("utf-32-be"))
"'\\x00\\x00 \\xac'"
Feedback
Please leave a comment if you have feedback or questions!





December 29th, 2010 at 2:35 pm
Posting again because your blog software fails at Unicode.
The introduction you give is massively outdated to the point of being wrong. The basic idea of Unicode since at least version 2 is to associate characters with codepoints, and we are at version 6 now. It’s really just a giant table: A is U+0041, ‽ is U+203D, xxx is U+1F010, xxxx is U+E0059 and so on. We need not care about planes at all, and especially we need not care about blue and green numbers. SRSLY. This is not a nitpicking, but actually making the explanation understandable. The latter part about encodings also has a weak spot. UTF-16 is not limited to the BMP, you are thinking of UCS-2, which is also deprecated since 15 years. The rest is okay.
What I am missing in your article:
Firstly, no one uses UTF-32. UTF-7 is not mentioned at all. Only Microsoft uses UTF-16, and that’s also only an internal encoding. UTF-8 is the de facto standard encoding for Unix and on the Internet, especially the Web, XML based services, and somewhat popular in e-mail.
Secondly, not only *how* to use this, but *why*. Basically every programmer must know about the need to *decode* data coming from external sources into the program/programming language’s internal representation of data/strings, and when writing data to external sources, it needs to be *encoded*. Programmers must acquire these habits. is the canonical collection of documents that expound on this topic.
December 29th, 2010 at 4:19 pm
Thanks for sharing your thoughts, I appreciate your feedback and the various points of your comment. Regarding the blog software used here, it’s WordPress, and I thought it was Unicode compatible — sorry for the inconvenience.
I disagree that this introduction is outdated, and do not see where it would be wrong.
Even with Unicode version 6 each code point is associated with a plane. I think this is important to understand why some Unicode characters require only 2 bytes (e.g. U+0041, because they are on plane 0), whereas other characters require 3 bytes (e.g. U+E0059, which indicates it is on plane 0E = 14).
Re UTF-16 and code-points from planes other than the BMP, I did explain it in the post: “Unicode code-points of other planes use 3 bytes and UTF-16 converts these into two 16-bit pairs, called a surrogate pair”.