This HOWTO discusses Python support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode.
Introduction to Unicode
History of Character Codes
In 1968, the American Standard Code for Information Interchange, better known by its acronym ASCII, was standardized. ASCII defined numeric codes for various characters, with the numeric values running from 0 to 127. For example, the lowercase letter ‘a’ is assigned 97 as its code value.
ASCII was an American-developed standard, so it only defined unaccented characters. There was an ‘e’, but no ‘é’ or ‘Í’. This meant that languages which required accented characters couldn’t be faithfully represented in ASCII. (Actually the missing accents matter for English, too, which contains words such as ‘naïve’ and ‘café’, and some publications have house styles which require spellings such as ‘coöperate’.)
For a while people just wrote programs that didn’t display accents. In the mid-1980s an Apple II BASIC program written by a French speaker might have lines like these:
Those messages should contain accents (completé, caractère, accepté), and they just look wrong to someone who can read French.
In the 1980s, almost all personal computers were 8-bit, meaning that bytes could hold values ranging from 0 to 255. ASCII codes only went up to 127, so some machines assigned values between 128 and 255 to accented characters. Different machines had different codes, however, which led to problems exchanging files. Eventually various commonly used sets of values for the 128–255 range emerged. Some were true standards, defined by the International Organization for Standardization (ISO), and some were de facto conventions that were invented by one company or another and managed to catch on.
256 characters aren’t very many. For example, you can’t fit both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian into the 128–255 range because there are more than 128 such characters.
You could write files using different codes (all your Russian files in a coding system called KOI8, all your French files in a different coding system called Latin1), but what if you wanted to write a French document that quotes some Russian text? In the 1980s people began to want to solve this problem, and the Unicode standardization effort began.
Unicode started out using 16-bit characters instead of 8-bit characters. 16 bits means you have 2^16 = 65,536 distinct values available, making it possible to represent many different characters from many different alphabets; an initial goal was to have Unicode contain the alphabets for every single human language. It turns out that even 16 bits isn’t enough to meet that goal, and the modern Unicode specification uses a wider range of codes, 0 through 1,114,111 (0x10FFFF in base 16).
There’s a related ISO standard, ISO 10646. Unicode and ISO 10646 were originally separate efforts, but the specifications were merged with the 1.1 revision of Unicode.
(This discussion of Unicode’s history is highly simplified. The precise historical details aren’t necessary for understanding how to use Unicode effectively, but if you’re curious, consult the Unicode consortium site listed in the References or the Wikipedia entry for Unicode for more information.)
A character is the smallest possible component of a text. ‘A’, ‘B’, ‘C’, etc., are all different characters. So are ‘È’ and ‘Í’. Characters are abstractions, and vary depending on the language or context you’re talking about. For example, the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet (they may even be the same in some fonts), but these are two different characters that have different meanings.
The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12CA’. U+12CA is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
A character is represented on a screen or on paper by a set of graphical elements that’s called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn’t need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal’s font renderer.
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented as a set of bytes (meaning, values from 0 through 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
The first encoding you might think of is an array of 32-bit integers. In this representation, the string “Python” would look like this:
This representation is straightforward but using it presents a number of problems.
- It’s not portable; different processors order the bytes differently.
- It’s very wasteful of space. In most texts, the majority of the code points are less than 128, or less than 256, so a lot of space is occupied by 0x00 bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn’t matter too much (desktop computers have gigabytes of RAM, and strings aren’t usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.
- It’s not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.
- Many Internet standards are defined in terms of textual data, and can’t handle content with embedded zero bytes.
Generally people don’t use this encoding, instead choosing other encodings that are more efficient and convenient. UTF-8 is probably the most commonly supported encoding; it will be discussed below.
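As a rough illustration of the 32-bit representation discussed above, the sketch below encodes “Python” with the utf-32-le codec and compares its size with the ASCII form. Little-endian order is chosen here only to make the output deterministic; the variable names are illustrative.

```python
text = "Python"

# Fixed-width 32-bit encoding, little-endian, no byte-order mark.
utf32 = text.encode("utf-32-le")
ascii_bytes = text.encode("ascii")

print(len(utf32), len(ascii_bytes))   # 24 bytes versus 6 bytes
print(utf32[:8])                      # b'P\x00\x00\x00y\x00\x00\x00'
print(utf32.count(0))                 # 18 of the 24 bytes are zero
```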
Encodings don’t have to handle every possible Unicode character, and most encodings don’t. The rules for converting a Unicode string into the ASCII encoding, for example, are simple; for each code point:
- If the code point is < 128, each byte is the same as the value of the code point.
- If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
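The two rules above can be sketched as a small function. This is only an illustration of the rules, not how Python’s codec is implemented; encode_ascii is a hypothetical helper name.

```python
def encode_ascii(text):
    """Encode text following the two ASCII rules described above."""
    out = bytearray()
    for index, ch in enumerate(text):
        code_point = ord(ch)
        if code_point < 128:
            out.append(code_point)   # the byte value equals the code point
        else:
            # Mirror Python's behaviour by raising UnicodeEncodeError.
            raise UnicodeEncodeError(
                "ascii", text, index, index + 1, "ordinal not in range(128)")
    return bytes(out)

print(encode_ascii("abc"))           # b'abc'

try:
    "caf\u00e9".encode("ascii")      # the built-in codec refuses as well
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```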
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points 0–255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can’t be encoded into Latin-1.
Encodings don’t have to be simple one-to-one mappings like Latin-1. Consider IBM’s EBCDIC, which was used on IBM mainframes. Letter values weren’t in one block: ‘a’ through ‘i’ had values from 129 to 137, but ‘j’ through ‘r’ were 145 through 153. If you wanted to use EBCDIC as an encoding, you’d probably use some sort of lookup table to perform the conversion, but this is largely an internal detail.
UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.) UTF-8 uses the following rules:
- If the code point is < 128, it’s represented by the corresponding byte value.
- If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
- It can handle any Unicode code point.
- A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
- A string of ASCII text is also valid UTF-8 text.
- UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.
- If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
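A few of these properties can be checked directly. The sample characters below are arbitrary choices, one for each possible UTF-8 sequence length.

```python
# One sample character for each UTF-8 sequence length.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)                              # [1, 2, 3, 4]

# ASCII text is already valid UTF-8.
print("Python".encode("utf-8") == b"Python")

# Every byte of a multi-byte sequence is in the range 128-255.
print(all(b >= 128 for b in "\u20ac".encode("utf-8")))
```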
The Unicode Consortium site has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. A chronology of the origin and development of Unicode is also available on the site.
To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables.
Another good introductory article was written by Joel Spolsky. If this introduction didn’t make things clear to you, you should try reading this alternate article before continuing.
Wikipedia entries are often helpful; see the entries for “character encoding” and UTF-8, for example.
Python’s Unicode Support
Now that you’ve learned the rudiments of Unicode, we can look at Python’s Unicode features.
The String Type
Since Python 3.0, the language features a str type that contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal:
You can use a different encoding from UTF-8 by putting a specially-formatted comment as the first or second line of the source code:
Side note: Python 3 also supports using Unicode characters in identifiers:
If you can’t enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals. (Depending on your system, you may see the actual capital-delta glyph instead of a \u escape.)
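For instance, the capital delta mentioned above can be written in three equivalent ways; the variable names are illustrative.

```python
by_name = "\N{GREEK CAPITAL LETTER DELTA}"   # using the character name
by_u16 = "\u0394"                            # using a 16-bit hex value
by_u32 = "\U00000394"                        # using a 32-bit hex value

print(by_name == by_u16 == by_u32)           # True
print(ascii(by_name))                        # '\u0394'
```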
In addition, one can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument.
The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), or 'ignore' (just leave the character out of the Unicode result). The following examples show the differences:
(In this code example, the Unicode replacement character has been replaced by a question mark because it may not be displayed on some systems.)
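A minimal sketch of the three error handlers, using a byte string that starts with a byte that is invalid in UTF-8; the variable names are illustrative.

```python
data = b"\x80abc"   # 0x80 cannot start a UTF-8 sequence

try:
    data.decode("utf-8", "strict")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

replaced = data.decode("utf-8", "replace")
ignored = data.decode("utf-8", "ignore")

print(ascii(replaced))   # '\ufffdabc'
print(ascii(ignored))    # 'abc'
```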
Encodings are specified as strings containing the encoding’s name. Python 3.2 comes with roughly 100 different encodings; see the Python Library Reference at Standard Encodings for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.
One-character Unicode strings can also be created with the built-in chr() function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value:
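A short round trip between the two functions; the code point 57344 is an arbitrary choice for illustration.

```python
ch = chr(57344)          # integer -> one-character string
print(ascii(ch))         # '\ue000'
print(ord(ch))           # 57344
print(len(ch))           # 1
```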
Converting to Bytes
The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.
The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. As well as 'strict', 'ignore', and 'replace' (which in this case inserts a question mark instead of the unencodable character), there is also 'xmlcharrefreplace' (inserts an XML character reference) and 'backslashreplace' (inserts a \uNNNN escape sequence).
The following example shows the different results:
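A sketch of how each handler treats a string containing two characters that ASCII cannot represent; the sample string is an arbitrary choice.

```python
u = chr(40960) + "abcd" + chr(1972)   # U+A000 and U+07B4 are not ASCII

print(u.encode("utf-8"))                       # b'\xea\x80\x80abcd\xde\xb4'

try:
    u.encode("ascii")                          # 'strict' is the default
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

print(u.encode("ascii", "ignore"))             # b'abcd'
print(u.encode("ascii", "replace"))            # b'?abcd?'
print(u.encode("ascii", "xmlcharrefreplace"))  # b'&#40960;abcd&#1972;'
print(u.encode("ascii", "backslashreplace"))   # b'\\ua000abcd\\u07b4'
```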
The low-level routines for registering and accessing the available encodings are found in the codecs module. Implementing new encodings also requires understanding the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, and writing new encodings is a specialized task, so the module won’t be covered in this HOWTO.
Unicode Literals in Python Source Code
In Python source code, specific Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four:
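A small example mixing the escape forms in one literal; the particular code points are arbitrary.

```python
# \xac is a two-digit escape, \u takes four digits, \U takes eight.
s = "a\xac\u1234\u20ac\U00008000"

code_points = [ord(c) for c in s]
print(code_points)   # [97, 172, 4660, 8364, 32768]
```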
Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you’re using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the chr() built-in function, but this is even more tedious.
Ideally, you’d want to be able to write literals in your language’s natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.
Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:
The syntax is inspired by Emacs’s notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports ‘coding’. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.
If you don’t include such a comment, the default encoding used will be UTF-8 as already mentioned. See also PEP 263 for more information.
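To see the declaration at work without creating a file, the sketch below compiles source bytes the way they would appear in a Latin-1 encoded file. Passing bytes to compile() is just a convenient stand-in for a real source file; the names source, namespace and greeting are illustrative.

```python
# Source bytes as they would appear in a Latin-1 encoded file;
# the byte 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
source = b'# -*- coding: latin-1 -*-\ngreeting = "caf\xe9"\n'

namespace = {}
exec(compile(source, "<latin-1 demo>", "exec"), namespace)

print(namespace["greeting"])   # café
```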
The Unicode specification includes a database of information about code points. For each defined code point, the information includes the character’s name, its category, and the numeric value if applicable (Unicode has characters representing the Roman numerals and fractions such as one-third and four-fifths). There are also properties related to the code point’s use in bidirectional text and other display-related properties.
The following program displays some information about several characters, and prints the numeric value of one particular character:
When run, this prints:
The category codes are abbreviations describing the nature of the character. These are grouped into categories such as “Letter”, “Number”, “Punctuation”, or “Symbol”, which in turn are broken up into subcategories. To take the codes from the above output, 'Ll' means “Letter, lowercase”, 'No' means “Number, other”, 'Mn' is “Mark, nonspacing”, and 'So' is “Symbol, other”. See the General Category Values section of the Unicode Character Database documentation for a list of category codes.
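A sketch of such a program using the unicodedata module. The five sample characters are chosen for illustration, and the exact output depends on the version of the Unicode database bundled with your Python.

```python
import unicodedata

u = chr(233) + chr(0x0BF2) + chr(3972) + chr(6000) + chr(13231)

# Collect (code point, category, name) for every character.
rows = [(f"{ord(c):04x}", unicodedata.category(c), unicodedata.name(c))
        for c in u]
for i, row in enumerate(rows):
    print(i, *row)

# Numeric value of the second character (TAMIL NUMBER ONE THOUSAND).
value = unicodedata.numeric(u[1])
print(value)   # 1000.0
```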
Unicode Regular Expressions
The regular expressions supported by the re module can be provided either as bytes or strings. Some of the special character sequences such as \d and \w have different meanings depending on whether the pattern is supplied as bytes or a string. For example, \d will match the characters [0-9] in bytes but in strings will match any character that’s in the 'Nd' category.
The string in this example has the number 57 written in both Thai and Arabic numerals:
When executed, \d+ will match the Thai numerals and print them out. If you supply the re.ASCII flag to compile(), \d+ will match the substring “57” instead.
Similarly, \w matches a wide variety of Unicode characters but only [a-zA-Z0-9_] in bytes or if re.ASCII is supplied, and \s will match either Unicode whitespace characters or [ \t\n\r\f\v].
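A minimal sketch of the difference, using a string that contains 57 in Thai digits followed by the same number in ASCII digits; the variable names are illustrative.

```python
import re

s = "Over \u0e55\u0e57 57 flavours"

# Default (Unicode) matching: \d covers every 'Nd' character.
unicode_match = re.compile(r"\d+").search(s).group()

# With re.ASCII, \d is restricted to [0-9].
ascii_match = re.compile(r"\d+", re.ASCII).search(s).group()

print(ascii(unicode_match))   # '\u0e55\u0e57'
print(ascii_match)            # 57
```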
Reading and Writing Unicode Data
Once you’ve written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?
It’s possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It’s possible to do all the work yourself: open a file, read an 8-bit bytes object from it, and convert the bytes with bytes.decode(encoding). However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM. (More, really, since for at least a moment you’d need to have both the encoded string and its Unicode version in memory.)
The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the built-in open() function can return a file-like object that assumes the file’s contents are in a specified encoding and accepts Unicode parameters for methods such as read() and write(). This works through open()’s encoding and errors parameters which are interpreted just like those in str.encode() and bytes.decode().
Reading Unicode from a file is therefore simple:
It’s also possible to open files in update mode, allowing both reading and writing:
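A self-contained sketch of both cases. It writes to a temporary directory so that it can run anywhere; the file name and contents are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unicode.txt")

    # Writing: str data is encoded to UTF-8 on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\u4500 blah blah blah\n")

    # Reading: bytes are decoded to str on the way in.
    with open(path, encoding="utf-8") as f:
        first_line = f.readline()

    # Update mode: append a line, then rewind and read everything.
    with open(path, "r+", encoding="utf-8") as f:
        f.seek(0, os.SEEK_END)
        f.write("second line\n")
        f.seek(0)
        contents = f.read()

print(ascii(first_line))   # '\u4500 blah blah blah\n'
print(ascii(contents))
```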
The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file’s byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
In some areas, it is also convention to use a “BOM” at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. Use the ‘utf-8-sig’ codec to automatically skip the mark if present for reading such files.
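The BOM handling described above can be observed on in-memory bytes. The sketch uses the BOM constants from the codecs module; the sample text is arbitrary.

```python
import codecs

# UTF-8 data that starts with the "BOM" signature.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")
plain = data.decode("utf-8")        # the mark survives as U+FEFF
skipped = data.decode("utf-8-sig")  # the mark is dropped

print(ascii(plain))                 # '\ufeffhello'
print(skipped)                      # hello

# The generic utf-16 codec writes a BOM and removes it when decoding.
utf16 = "hi".encode("utf-16")
has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, utf16.decode("utf-16"))   # True hi

# The explicit-endian variants write no BOM at all.
print("hi".encode("utf-16-le"))     # b'h\x00i\x00'
```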
Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. For example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on Windows, Python uses the name “mbcs” to refer to whatever the currently configured encoding is. On Unix systems, there will only be a filesystem encoding if you’ve set the LANG or LC_CTYPE environment variables; if you haven’t, the default encoding is UTF-8.
The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there’s not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you:
Functions in the os module such as os.stat() will also accept Unicode filenames.
The os.listdir() function returns filenames and raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem’s encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes. For example, assuming the default filesystem encoding is UTF-8, running the following program:
will produce the following output:
The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.
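A sketch of such a program that runs inside a temporary directory. The file name is illustrative, and the bytes shown in the comment assume a UTF-8 filesystem encoding.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    fn = "filename\u4500abc"
    with open(os.path.join(tmp, fn), "w") as f:
        f.write("blah\n")

    bytes_names = os.listdir(os.fsencode(tmp))   # bytes path -> bytes names
    str_names = os.listdir(tmp)                  # str path -> str names

print(bytes_names)   # e.g. [b'filename\xe4\x94\x80abc'] with UTF-8
print(str_names)     # ['filename䔀abc']
```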
Note that on most occasions, the Unicode APIs should be used. The bytes APIs should only be used on systems where undecodable file names can be present, i.e. Unix systems.
Tips for Writing Unicode-aware Programs
This section provides some suggestions on writing software that deals with Unicode.
The most important tip is:
Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do e.g. str + bytes, a TypeError will be raised.
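A short demonstration of the failure and of the explicit decode that avoids it; the variable names are illustrative.

```python
text = "caf\u00e9"
data = text.encode("utf-8")

try:
    text + data                      # str + bytes is never implicit
except TypeError as exc:
    print("mixing str and bytes fails:", exc)

# Decode first, then combine str with str.
combined = text + data.decode("utf-8")
print(combined)                      # cafécafé
```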
When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you’re doing this, be careful to check the decoded string, not the encoded bytes data; some encodings may have interesting properties, such as not being bijective or not being fully ASCII-compatible. This is especially true if the input data also specifies the encoding, since the attacker can then choose a clever way to hide malicious text in the encoded bytestream.
Converting Between File Encodings
The StreamRecoder class can transparently convert between encodings, taking a stream that returns data in encoding #1 and behaving like a stream returning data in encoding #2.
For example, if you have an input file f that’s in Latin-1, you can wrap it with a StreamRecoder to return bytes encoded in UTF-8:
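A runnable sketch of that wrapping. An in-memory io.BytesIO object stands in for the Latin-1 input file here, which is an assumption made only to keep the example self-contained.

```python
import codecs
import io

# Stand-in for a file opened in binary mode that holds Latin-1 data.
f = io.BytesIO("caf\u00e9".encode("latin-1"))

new_f = codecs.StreamRecoder(
    f,
    # en/decoder: produce and accept UTF-8 on the caller's side
    codecs.getencoder("utf-8"), codecs.getdecoder("utf-8"),
    # reader/writer: interpret the underlying stream as Latin-1
    codecs.getreader("latin-1"), codecs.getwriter("latin-1"))

utf8_bytes = new_f.read()
print(utf8_bytes)   # b'caf\xc3\xa9'
```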
Files in an Unknown Encoding
What can you do if you need to make a change to a file, but don’t know the file’s encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:
The surrogateescape error handler will decode any non-ASCII bytes as lone surrogate code points ranging from U+DC80 to U+DCFF. These code points will then be turned back into the same bytes when the surrogateescape error handler is used when encoding the data and writing it back out.
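A sketch of the round trip on a temporary file. The file contents are invented for illustration, and newline="" is passed only so that line endings are left untouched on every platform.

```python
import os
import tempfile

original = b"# caf\xe9 settings\nname=old\n"   # 0xE9 is in an unknown encoding

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "settings.txt")
    with open(path, "wb") as f:
        f.write(original)

    # Undecodable bytes become lone surrogates instead of raising.
    with open(path, encoding="ascii", errors="surrogateescape", newline="") as f:
        data = f.read()

    # Modify only the ASCII part, then write the text back out.
    data = data.replace("name=old", "name=new")
    with open(path, "w", encoding="ascii", errors="surrogateescape", newline="") as f:
        f.write(data)

    with open(path, "rb") as f:
        result = f.read()

print(ascii(data))   # the 0xE9 byte shows up as '\udce9'
print(result)        # b'# caf\xe9 settings\nname=new\n'
```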
The initial draft of this document was written by Andrew Kuchling. It has since been revised further by Alexander Belopolsky, Georg Brandl, Andrew Kuchling, and Ezio Melotti.
Thanks to the following people who have noted errors or offered suggestions on this article: Éric Araujo, Nicholas Bastin, Nick Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.
What every programmer absolutely, positively needs to know about encodings and character sets to work with text
If you are dealing with text in a computer, you need to know about encodings. Period. Yes, even if you are just sending emails. Even if you are just receiving emails. You don't need to understand every last detail, but you must at least know what this whole "encoding" thing is about. And the good news first: while the topic can get messy and confusing, the basic idea is really, really simple.
This article is about encodings and character sets. An article by Joel Spolsky entitled The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a nice introduction to the topic and I greatly enjoy reading it every once in a while. I hesitate to refer people to it who have trouble understanding encoding problems though since, while entertaining, it is pretty light on actual technical details. I hope this article can shed some more light on what exactly an encoding is and just why all your text screws up when you least need it. This article is aimed at developers (with a focus on PHP), but any computer user should be able to benefit from it.
Getting the basics straight
Everybody is aware of this at some level, but somehow this knowledge seems to suddenly disappear in a discussion about text, so let's get it out first: A computer cannot store "letters", "numbers", "pictures" or anything else. The only thing it can store and work with are bits. A bit can only have two values: yes or no, true or false, 1 or 0 or whatever else you want to call these two values. Since a computer works with electricity, an "actual" bit is a blip of electricity that either is or isn't there. For humans, this is usually represented using 1 and 0 and I'll stick with this convention throughout this article.
To use bits to represent anything at all besides bits, we need rules. We need to convert a sequence of bits into something like letters, numbers and pictures using an encoding scheme, or encoding for short. Like this:
In this encoding, 01100010 stands for the letter "b", 01101001 for the letter "i", 01110100 stands for "t" and 01110011 for "s". A certain sequence of bits stands for a letter and a letter stands for a certain sequence of bits. If you can keep this in your head for 26 letters or are really fast with looking stuff up in a table, you could read bits like a book.
The above encoding scheme happens to be ASCII. A string of 1s and 0s is broken down into parts of eight bits each (a byte for short). The ASCII encoding specifies a table translating bytes into human readable letters. Here's a short excerpt of that table:
There are 95 human readable characters specified in the ASCII table, including the letters A through Z both in upper and lower case, the numbers 0 through 9, a handful of punctuation marks and characters like the dollar symbol, the ampersand and a few others. It also includes 33 values for things like space, line feed, tab, backspace and so on. These are not printable per se, but still visible in some form and useful to humans directly. A number of values are only useful to a computer, like codes to signify the start or end of a text. In total there are 128 characters defined in the ASCII encoding, which is a nice round number (for people dealing with computers), since it uses all possible combinations of 7 bits (0000000, 0000001, 0000010 through 1111111).1
And there you have it, the way to represent human-readable text using only 1s and 0s.
To encode something in ASCII, follow the table from right to left, substituting letters for bits. To decode a string of bits into human readable characters, follow the table from left to right, substituting bits for letters.
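The table lookup in both directions can be sketched in a few lines of Python, letting the built-in ascii codec play the role of the table. The helper names to_bits and from_bits are made up for this illustration.

```python
def to_bits(text):
    """Encode: replace each ASCII character with its 8-bit pattern."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))

def from_bits(bits):
    """Decode: replace each 8-bit pattern with its ASCII character."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")

encoded = to_bits("bits")
print(encoded)              # 01100010 01101001 01110100 01110011
print(from_bits(encoded))   # bits
```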
encode (verb, with object): convert into a coded form
code (noun): a system of words, letters, figures, or other symbols substituted for other words, letters, etc.
To encode means to use something to represent something else. An encoding is the set of rules with which to convert something from one representation to another.
Other terms which deserve clarification in this context:
- character set, charset
- The set of characters that can be encoded. "The ASCII encoding encompasses a character set of 128 characters." Essentially synonymous to "encoding".
- code page
- A "page" of codes that map a character to a number or bit sequence. A.k.a. "the table". Essentially synonymous to "encoding".
- string
- A string is a bunch of items strung together. A bit string is a bunch of bits; a character string is a bunch of characters. Synonymous to "sequence".
Binary, octal, decimal, hex
There are many ways to write numbers. 10011111 in binary is 237 in octal is 159 in decimal is 9F in hexadecimal. They all represent the same value, but hexadecimal is shorter and easier to read than binary. I will stick with binary throughout this article to get the point across better and spare the reader one layer of abstraction. Do not be alarmed to see character codes referred to in other notations elsewhere, it's all the same thing.
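The equivalence of the four notations is easy to confirm, for example with Python's integer literals and format():

```python
value = 0b10011111                        # binary literal

print(value == 0o237 == 159 == 0x9F)      # True: all four are the same number
print(format(value, "b"), format(value, "o"), value, format(value, "X"))
# 10011111 237 159 9F
```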
Now that we know what we're talking about, let's just say it: 95 characters really isn't a lot when it comes to languages. It covers the basics of English, but what about writing a risqué letter in French? A Straßenübergangsänderungsgesetz in German? An invitation to a smörgåsbord in Swedish? Well, you couldn't. Not in ASCII. There's no specification on how to represent any of the letters é, ß, ü, ä, ö or å in ASCII, so you can't use them.
"But look at it," the Europeans said, "in a common computer with 8 bits to the byte, ASCII is wasting an entire bit which is always set to 0! We can use that bit to squeeze a whole 'nother 128 values into that table!" And so they did. But even so, there are more than 128 ways to stroke, slice, slash and dot a vowel. Not all variations of letters and squiggles used in all European languages can be represented in the same table with a maximum of 256 values. So what the world ended up with is a wealth of encoding schemes, standards, de-facto standards and half-standards that all cover a different subset of characters. Somebody needed to write a document about Swedish in Czech, found that no encoding covered both languages and invented one. Or so I imagine it went countless times over.
And not to forget about Russian, Hindi, Arabic, Hebrew, Korean and all the other languages currently in active use on this planet. Not to mention the ones not in use anymore. Once you have solved the problem of how to write mixed language documents in all of these languages, try your hand at Chinese. Or Japanese. Both contain tens of thousands of characters. You have 256 possible values to a byte consisting of 8 bits. Go!
To create a table that maps bits to characters for a language that uses more than 256 characters, one byte simply isn't enough. Using two bytes (16 bits), it's possible to encode 65,536 distinct values. BIG-5 is such a double-byte encoding. Instead of breaking a string of bits into blocks of eight, it breaks it into blocks of 16 and has a big (I mean, BIG) table that specifies which character each combination of bits maps to. BIG-5 in its basic form covers mostly Traditional Chinese characters. GB18030 is another encoding which essentially does the same thing, but includes both Traditional and Simplified Chinese characters. And before you ask, yes, there are encodings which cover only Simplified Chinese. Can't just have one encoding now, can we?
Here is a small excerpt from the GB18030 table:
GB18030 covers quite a range of characters (including a large part of Latin characters), but in the end is yet another specialized encoding format among many.
Unicode to the confusion
Finally somebody had enough of the mess and set out to forge a ring to bind them all, or rather to create one encoding standard to unify all encoding standards. This standard is Unicode. It basically defines a ginormous table of 1,114,112 code points that can be used for all sorts of letters and symbols. That's plenty to encode all European, Middle-Eastern, Far-Eastern, Southern, Northern, Western, prehistoric and future characters mankind knows about.2 Using Unicode, you can write a document containing virtually any language using any character you can type into a computer. This was either impossible or very very hard to get right before Unicode came along. There's even an unofficial section for Klingon in Unicode. Indeed, Unicode is big enough to allow for unofficial, private-use areas.
So, how many bits does Unicode use to encode all these characters? None. Because Unicode is not an encoding.
Confused? Many people seem to be. Unicode first and foremost defines a table of code points for characters. That's a fancy way of saying "65 stands for A, 66 stands for B and 9,731 stands for ☃" (seriously, it does). How these code points are actually encoded into bits is a different topic. To represent 1,114,112 different values, two bytes aren't enough. Three bytes are, but three bytes are often awkward to work with, so four bytes would be the comfortable minimum. But, unless you're actually using Chinese or some of the other characters with big numbers that take a lot of bits to encode, you're never going to use a huge chunk of those four bytes. If the letter "A" was always encoded to 00000000 00000000 00000000 01000001, "B" always to 00000000 00000000 00000000 01000010 and so on, any document would bloat to four times the necessary size.
To optimize this, there are several ways to encode Unicode code points into bits. UTF-32 is such an encoding that encodes all Unicode code points using 32 bits. That is, four bytes per character. It's very simple, but often wastes a lot of space. UTF-16 and UTF-8 are variable-length encodings. If a character can be represented using a single byte (because its code point is a very small number), UTF-8 will encode it with a single byte. If it requires two bytes, it will use two bytes and so on. It has elaborate ways to use the highest bits in a byte to signal how many bytes a character consists of. This can save space, but may also waste space if these signal bits need to be used often. UTF-16 is in the middle, using at least two bytes, growing to up to four bytes as necessary.
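The size trade-off between the three encodings can be checked directly. The following is a minimal sketch in Python (used here purely for illustration); the sample characters and the helper name `encoded_lengths` are assumptions made for this example.

```python
# Compare how many bytes each Unicode encoding needs per character.
samples = ["A", "é", "縧", "😀"]

def encoded_lengths(ch):
    # The "-be" codec variants are used so that no byte order mark is
    # prepended, which would otherwise inflate the counts.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-32 always needs four bytes, UTF-8 grows from one to four, and UTF-16 uses two or four.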
And that's all there is to it. Unicode is a large table mapping characters to numbers and the different UTF encodings specify how these numbers are encoded as bits. Taken together, Unicode and its UTF encodings are yet another encoding scheme. There's nothing special about it, it's just trying to cover everything while still being efficient. And that's A Good Thing.™
Characters are referred to by their "Unicode code point". Unicode code points are written in hexadecimal (to keep the numbers shorter), preceded by a "U+" (that's just what they do, it has no other meaning than "this is a Unicode code point"). The character Ḁ has the Unicode code point U+1E00. In other (decimal) words, it is the 7680th character of the Unicode table. It is officially called "LATIN CAPITAL LETTER A WITH RING BELOW".
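The relationship between a character, its code point and its official name can be looked up programmatically. A small sketch in Python, using the standard `unicodedata` module:

```python
import unicodedata

ch = "\u1E00"                # the character Ḁ, written via its code point
code_point = ord(ch)         # the code point as a plain integer

print(f"U+{code_point:04X}")     # hexadecimal "U+" notation
print(code_point)                # the same number in decimal
print(unicodedata.name(ch))      # official Unicode character name
```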
A summary of all the above: Any character can be encoded in many different bit sequences and any particular bit sequence can represent many different characters, depending on which encoding is used to read or write them. The reason is simply that different encodings use different numbers of bits per character and different values to represent different characters.
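|bits||encoding||characters|
|11000100 01000010||Windows Latin 1||ÄB|

|characters||encoding||bits|
|Føö||Windows Latin 1||01000110 11111000 11110110|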
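Both directions of that summary can be tried out in a few lines. This is a sketch in Python; the codec names `cp1252` (Windows Latin 1) and `mac_roman` are Python's own names for these encodings, and the variable names are made up for this example.

```python
# One bit sequence, read through two different lookup tables.
raw = bytes([0b11000100, 0b01000010])
as_windows_latin1 = raw.decode("cp1252")
as_mac_roman = raw.decode("mac_roman")
print(as_windows_latin1, as_mac_roman)

# One string, written out through three different lookup tables.
text = "Føö"
for encoding in ("cp1252", "mac_roman", "utf-8"):
    bits = " ".join(f"{byte:08b}" for byte in text.encode(encoding))
    print(f"{encoding:10} {bits}")
```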
Misconceptions, confusions and problems
Having said all that, we come to the actual problems experienced by many users and programmers every day, how those problems relate to all of the above and what their solution is. The biggest problem of all is:
Why in god's name are my characters garbled?!
If you open a document and it looks like this, there's one and only one reason for it: Your text editor, browser, word processor or whatever else that's trying to read the document is assuming the wrong encoding. That's all. The document is not broken (well, unless it is, see below), there's no magic you need to perform, you simply need to select the right encoding to display the document.
The hypothetical document above contains this sequence of bits:
Now, quick, what encoding is that? If you just shrugged, you'd be correct. Who knows, right‽
Well, let's try to interpret this as ASCII. Hmm, most of these bytes start[3] with a 1 bit. If you remember correctly, ASCII doesn't use that bit. So it's not ASCII. What about UTF-8? Hmm, no, most of these sequences are not valid UTF-8.[4] So UTF-8 is out, too. Let's try "Mac Roman" (yet another encoding scheme for them Europeans). Hey, all those bytes are valid in Mac Roman. 10000011 maps to "É", 01000111 to "G" and so on. If you read this bit sequence using the Mac Roman encoding, the result is "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢". That looks like a valid string, no? Yes? Maybe? Well, how's the computer to know? Maybe somebody meant to write "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢". For all I know that could be a DNA sequence.[5] Unless you have a better suggestion, let's declare this to be a DNA sequence, say this document was encoded in Mac Roman and call it a day.
Of course, that unfortunately is complete nonsense. The correct answer is that this text is encoded in the Japanese Shift-JIS encoding and was supposed to read "エンコーディングは難しくない". Well, who'd've thunk?
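The guessing game above can be replayed in code. The following Python sketch encodes the Japanese sentence as Shift-JIS and then tries the same encodings in the same order; `try_decode` is a hypothetical helper written for this example.

```python
original = "エンコーディングは難しくない"
raw = original.encode("shift_jis")      # the mystery bit sequence

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

results = {
    encoding: try_decode(raw, encoding)
    for encoding in ("ascii", "utf-8", "mac_roman", "shift_jis")
}
for encoding, text in results.items():
    print(f"{encoding:10} {text!r}")
```

ASCII and UTF-8 reject the bytes outright, Mac Roman accepts them and yields gibberish, and only Shift-JIS yields the intended sentence. Nothing in the bytes themselves says which reading is right.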
The primary cause of garbled text is: Somebody is trying to read a byte sequence using the wrong encoding. The computer always needs to be told what encoding some text is in. Otherwise it can't know. Different kinds of documents have different ways of specifying what encoding they're in, and these ways should be used. A raw bit sequence is always a mystery box and could mean anything.
Most browsers allow the selection of a different encoding in the View menu under the menu option "Text Encoding", which causes the browser to reinterpret the current page using the selected encoding. Other programs may offer something like "Reopen using encoding…" in the File menu, or possibly an "Import…" option which allows the user to manually select an encoding.
My document doesn't make sense in any encoding!
If a sequence of bits doesn't make sense (to a human) in any encoding, the document has most likely been converted incorrectly at some point. Say we took the above text "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢" because we didn't know any better and saved it as UTF-8. The text editor assumed it correctly read a Mac Roman encoded text and you now want to save this text in a different encoding. All of these characters are valid Unicode characters after all. That is to say, there's a code point in Unicode that can represent "É", one that can represent "G" and so on. So we can happily save this text as UTF-8:
This is now the UTF-8 bit sequence representing the text "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢". This bit sequence has absolutely nothing to do with our original document. Whatever encoding we try to open it in, we won't ever get the text "エンコーディングは難しくない" from it. It is completely lost. It would be possible to recover the original text from it if we knew that a Shift-JIS document was misinterpreted as Mac Roman and then accidentally saved as UTF-8 and reversed this chain of missteps. But that would be a lucky fluke.
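The chain of missteps, and the lucky-fluke repair, can be sketched in Python as follows. This only works because every step of the accident is known and each step happened to be lossless; the variable names are made up for this example.

```python
original = "エンコーディングは難しくない"

# The accident: Shift-JIS bytes are misread as Mac Roman, then saved as UTF-8.
misread = original.encode("shift_jis").decode("mac_roman")
saved = misread.encode("utf-8")

# No single choice of encoding brings the text back.
for encoding in ("utf-8", "shift_jis", "mac_roman"):
    attempt = saved.decode(encoding, errors="replace")
    print(f"{encoding:10} recovers original: {attempt == original}")

# Undoing each misstep in reverse order does.
recovered = saved.decode("utf-8").encode("mac_roman").decode("shift_jis")
print(recovered == original)
```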
Many times certain bit sequences are invalid in a particular encoding. If we tried to open the original document using ASCII, some bytes would be valid in ASCII and map to a real character and others wouldn't. The program you're opening it with may decide to silently discard any bytes that aren't valid in the chosen encoding, or possibly replace them with a placeholder character. There's also the "Unicode replacement character" � (U+FFFD) which a program may decide to insert for any character it couldn't decode correctly when trying to handle Unicode. If a document is saved with some characters gone or replaced, then those characters are really gone for good with no way to reverse-engineer them.
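Python's decoder exposes these choices as explicit error handlers, which makes the data loss easy to see. A short sketch, with a made-up sample string:

```python
raw = "café".encode("latin-1")      # the byte for "é" is not valid ASCII

# Strict decoding (the default) refuses to guess.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding silently drops or substitutes the invalid byte.
ignored = raw.decode("ascii", errors="ignore")
replaced = raw.decode("ascii", errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

Once `ignored` or `replaced` is saved back to disk, the "é" cannot be recovered from it.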
If a document has been misinterpreted and converted to a different encoding, it's broken. Trying to "repair" it may or may not be successful, usually it isn't. Any manual bit-shifting or other encoding voodoo is mostly that, voodoo. It's trying to fix the symptoms after the patient has already died.
So how to handle encodings correctly?
It's really simple: Know what encoding a certain piece of text, that is, a certain byte sequence, is in, then interpret it with that encoding. That's all you need to do. If you're writing an app that allows the user to input some text, specify what encoding you accept from the user. For any sort of text field, the programmer can usually decide its encoding. For any sort of file a user may upload or import into a program, there needs to be a specification what encoding that file should be in. Alternatively, the user needs some way to tell the program what encoding the file is in. This information may be part of the file format itself, or it may be a selection the user has to make (not that most users would usually know, unless they have read this article).
If you need to convert from one encoding to another, do so cleanly using tools that are specialized for that. Converting between encodings is the tedious task of comparing two code pages and deciding that character 152 in encoding A is the same as character 4122 in encoding B, then changing the bits accordingly. This particular wheel does not need reinventing and any mainstream programming language includes some way of converting text from one encoding to another without needing to think about code points, pages or bits at all.
Say your app must accept files uploaded in GB18030, but internally you are handling all data in UTF-32. A dedicated conversion tool can cleanly convert the uploaded file with a one-liner. That is, it will preserve the characters while changing the underlying bits:
|character||GB18030 encoding||UTF-32 encoding|
That's all there is to it. The content of the string, that is, the human readable characters, didn't change, but it's now a valid UTF-32 string. If you keep treating it as UTF-32, there's no problem with garbled characters. As discussed at the very beginning though, not all encoding schemes can represent all characters. It's not possible to encode the character "縧" in any encoding scheme designed for European languages. Something Bad™ would happen if you tried to.
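As an illustration of such a conversion, here is a minimal Python sketch. The helper `convert` is hypothetical and simply chains a decode and an encode; the `utf-32-be` codec is chosen so that no byte order mark is added.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another, preserving the characters."""
    return data.decode(source).encode(target)

uploaded = "縧".encode("gb18030")             # stands in for the uploaded file
internal = convert(uploaded, "gb18030", "utf-32-be")

print(" ".join(f"{byte:08b}" for byte in uploaded))
print(" ".join(f"{byte:08b}" for byte in internal))

# An encoding designed for European languages has no slot for this character.
try:
    "縧".encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
print(latin1_failed)
```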
Unicode all the way
Precisely because of that, there's virtually no excuse in this day and age not to be using Unicode all the way. Some specialized encodings may be more efficient than the Unicode encodings for certain languages. But unless you're storing terabytes and terabytes of very specialized text (and that's a lot of text), there's usually no reason to worry about it. Problems stemming from incompatible encoding schemes are much worse than a wasted gigabyte or two these days. And this will become even truer as storage and bandwidth keep growing larger and cheaper.
If your system needs to work with other encodings, convert them to Unicode upon input and convert them back to other encodings on output as necessary. Otherwise, be very aware of what encodings you're dealing with at which point and convert as necessary, if that's possible without losing any information.
I have this website talking to a database. My app handles everything as UTF-8 and stores it as such in the database and everything works fine, but when I look at my database admin interface my text is garbled. -- Anonymous code monkey
There are situations where encodings are handled incorrectly but things still work. An often-encountered situation is a database that's set to Latin-1 and an app that works with UTF-8 (or any other encoding). Pretty much any combination of 1s and 0s is valid in the single-byte Latin-1 encoding scheme. If the database receives text from an application that looks like 11100111 10111000 10100111, it'll happily store it, thinking the app meant to store the three Latin characters "ç¸§". After all, why not? It then later returns this bit sequence back to the app, which will happily accept it as the UTF-8 sequence for "縧", which it originally stored. The database admin interface automatically figures out that the database is set to Latin-1 though and interprets any text as Latin-1, so all values look garbled only in the admin interface.
That's a case of fool's luck where things happen to work when they actually shouldn't. Any sort of operation on the text in the database may or may not work as intended, since the database is not interpreting the text correctly. In a worst case scenario, the database inadvertently destroys all text during some random operation two years after the system went into production because it was operating on text assuming the wrong encoding.[6]
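The fool's luck can be simulated without a database. In the Python sketch below, a plain Latin-1 decode stands in for the database's view of the data, and an uppercase conversion stands in for "some random operation"; both are assumptions made for illustration.

```python
app_bytes = "縧".encode("utf-8")            # what the app sends

# The database believes these three bytes are three Latin-1 characters.
db_view = app_bytes.decode("latin-1")
print(db_view)

# As long as the bytes are handed back untouched, the app never notices.
round_trip = db_view.encode("latin-1").decode("utf-8")
print(round_trip)

# One well-meaning operation on the misinterpreted text breaks the bytes.
damaged = db_view.upper().encode("latin-1")
print(damaged.decode("utf-8", errors="replace"))
```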
UTF-8 and ASCII
The ingenious thing about UTF-8 is that it's binary compatible with ASCII, which is the de-facto baseline for all encodings. All characters available in the ASCII encoding only take up a single byte in UTF-8 and they're the exact same bytes as are used in ASCII. In other words, ASCII maps 1:1 onto UTF-8. Any character not in ASCII takes up two or more bytes in UTF-8. For most programming languages that expect to parse ASCII, this means you can include UTF-8 text directly in your programs:
Saving this as UTF-8 results in this bit sequence:
Only bytes 12 through 17 (the ones starting with 1) are UTF-8 characters (two characters with three bytes each). All the surrounding characters are perfectly good ASCII. A parser would read this as follows:
To the parser, anything following a quotation mark is just a byte sequence which it will take as-is until it encounters another quotation mark. If you simply output this byte sequence, you're outputting UTF-8 text. No need to do anything else. The parser does not need to specifically support UTF-8, it just needs to take strings literally. Naive parsers can support Unicode this way without actually supporting Unicode. Many modern languages are explicitly Unicode-aware though.
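The byte positions mentioned above can be verified with a short Python sketch. The source line used here, a string assignment containing the two characters "漢字", is an assumption chosen to match the description (two three-byte characters at bytes 12 through 17).

```python
# Hypothetical one-line program, saved as UTF-8.
source = '$string = "漢字";'
raw = source.encode("utf-8")

# 1-based positions of all bytes whose highest bit is set (i.e. non-ASCII).
non_ascii_positions = [i for i, byte in enumerate(raw, start=1) if byte >= 0x80]
print(non_ascii_positions)

# Everything else is plain ASCII that a naive parser understands.
ascii_only = bytes(byte for byte in raw if byte < 0x80).decode("ascii")
print(ascii_only)

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("abc".encode("ascii") == "abc".encode("utf-8"))
```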
Encodings and PHP
This last section deals with issues surrounding Unicode and PHP. Some portions of it are applicable to programming languages in general while others are PHP specific. Nothing new will be revealed about encodings, but concepts described above will be rehashed in the light of practical application.
PHP doesn't natively support Unicode. Except it actually supports it quite well. The previous section shows how UTF-8 characters can be embedded in any program directly without problems, since UTF-8 is backwards compatible with ASCII, which is all PHP needs. The statement "PHP doesn't natively support Unicode" is true though and it seems to cause a lot of confusion in the PHP community.
One specific pet-peeve of mine are the functions utf8_encode and utf8_decode. I often see nonsense along the lines of "To use Unicode in PHP you need to utf8_encode your text on input and utf8_decode on output". These two functions seem to promise some sort of automagic conversion of text to UTF-8 which is "necessary" since "PHP doesn't support Unicode". If you've been following this article at all though, you should know by now that
- there's nothing special about UTF-8 and
- you cannot encode text to UTF-8 after the fact
To clarify that second point: All text is already encoded in some encoding. When you type it into the source code, it has some encoding. Specifically, whatever you saved it as in your text editor. If you get it from a database, it's already in some encoding. If you read it from a file, it's already in some encoding.
Text is either encoded in UTF-8 or it's not. If it's not, it's encoded in ASCII, ISO-8859-1, UTF-16 or some other encoding. If it's not encoded in UTF-8 but is supposed to contain "UTF-8 characters",[7] then you have a case of cognitive dissonance. If it does contain actual characters encoded in UTF-8, then it's actually UTF-8 encoded. Text can't contain Unicode characters without being encoded in one of the Unicode encodings.
So what in the world does utf8_encode do then?
"Encodes an ISO-8859-1 string to UTF-8"[8]
Aha! So what the author actually wanted to say is that it converts the encoding of text from ISO-8859-1 to UTF-8. That's all there is to it. utf8_encode must have been named by some European without any foresight and is a horrible, horrible misnomer. The same goes for utf8_decode. These functions are useless for any purpose other than converting between ISO-8859-1 and UTF-8. If you need to convert a string from any other encoding to any other encoding, look to a general-purpose conversion function instead.
utf8_encode is not a magic wand that needs to be swung over any and all text because "PHP doesn't support Unicode". Rather, it seems to cause more encoding problems than it solves thanks to terrible naming and unknowing developers.
So what does it mean for a language to natively support or not support Unicode? It basically refers to whether the language assumes that one character equals one byte or not. For example, PHP allows direct access to the characters of a string using array notation:
If that string was in a single-byte encoding, this would give us the first character. But only because "character" coincides with "byte" in a single-byte encoding. PHP simply gives us the first byte without thinking about "characters". Strings are byte sequences to PHP, nothing more, nothing less. All this "readable character" stuff is a human thing and PHP doesn't care about it.
The same goes for many standard string functions. The non-support arises if there's a discrepancy between the length of a byte and a character.
Using a byte-oriented substring function to take the first "character" of the above string will, again, give us the first byte, which is 11100110. In other words, a third of the three-byte character "漢". 11100110 is, by itself, an invalid UTF-8 sequence, so the string is now broken. If you felt like it, you could try to interpret that in some other encoding where 11100110 represents a valid character, which will result in some random character. Have fun, but don't use it in production.
And that's actually all there is to it. "PHP doesn't natively support Unicode" simply means that most PHP functions assume one byte = one character, which may lead to it chopping multi-byte characters in half or calculating the length of strings incorrectly if you're naively using non-multi-byte-aware functions on multi-byte strings. It does not mean that you can't use Unicode in PHP or that every Unicode string needs to be blessed by utf8_encode or other such nonsense.
Luckily, there's the Multibyte String extension, which replicates all important string functions in a multi-byte aware fashion. Using its substring function on the above string correctly returns 11100110 10111100 10100010, which is the whole "漢" character. Because the functions now have to actually think about what they're doing, they need to know what encoding they're working on. Therefore every function of the extension accepts an encoding parameter as well. Alternatively, the encoding can be set globally for all of the extension's functions.
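The difference between slicing bytes and slicing characters is not specific to PHP. The following Python sketch (Python is used only because it keeps byte strings and text strings as separate types) shows both behaviours side by side on the string "漢字", which is assumed here as the example string.

```python
text = "漢字"
raw = text.encode("utf-8")

# Byte-oriented slicing: one byte, i.e. a third of the first character.
first_byte = raw[:1]
print(f"{first_byte[0]:08b}")

try:
    first_byte.decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)

# Read through another lookup table, the lone byte becomes a random character.
print(first_byte.decode("latin-1"))

# Character-oriented slicing: the whole first character, all three bytes of it.
first_char = text[:1]
print(" ".join(f"{byte:08b}" for byte in first_char.encode("utf-8")))
print(len(raw), len(text))
```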
Using and abusing PHP's handling of encodings
The whole issue of PHP's (non-)support for Unicode is that it just doesn't care. Strings are byte sequences to PHP. What bytes in particular doesn't matter. PHP doesn't do anything with strings except keeping them stored in memory. PHP simply doesn't have any concept of either characters or encodings. And unless it tries to manipulate strings, it doesn't need to either; it just holds onto bytes that may or may not eventually be interpreted as characters by somebody else. The only requirement PHP has of encodings is that PHP source code needs to be saved in an ASCII compatible encoding. The PHP parser is looking for certain characters that tell it what to do. $ (00100100) signals the start of a variable, = (00111101) an assignment, " (00100010) the start and end of a string and so on. Anything else that doesn't have any special significance to the parser is just taken as a literal byte sequence. That includes anything between quotes, as discussed above. This means the following:
You can't save PHP source code in an ASCII-incompatible encoding. For example, in UTF-16 a " is encoded as 00000000 00100010. To PHP, which tries to read everything as ASCII, that's a NUL byte followed by a ". PHP will probably get a hiccup if every other character it finds is a NUL byte.
You can save PHP source code in any ASCII-compatible encoding. If the first 128 code points of an encoding are identical to ASCII, PHP can parse it. All characters that are in any way significant to PHP are within the 128 code points defined by ASCII. If string literals contain any code points beyond that, PHP doesn't care. You can save PHP source code in ISO-8859-1, Mac Roman, UTF-8 or any other ASCII-compatible encoding. The string literals in your script will have whatever encoding you saved your source code as.
Any external file you process with PHP can be in whatever encoding you like. If PHP doesn't need to parse it, there are no requirements to meet to keep the PHP parser happy.
Reading such a file into a variable will simply copy the bits in the file into that variable. PHP doesn't try to interpret, convert, encode or otherwise fiddle with the contents. The file can even contain binary data such as an image, PHP doesn't care.
If internal and external encodings have to match, they have to match. A common case is localization, where the source code looks up a string such as "Foobar" and an external localization file contains the matching "Foobar" entry together with its translation.
Both "Foobar" strings need to have an identical bit representation if you want to find the correct localization. If the source code was saved in ASCII but the localization file in UTF-16, the strings wouldn't match. Either some sort of encoding conversion would be necessary or the use of an encoding-aware string matching function.
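Why the lookup fails can be seen by comparing the raw bytes. In this Python sketch the dictionary `translations` and its placeholder value are hypothetical; it stands in for a localization file whose keys were saved as UTF-16.

```python
key_in_source = "Foobar".encode("ascii")     # as saved in the source code
key_in_file = "Foobar".encode("utf-16")      # as saved in the localization file

# A byte-wise lookup table built from the localization file.
translations = {key_in_file: "translated text"}

print(key_in_source == key_in_file)          # same characters, different bits
print(key_in_source in translations)         # so the naive lookup misses

# Converting the key to the file's encoding first makes the lookup work.
converted_key = key_in_source.decode("ascii").encode("utf-16")
print(translations.get(converted_key))
```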
The astute reader might ask at this point whether it's possible to save a, say, UTF-16 byte sequence inside a string literal of an ASCII encoded source code file, to which the answer would be: absolutely.
If you can bring your text editor to save the surrounding code in ASCII and only the contents of the string literal in UTF-16, this will work just fine. The necessary binary representation for that looks like this:
The first line and the last two bytes are ASCII. The rest is UTF-16 with two bytes per character. The leading byte order mark on line 2 is a marker required at the start of UTF-16 encoded text (required by the UTF-16 standard, PHP doesn't give a damn). This PHP script will happily output the string "UTF-16" encoded in UTF-16, because it simply outputs the bytes between the two double quotes, which happen to represent the text "UTF-16" encoded in UTF-16. The source code file is neither completely valid ASCII nor UTF-16 though, so working with it in a text editor won't be much fun.
PHP supports Unicode, or in fact any encoding, just fine, as long as certain requirements are met to keep the parser happy and the programmer knows what he's doing. You really only need to be careful when manipulating strings, which includes slicing, trimming, counting and other operations that need to happen on a character level rather than a byte level. If you're not "doing anything" with your strings besides reading and outputting them, you will hardly have any problems with PHP's support of encodings that you wouldn't have in any other language as well.
Other languages are simply encoding-aware. Internally they store strings in a particular encoding, often UTF-16. In turn they need to be told or try to detect the encoding of everything that has to do with text. They need to know what encoding the source code is saved in, what encoding a file they're supposed to read is in, what encoding you want to output text in; and they convert encodings on the fly as needed with some manifestation of Unicode as the middleman. They're doing the same thing you can/should/need to do in PHP semi-automatically behind the scenes. That's neither better nor worse than PHP, just different. The nice thing about it is that standard language functions that deal with strings Just Work™, while in PHP one needs to spare some attention to whether a string may contain multi-byte characters or not and choose string manipulation functions accordingly.
The depths of Unicode
Since Unicode deals with many different scripts and many different problems, it has a lot of depth to it. For example, the Unicode standard contains information for such problems as CJK ideograph unification, that is, information that two or more Chinese/Japanese/Korean characters actually represent the same character in slightly different writing methods. Or rules about converting from lower case to upper case, vice-versa and round-trip, which is not always as straightforward in all scripts as it is in most Western European Latin-derived scripts. Some characters can also be represented using different code points. The letter "ö" for example can be represented using the code point U+00F6 ("LATIN SMALL LETTER O WITH DIAERESIS") or as the two code points U+006F ("LATIN SMALL LETTER O") and U+0308 ("COMBINING DIAERESIS"), that is the letter "o" combined with "¨". In UTF-8 that's either the double-byte sequence 11000011 10110110 or the three-byte sequence 01101111 11001100 10001000, both representing the same human readable character. As such, there are rules governing Normalization within the Unicode standard, i.e. how either of these forms can be converted into the other. This and a lot more is outside the scope of this article, but one should be aware of it.
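The two representations of "ö" and the conversion between them can be sketched with Python's standard `unicodedata` module. NFC and NFD are the names of the composed and decomposed normalization forms defined by the Unicode standard.

```python
import unicodedata

composed = "\u00F6"        # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"     # LATIN SMALL LETTER O + COMBINING DIAERESIS

# Same human readable character, different code points and different bytes.
print(composed == decomposed)
print(" ".join(f"{byte:08b}" for byte in composed.encode("utf-8")))
print(" ".join(f"{byte:08b}" for byte in decomposed.encode("utf-8")))

# Normalization converts either form into the other.
to_composed = unicodedata.normalize("NFC", decomposed)
to_decomposed = unicodedata.normalize("NFD", composed)
print(to_composed == composed, to_decomposed == decomposed)
```

Comparing strings from different sources is generally only reliable after both have been brought into the same normalization form.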
- Text is always a sequence of bits which needs to be translated into human readable text using lookup tables. If the wrong lookup table is used, the wrong character is used.
- You're never actually directly dealing with "characters" or "text", you're always dealing with bits as seen through several layers of abstractions. Incorrect results are a sign of one of the abstraction layers failing.
- If two systems are talking to each other, they always need to specify what encoding they want to talk to each other in. The simplest example of this is this website telling your browser that it's encoded in UTF-8.
- In this day and age, the standard encoding is UTF-8 since it can encode virtually any character of interest, is backwards compatible with the de-facto baseline ASCII and is relatively space efficient for the majority of use cases nonetheless.
- Other encodings still occasionally have their uses, but you should have a concrete reason for wanting to deal with the headaches associated with character sets that can only encode a subset of Unicode.
- The days of one byte = one character are over and both programmers and programs need to catch up on this.
Now you should really have no excuse anymore the next time you garble some text.
About the author
David C. Zentgraf is a web developer working partly in Japan and Europe and is a regular on Stack Overflow. If you have feedback, criticism or additions, please feel free to try @deceze on Twitter, take an educated guess at his email address or look it up using time-honored methods. This article was published on kunststube.net. And no, there is no dirty word in "Kunststube".