Encodings in Python
Friday, May 22nd, 2009
There's a whole slew of information regarding all kinds of encoding issues out there on the big bad Internet. Some deal with how Unicode works, some with what UTF-8 is and how it relates to other encodings, and some with how to transform from one encoding to another. All that theory is nice, but I've found a rather worrying lack of practical, understandable and contextual information on dealing with encodings in Python, leading me to think I'd never be able to properly deal with them.
So I took the plunge, and tried to find some stuff out. Here's what I came up with. All of this might be terribly wrong though. Encodings are a complicated subject if you ask me, so feel free to correct me if I'm wrong.
NOTICE: This article uses special HTML entities in various places to show output. Depending on your browser, the encodings it supports, the font you are using and its ability to show Unicode characters, you may or may not be able to properly see these characters. In these cases a description of the character is given in parentheses right after the character.
When we're talking about text, we're really talking about two things:
- Byte representation of text.
- Encoded representation of text.
Byte representation of text doesn't give a crap about what language or encoding something is. A byte is a byte: 8 bits, 256 different values.
The text's encoding doesn't exist. That is, a text is encoded in what we say it is encoded in. If I take a piece of UTF-8 text, and say "this is encoded as latin-1", that's just fine. There is nothing inherently UTF-8y about the text. The encoding merely lets us know what little symbol we should show when we encounter a certain byte (or range of bytes, for that matter). But pretending that a piece of text is in some encoding can of course cause problems when the text contains bytes which have no meaning in that encoding: treating latin-1 text as ASCII or as UTF-8, for example. This is what causes encoding/decoding errors in Python.
A piece of text consisting of the byte with decimal value 163 does not exist in ASCII (as it is higher than 127), represents '£' (Pound sterling) in latin-1, 'Ѓ' (Cyrillic capital letter Gje) in Cyrillic (ISO-8859-5) and is also not valid on its own in UTF-8. As you can see, it all depends on how you interpret it, and encodings are names for ways to interpret.
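To make this concrete, here's a small sketch that decodes that one byte under the four encodings just mentioned. The function name interpret is made up for this illustration. I've written it with an explicit b'...' literal so that it should read the same on Python 2.7 and Python 3 (the rest of this article uses Python 2), and it prints code points rather than the characters themselves so the terminal's encoding doesn't get in the way.

```python
def interpret(raw, encodings):
    """Decode the same bytes under several encodings.

    Returns a dict mapping each encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


raw = b'\xa3'  # one byte, decimal value 163
results = interpret(raw, ['ascii', 'latin-1', 'iso-8859-5', 'utf-8'])

for name in sorted(results):
    text = results[name]
    if text is None:
        print('%s: not valid' % name)
    else:
        print('%s: U+%04X' % (name, ord(text)))
```

Same byte, four answers: two encodings reject it, and the other two map it to different code points (U+00A3 for latin-1, U+0403 for ISO-8859-5).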
There are a great many encodings out there. If you wanted to, you could make your own. Some of the ones I'll be using as an example in this little adventure are:
- ASCII: A 7-bit encoding (the 8th bit is unused) that doesn't know jack about swishy characters with accents and stuff.
- latin-1: An 8-bit encoding that does know about accents and stuff. Backwards compatible with ASCII in that its first 7 bits (byte-values lower than 128) map to the same characters as ASCII does. Higher characters map to all kinds of stuff like characters with accents on them, pound signs, etc.
- UTF-8: A variable-length encoding. A character can consist of 1 byte or more, up to 4 bytes. Also backwards compatible with ASCII in the same way that latin-1 is.
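A quick sketch of how these three differ in practice, using the name 'André' that shows up later in this article. The variable names are just for illustration, and the u'...' and b'...' literals are there so the snippet should run unchanged on Python 2.7 and Python 3.

```python
text = u'Andr\xe9'  # 'Andr' plus e-acute, as a Unicode string

latin1_bytes = text.encode('latin-1')  # one byte per character
utf8_bytes = text.encode('utf-8')      # the e-acute takes two bytes

print(len(latin1_bytes))  # 5
print(len(utf8_bytes))    # 6

# Plain ASCII text comes out as the same bytes in all three encodings.
plain = u'Andre'
assert plain.encode('ascii') == plain.encode('latin-1') == plain.encode('utf-8')

# ASCII itself has no byte for the accented character.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```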
When to deal with encodings
In any given software system, we have to deal with a lot of different potential encodings. Here are some general ones we have to worry about:
- The default system encoding of the Operating System.
- The encoding of any input into the program.
- The default encoding of our programming language.
- The encodings our software libraries can work with.
- The supported encoding of the tools we use to work with data in our program (debuggers, etc).
- The encoding of the destination when we output data to it.
As we can see, there are a lot of things that can go wrong. All this sounds rather discouraging, doesn't it? The truth of the matter is that most of the time, encodings don't matter at all. In reality, we only need to take encoding into account when:
- We want to operate on the actual meanings of the text, instead of just transporting bytes around.
- We output text to a system which needs to operate on the actual meanings of the text (or which cares about the encoding its input comes in).
Suppose we're reading a file and counting the number of bytes in that file. Does the encoding matter? No, we care about bytes, not characters. If, however, we want to count the frequency of characters, we'll need to deal with encodings, since an 'e' without an accent is not the same as an 'é' ('e' with an accent). When we're outputting text, it depends on the system we're outputting to whether we need to deal with encodings. If we're inserting data into a database, and that database expects UTF-8, we'll need to make sure we also output UTF-8. If we're printing to the console, and the system default encoding is ASCII, we'll need to make sure we're outputting ASCII.
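Here's a minimal sketch of that difference. Instead of a real file I use a short byte string that stands in for file contents, and I'm assuming we know it is UTF-8. Counter comes from the standard collections module (Python 2.7 and up).

```python
from collections import Counter

# Stand-in for the contents of a file: 'cafe creme' with an e-acute
# and an e-grave, stored as UTF-8.
data = u'caf\xe9 cr\xe8me'.encode('utf-8')

# Counting bytes: no encoding needed.
byte_count = len(data)

# Counting characters: we have to decode first.
text = data.decode('utf-8')
char_freq = Counter(text)

print(byte_count)          # 12
print(len(text))           # 10
print(char_freq[u'e'])     # 1
print(char_freq[u'\xe9'])  # 1
```

The two accented characters take two bytes each in UTF-8, which is why the byte count and the character count differ, and why the plain 'e' and the 'é' get separate tallies.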
As a software developer, you'll mostly have to deal with the encodings of input and output of your program. This presents us with two major problems:
- We need to know the encoding of our input.
- We need to know the encoding we need to output.
Sometimes, however, we simply don't know the encodings. When we read input from a text file that contains bytes with values higher than 127, we might simply not know what encoding it is. It could be in the system's default encoding, but it does not have to be. These can be tricky problems. For instance, my system default encoding is ASCII at the moment. Yet running the command apt-cache dumpavail produces output with bytes with a value higher than 127. Which encoding is it in? It's in ASCII encoding of course! Except that ASCII doesn't support byte values larger than 127, so it is actually output with invalid characters.
Usually these problems are solved by making industry standards. For instance, some things always have to be in a certain encoding. Other things may let us know up-front what encoding something is in. An XML document, for instance, can declare its encoding in the XML declaration on the first line of the file, and an HTML document can do so in a meta tag near the top (that part should always be in just ASCII characters so everything can read it).
How to deal with encodings
WARNING: You may encounter what is sometimes called a 'heisenbug'. A heisenbug is a bug which normally does not appear when your program is running, but only appears when you try to look at the data. In the case of encodings, suppose you try to print a string, purely for debugging reasons, which has the UTF-8 encoding. The console you're printing on may not be able to deal with UTF-8, and as such Python tries to encode the UTF-8 string to ASCII and fails. This bug would not occur had you not tried to print the data to the console.
Onto the meat of the article! How do we deal with encodings?
First, some basics. Normal strings in Python don't care about encoding:
    >>> s = 'Andr\xE9'  # \xE9 == hex for 233 == latin-1 for e-acute.
    >>> type(s)
    <type 'str'>
    >>> print s
    Andr�
Note that I'm outputting this to an ASCII terminal, yet Python does not complain about the unsupported character in 's'. It simply shows some garbled text: 'Andr' followed by whatever the terminal makes of the byte \xE9 (in my case an 'i' with trema, an inverted question mark and a 1/2 character).
Python also has support for unicode strings. These unicode strings can contain just about any character in the entire world.
    >>> s = u'Andr\xE9'
    >>> type(s)
    <type 'unicode'>
In this case, we define a Unicode string 'André' ('e' acute) by specifying the 'é' as hexadecimal value E9. Since Unicode strings can contain just about everything, they are ideal as intermediate storage for strings. It is therefore always a good idea to decode any input you receive from its encoding to Unicode, and to encode it to the proper encoding when you output it again.
Now let's see what happens when we print the unicode string we just created:
    >>> print s
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)
As you can see in the previous examples, when we use normal python strings, we don't get an error. But when we use the Unicode string, we get an encoding error. Like I said: normal strings in Python don't care about encoding, but unicode strings do. So when we try to print it, Python will notice our terminal is in ASCII, and tries to convert the string from Unicode to ASCII. This, naturally, fails as ASCII doesn't know about character \xE9. We'll learn how to deal with that in a moment. First, let's look at how to handle input.
So what do we do when reading in an ASCII string when it contains invalid characters? We've got three options:
- Ignore the entire encoding (if we can).
- Ignore any invalid characters.
- Replace any invalid characters with a placeholder character.
Ignoring invalid characters simply removes them from the text as the text is decoded from the source encoding to the target encoding. Replacement will replace the unknown character with a placeholder character. When decoding to Unicode, it will be replaced with the character U+FFFD (hexadecimal FFFD), the Unicode replacement character, which will be rendered (if your font supports it) as a square or diamond with a question mark in it.
So, let's decode our input (containing invalid characters) from ascii to unicode:
    >>> s = 'Andr\xE9'  # ASCII with invalid char \xE9.
    >>> s.decode('ascii', 'ignore')
    u'Andr'
The 'é' is dropped from the output string as it does not exist in ASCII. We can also replace characters:
    >>> s.decode('ascii', 'replace')
    u'Andr\ufffd'
This works great. The 'é' is replaced with \uFFFD: the Unicode replacement character � (black diamond with a question mark), representing an unsupported character. The decode() method on normal strings decodes from the encoding you specify to a Unicode string. But look at what happens when we do the same thing, but print the result:
    >>> print s.decode('ascii', 'replace')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 4: ordinal not in range(128)
How is this possible? Did we do something wrong? No, everything is alright. This is the heisenbug. What happened here is that the 'replace' option replaces any unknown characters in the ASCII string with the \uFFFD Unicode character. When we try to print the resulting Unicode string to the terminal, Python will try to convert the string to our default system encoding (ASCII) and print it. This will fail, because \uFFFD isn't a character in ASCII either. If we want to print it (for debugging purposes or something) we have to encode it back to ASCII before we can:
    >>> print s.decode('ascii', 'replace').encode('ascii', 'replace')
    Andr?
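If you find yourself doing this a lot while debugging, a tiny helper can save some typing. This is only a sketch: the name printable and its default arguments are my own invention, not something from the standard library. It round-trips the text through the target encoding so that what comes out is safe to print on a terminal using that encoding. It should behave the same on Python 2.7 and Python 3.

```python
def printable(text, encoding='ascii', errors='replace'):
    """Return a copy of text holding only characters that encoding can represent."""
    return text.encode(encoding, errors).decode(encoding)


u = b'Andr\xe9'.decode('ascii', 'replace')      # u'Andr\ufffd'

print(printable(u))                             # Andr?
print(printable(u, errors='backslashreplace'))  # Andr\ufffd
```

The 'backslashreplace' error handler is handy for debugging because it shows which character was there instead of hiding it behind a question mark.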
The Python manual mentions: "When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default [ASCII] encoding". Since Python converts without using any of the replacement options, UnicodeEncodeErrors can occur. We have to encode it ourselves if our data contains characters the target encoding doesn't support. The encode() method of a Unicode string encodes the string from Unicode to the encoding you specify. If we were to output the string to an HTML file that is in the UTF-8 encoding, we would do this instead:
    s = 'Andr\xE9'
    line = s.decode('ascii', 'replace')
    f_out = file('foo.html', 'w')
    f_out.write('''<?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
    %s
    </body>
    </html>''' % (line.encode('utf-8')))
    f_out.close()
This will output an HTML document with the string 'Andr�' (where the last character is a diamond with a question mark in it, depending on how your browser displays the Unicode replacement character). We first decode the line from ASCII to Unicode and when we output it, we encode it from Unicode to UTF-8. Remember: UTF-8 isn't the same as Unicode! There's also UTF-16, UTF-32, etc.
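To see that UTF-8 is just one way of writing Unicode down as bytes, here's a small sketch that encodes the same single character three ways. I picked the little-endian variants of UTF-16 and UTF-32 so that no byte order mark is added, which is my choice for the example. It should run on Python 2.7 and Python 3 alike.

```python
char = u'\xe9'  # e-acute, Unicode code point U+00E9

utf8 = char.encode('utf-8')
utf16 = char.encode('utf-16-le')   # little-endian, no byte order mark
utf32 = char.encode('utf-32-le')

assert utf8 == b'\xc3\xa9'
assert utf16 == b'\xe9\x00'
assert utf32 == b'\xe9\x00\x00\x00'

# All three decode back to the same Unicode string.
assert utf8.decode('utf-8') == char
assert utf16.decode('utf-16-le') == char
assert utf32.decode('utf-32-le') == char

# The replacement character U+FFFD takes three bytes in UTF-8.
assert u'\ufffd'.encode('utf-8') == b'\xef\xbf\xbd'
```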
If we know that the string 's' is actually in the latin-1 encoding, we can replace the line:
    line = s.decode('ascii', 'replace')
with:
    line = s.decode('latin-1', 'replace')
and the output in UTF-8 will become 'André'.
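The general shape of this (decode on the way in, work on Unicode, encode on the way out) can be sketched as a small function. The function shout and its parameter names are hypothetical and only there for illustration. Upper-casing stands in for "operating on the meaning of the text". The snippet should run on Python 2.7 and Python 3.

```python
def shout(raw, input_encoding, output_encoding):
    """Upper-case encoded text: decode on the way in, encode on the way out."""
    text = raw.decode(input_encoding)      # bytes -> Unicode
    text = text.upper()                    # work on characters, not bytes
    return text.encode(output_encoding)    # Unicode -> bytes


result = shout(b'andr\xe9', 'latin-1', 'utf-8')
assert result == b'ANDR\xc3\x89'   # E-acute (U+00C9) encoded as UTF-8
```

Because the upper-casing happens on the Unicode string, the 'é' becomes 'É' regardless of which encodings are used at either end.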
Another problem that can occur is when you try to run the decode() method on strings which are already unicode:
    >>> s = 'Andr\xE9'
    >>> u = s.decode('ascii', 'replace')
    >>> u
    u'Andr\ufffd'
    >>> u.decode('ascii', 'replace')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 4: ordinal not in range(128)
    >>> u.decode('utf-8', 'replace')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 4: ordinal not in range(128)
As you can see, calling the decode() method on a Unicode string that contains non-ASCII characters fails. Note that the error is a UnicodeEncodeError, not a decode error: as far as I can tell, Python 2 first quietly encodes the Unicode string to a normal string using the default encoding (ASCII), and only then tries to decode the result. It is that hidden encode step which blows up. In Python 3 the decode() method has been removed from text strings altogether.
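One way to sidestep this is to only decode things that are still bytes. Here's a minimal sketch. The helper name ensure_unicode is made up, and it relies on the bytes type, which on Python 2.6 and later is simply another name for the normal string type.

```python
def ensure_unicode(value, encoding, errors='replace'):
    """Decode value if it is a byte string; return it unchanged if it is already Unicode."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


raw = b'Andr\xe9'
once = ensure_unicode(raw, 'latin-1')
twice = ensure_unicode(once, 'latin-1')   # already Unicode: left alone

assert once == u'Andr\xe9'
assert twice == once
```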
It turns out that dealing with encodings in Python is tricky since we have to deal with a large number of potential problem areas. Once you're familiar with the ins and outs though, dealing with encodings becomes rather easy. In general:
- If input is just 'passing through' your program, just treat it as binary instead of text. Don't decode or encode at all. But be careful about the target output (including the terminal when debugging with print or something) as it may require a specific encoding.
- First decode input from the encoding it is in to Unicode.
  - If you know the input encoding: s = input.decode(input_encoding)
  - If you do not know the input encoding, take the safe route and decode from ASCII with the 'replace' option: s = input.decode('ascii', 'replace')
  - If the input might contain invalid characters for the encoding it is in, use the 'replace' option (always a good idea).
- When you output the data again, encode it from Unicode to the encoding the destination expects.
  - If you know the output encoding, encode to it with the encode() method.
  - If you do not know the output encoding, and there is an agreement about the default encoding, use that. Otherwise, take the safe route and use ASCII.
  - If the target encoding might not support all the characters that are in the internal representation, use the 'replace' option.
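The rules of thumb above can be sketched as two small helpers. The names decode_input and encode_output are hypothetical, and falling back to ASCII when nothing is known is simply the "safe route" from the list, not the only possible policy. The code should run on Python 2.7 and Python 3.

```python
def decode_input(raw, input_encoding=None):
    """Decode incoming bytes to Unicode, replacing anything invalid."""
    if input_encoding is None:
        input_encoding = 'ascii'       # unknown encoding: take the safe route
    return raw.decode(input_encoding, 'replace')


def encode_output(text, output_encoding=None):
    """Encode Unicode for output, replacing anything the target can't hold."""
    if output_encoding is None:
        output_encoding = 'ascii'      # unknown destination: take the safe route
    return text.encode(output_encoding, 'replace')


text = decode_input(b'Andr\xe9', 'latin-1')

assert encode_output(text, 'utf-8') == b'Andr\xc3\xa9'   # known output encoding
assert encode_output(text) == b'Andr?'                   # unknown: ASCII plus 'replace'
assert decode_input(b'Andr\xe9') == u'Andr\ufffd'        # unknown input encoding
```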
I hope I got all this right, and that it makes dealing with encodings in Python a little clearer.