Ferry Boender, April 15 2011
There's a whole slew of information regarding all kinds of encoding issues out there on the big bad Internet. Some deal with how Unicode works, some with what UTF-8 is and how it relates to other encodings, and some with how to transform from one encoding to another. All that theory is nice, but I've found a rather worrying lack of practical, understandable and contextual information on dealing with encodings in Python, leading me to think I'd never be able to properly deal with encodings in Python.
So I took the plunge, and tried to find some stuff out. Here's what I came up with. All of this might be terribly wrong though. Encodings are a complicated subject if you ask me, so feel free to correct me if I'm wrong.
NOTICE: This article uses special HTML entities in various places to show output. Depending on your browser, the encodings it supports and the font you are using and its capabilities of showing Unicode characters (see, complicated!), you may or may not be able to properly see these characters. In these cases a description of the character is given in parentheses right after the character.
When we're talking about text, we're really talking about two things: the byte representation of the text, and the encoding of the text.
Byte representation of text doesn't give a crap about what language or encoding something is. A byte is a byte: 8 bits, 256 different values.
The text's encoding doesn't exist. That is, a text is encoded in what we say it is encoded in. If I take a piece of UTF-8 text, and say "this is encoded as latin-1", that's just fine. There is nothing inherently UTF-8y about the text. The encoding merely lets us know what little symbol we should show when we encounter a certain byte (or range of bytes, for that matter). But pretending that a piece of text is in an encoding it wasn't written in can of course cause problems: bytes get shown as the wrong symbols, or, when the text contains bytes or byte sequences that the assumed encoding doesn't define, can't be interpreted at all. The latter is what causes encoding/decoding errors in Python.
A piece of text consisting of the byte with decimal value 163 does not exist in ASCII (as it is higher than 127), represents '£' (Pound sterling) in latin-1, 'Ѓ' (Cyrillic capital letter Gje) in Cyrillic (ISO-8859-5) and is also not valid on its own in UTF-8. As you can see, it all depends on how you interpret it, and encodings are names for ways to interpret.
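Python ships codec tables for all of these encodings, so the interpretations of a byte can be checked directly. What follows is only a minimal sketch: the helper name interpret is made up for illustration, and the b'' and u'' prefixes are used so that the snippet means the same thing on Python 2.7 and Python 3.

```python
raw = b'\xa3'  # a single byte with decimal value 163


def interpret(data, encoding):
    """Return the text `data` stands for in `encoding`, or None if it is invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


results = dict((enc, interpret(raw, enc))
               for enc in ('ascii', 'latin-1', 'iso-8859-5', 'utf-8'))

# results['ascii']      -> None       (163 is outside ASCII's 0-127 range)
# results['latin-1']    -> u'\xa3'    (pound sign)
# results['iso-8859-5'] -> u'\u0403'  (Cyrillic capital letter Gje)
# results['utf-8']      -> None       (a lone 0xA3 byte is not valid UTF-8)
```

The same byte gives four different answers, and the byte itself never changes.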
There are an almost infinite number of encodings out there. If you wanted to, you could make your own. Some of the ones I'll be using as an example in this little adventure are:
ASCII: A 7-bit encoding (the 1st bit is unused) that doesn't know jack about swishy characters with accents and stuff.
latin-1: An 8-bit encoding that knows about accents and stuff. Backwards compatible with ASCII in that its last 7 bits (byte-values lower than 128) map to the same characters as ASCII does. Higher characters map to all kinds of stuff like characters with accents on them, pound signs, etc.
UTF-8: A variable-length encoding. A character can consist of 1 byte or more, up to 4 bytes. Also backwards compatible with ASCII in the same way that latin-1 is.
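To make the differences between these three a bit more tangible, here is a small sketch that encodes the same characters in each of them and looks at the resulting bytes. The variable names are made up for this example; the byte values in the comments follow from the encodings themselves.

```python
e_acute = u'\xe9'   # 'e' with an acute accent
euro = u'\u20ac'    # euro sign, which latin-1 doesn't have
plain = u'Andre'    # nothing but ASCII characters

latin1_bytes = e_acute.encode('latin-1')   # b'\xe9': one byte
utf8_bytes = e_acute.encode('utf-8')       # b'\xc3\xa9': two bytes
euro_bytes = euro.encode('utf-8')          # b'\xe2\x82\xac': three bytes

# Pure ASCII text is byte-for-byte identical in all three encodings.
same_everywhere = (plain.encode('ascii') ==
                   plain.encode('latin-1') ==
                   plain.encode('utf-8'))
```

This is what "backwards compatible with ASCII" means in practice: as long as the text stays below 128, you can't tell the three encodings apart.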
In any given software system, we have to deal with a lot of different potential encodings. Here are some general ones we have to worry about:
As we can see, there are a lot of things that can go wrong. All this sounds rather despairing, doesn't it? The truth of the matter is that most of the time, encodings don't matter at all. In reality, we only need to take encoding into account when:
Suppose we're reading a file and counting the number of bytes in that file. Does the encoding matter? No, we care about bytes, not characters. If, however, we want to count the frequency of characters, we'll need to deal with encodings, since an 'e' without an accent is not the same as an 'é' ('e' with an accent). When we're outputting text, it depends on the system we're outputting to whether we need to deal with encodings. If we're inserting data into a database, and that database expects UTF-8, we'll need to make sure we also output UTF-8. If we're printing to the console, and the system default encoding is ASCII, we'll need to make sure we're outputting ASCII.
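The difference between counting bytes and counting characters can be sketched as follows. The sample data is made up, and I'm assuming we know it is UTF-8.

```python
data = b'Andr\xc3\xa9 \xc3\xa9t\xc3\xa9'   # UTF-8 bytes for the two words "Andre ete", with accents

# Counting bytes: no encoding needed.
byte_count = len(data)                     # 12

# Counting characters: we have to decode first.
text = data.decode('utf-8')
char_count = len(text)                     # 9
e_acute_count = text.count(u'\xe9')        # 3
plain_e_count = text.count(u'e')           # 0
```

Each 'é' takes two bytes in UTF-8, which is why the byte count and the character count differ by three.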
As a software developer, you'll mostly have to deal with the encodings of input and output of your program. This presents us with two major problems:
Sometimes, however, we simply don't know the encodings. When we read input from a text file that contains bytes with values higher than 127, we might simply not know what encoding it is. It could be in the system's default encoding, but it does not have to be. These can be tricky problems. For instance, my system default encoding is ASCII at the moment. Yet running the command apt-cache dumpavail produces output with bytes with values higher than 127. Which encoding is it in? It's probably UTF-8, but we don't know for sure because apt-cache doesn't tell us. As far as the system is concerned, it's in ASCII encoding! Except that ASCII doesn't support bytes with values larger than 127, so in reality the output contains invalid characters.
Usually these problems are solved by making industry standards. For instance, some things always have to be in a certain encoding. Other things may let us know up-front what encoding something is in. An HTML document, for instance, should declare its encoding near the top of the file (in a part that should consist of just ASCII characters so everything can read it).
WARNING: You may encounter what is sometimes called a 'heisenbug'. A heisenbug is a bug which normally does not appear when your program is running, but only appears when you try to look at the data. In the case of encodings, suppose you try to print a Unicode string that contains non-ASCII characters, purely for debugging reasons. The console you're printing on may not be able to deal with them, and as such Python tries to encode the Unicode string to ASCII and fails. This bug would not occur had you not tried to print the data to the console.
Onto the meat of the article! How do we deal with encodings?
First, some basics. Normal strings in Python don't care about encoding:
>>> s = 'Andr\xE9'  # \xE9 == hex for 233 == latin-1 for e-acute.
>>> type(s)
<type 'str'>
>>> print s
Andr�
Note that I'm outputting this to an ASCII terminal, yet Python does not complain about the unsupported character in 's'. It simply shows some garbled text 'Andr�' ('i' with trema, reversed question mark and a 1/2 character).
Python also has support for unicode strings. These unicode strings can contain just about any character in the entire world.
>>> s = u'Andr\xE9'
>>> type(s)
<type 'unicode'>
In this case, we define a Unicode string 'André' ('e' acute) by specifying the 'é' as hexadecimal value E9. Since Unicode strings can contain just about everything, they are ideal as intermediate storage for strings. It is therefore always a good idea to decode any input you receive from its encoding to Unicode, and to encode it to the proper encoding when you output it again.
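The decode-on-input, encode-on-output idea can be sketched like this. The function name shout and its parameters are hypothetical, and the snippet uses b'' literals so that it behaves the same on Python 2.7 and Python 3 (where normal strings are called bytes and Unicode strings are called str).

```python
def shout(raw, input_encoding, output_encoding):
    """Upper-case some encoded text, working on Unicode in between."""
    text = raw.decode(input_encoding)     # bytes -> Unicode at the input boundary
    text = text.upper()                   # work on characters, not on bytes
    return text.encode(output_encoding)   # Unicode -> bytes at the output boundary


# latin-1 in, UTF-8 out: the e-acute becomes a capital E-acute (U+00C9).
result = shout(b'Andr\xe9', 'latin-1', 'utf-8')   # b'ANDR\xc3\x89'
```

Everything between the decode() and the encode() only ever sees Unicode, so the processing step doesn't need to know which encodings are in play.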
>>> print s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)
As you can see in the previous examples, when we use normal Python strings, we don't get an error. But when we use the Unicode string, we get an encoding error. Like I said: normal strings in Python don't care about encoding, but Unicode strings do. So when we try to print it, Python will notice our terminal is in ASCII, and try to convert the string from Unicode to ASCII. This, naturally, fails as ASCII doesn't know about character \xE9. We'll learn how to deal with that in a moment. First, let's look at how to handle input.
So what do we do when reading in an ASCII string when it contains invalid characters? We've got three options: fail with an error (which is what Python does by default), ignore the invalid characters, or replace them.
Ignoring invalid characters simply removes them from the text as the text is decoded from the source encoding to the target encoding. Replacement will replace the unknown character with a placeholder character. When decoding to Unicode, it will be replaced with the character with hexadecimal value FFFD (U+FFFD, the Unicode replacement character), which will be rendered (if your font supports it) as a square or diamond with a question mark in it. When encoding, the placeholder is a plain question mark.
So, let's decode our input (containing invalid characters) from ascii to unicode:
>>> s = 'Andr\xE9'  # ASCII with invalid char \xE9.
>>> s.decode('ascii', 'ignore')
u'Andr'
The 'é' is dropped from the output string as it does not exist in ASCII. We can also replace characters:
>>> s.decode('ascii', 'replace')
u'Andr\ufffd'
This works great. The 'é' is replaced with \uFFFD: the Unicode symbol � (black diamond with a question mark) representing an unsupported character. The decode() method on normal strings decodes from the encoding you specify to a Unicode string. But look at what happens when we do the same thing, but print the result:
>>> print s.decode('ascii', 'replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 4: ordinal not in range(128)
How is this possible? Did we do something wrong? No, everything is alright. This is the heisenbug. What happened here is that the 'replace' option will replace any unknown characters in the ASCII string with the \uFFFD Unicode character. When we try to print the resulting Unicode string to the terminal, Python will try to convert the string to our default system encoding (ASCII) and print it. This will fail, because \uFFFD isn't a character in ASCII either. If we want to print it (for debugging purposes or something) we have to encode it back to ASCII before we can:
>>> print s.decode('ascii', 'replace').encode('ascii', 'replace')
Andr?
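If you find yourself doing this a lot while debugging, the encode step can be wrapped in a small helper. This is just a sketch: to_printable is a made-up name, and defaulting to ASCII is my assumption about the terminal.

```python
def to_printable(text, encoding='ascii'):
    """Return `text` with every character `encoding` can't represent turned into '?'."""
    return text.encode(encoding, 'replace').decode(encoding)


printable = to_printable(u'Andr\ufffd')   # u'Andr?'
print(printable)                          # safe to print on an ASCII terminal
```

The helper decodes again at the end so that the caller gets a Unicode string back that is guaranteed to survive the trip to a terminal using that encoding.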
The Python manual mentions: "When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default [ASCII] encoding". Since Python converts without using any of the replacement options, UnicodeEncodeErrors can occur. We have to encode it ourselves if our data contains invalid characters. The encode() method of a Unicode string encodes the string from Unicode to the encoding you specify. If we were to output the string to an HTML file that is in the UTF-8 encoding, we would do this instead:
s = 'Andr\xE9'
line = s.decode('ascii', 'replace')
f_out = file('foo.html', 'w')
f_out.write('''<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
%s
</body>
</html>''' % (line.encode('utf-8')))
f_out.close()
This will output an HTML document with the string 'Andr�' (where the last character is a diamond with a question mark in it, depending on how your browser displays the Unicode replacement character). We first decode the line from ASCII to Unicode and when we output it, we encode it from Unicode to UTF-8. Remember: UTF-8 isn't the same as Unicode! There's also UTF-16, UTF-32, etc.
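As an alternative to calling encode() by hand, the standard library's io.open() accepts an encoding argument and does the encoding for you when you write Unicode strings to the file. The following is a sketch of that variation, not what the example above does: the temporary directory and the tiny HTML fragment are assumptions made to keep it short and self-contained.

```python
import io
import os
import tempfile

line = b'Andr\xe9'.decode('latin-1')   # Unicode string u'Andr\xe9'

# Write to a scratch directory so the example doesn't clutter anything.
path = os.path.join(tempfile.mkdtemp(), 'foo.html')

# The file object encodes every Unicode string we write to UTF-8.
with io.open(path, 'w', encoding='utf-8') as f_out:
    f_out.write(u'<p>%s</p>' % line)

# Read the raw bytes back to see what actually ended up on disk.
with io.open(path, 'rb') as f_in:
    written = f_in.read()   # b'<p>Andr\xc3\xa9</p>'
```

The upside is that there is only one place where the output encoding is mentioned; the downside is that such a file object only accepts Unicode strings.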
If we know that the string 's' is actually in the latin-1 encoding, we can replace the line:
line = s.decode('ascii', 'replace')
with:
line = s.decode('latin-1', 'replace')
and the output in UTF-8 will become 'André'.
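Here are the two variants side by side, looking at the bytes that end up in the output. The variable names are made up for this sketch.

```python
s = b'Andr\xe9'   # really latin-1, but nothing in the bytes tells us so

# Pretending it's ASCII: the e-acute is lost and U+FFFD is written instead.
as_ascii = s.decode('ascii', 'replace').encode('utf-8')     # b'Andr\xef\xbf\xbd'

# Knowing it's latin-1: the e-acute survives the trip to UTF-8.
as_latin1 = s.decode('latin-1', 'replace').encode('utf-8')  # b'Andr\xc3\xa9'
```

Same input bytes, different output bytes: the only thing that changed is what we told Python about the input.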
Another problem that can occur is when you try to run the decode() method on strings which are already unicode:
>>> s = 'Andr\xE9'
>>> u = s.decode('ascii', 'replace')
>>> u
u'Andr\ufffd'
>>> u.decode('ascii', 'replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 4: ordinal not in range(128)
>>> u.decode('utf-8', 'replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 4: ordinal not in range(128)
As you can see, calling the decode() method on a Unicode string fails here. Note that it is a UnicodeEncodeError we get, even though we asked for a decode: decode() works on bytes, so Python 2 first quietly encodes the Unicode string to a normal string using the default ASCII encoding, and that step fails as soon as the string contains a non-ASCII character. I don't know why Unicode strings have this method in Python 2 in the first place, but in Python 3 the decode() method has been removed from Unicode strings.
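One way to avoid decoding something twice is to check the type before decoding. The helper below is a sketch with a made-up name (to_unicode); it relies on bytes being an alias for the normal string type in Python 2.6+ and the byte string type in Python 3.

```python
def to_unicode(value, encoding='ascii', errors='replace'):
    """Decode `value` if it is still a byte string; leave Unicode strings alone."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


once = to_unicode(b'Andr\xe9')   # u'Andr\ufffd'
twice = to_unicode(once)         # unchanged, and no UnicodeEncodeError
```

It's usually cleaner to decode in exactly one place so that you always know what you have, but a guard like this can help when input arrives from code you don't control.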
It turns out that dealing with encodings in Python is tricky since we have to deal with a large number of potential problem areas. Once you're familiar with the ins and outs though, dealing with encodings becomes rather easy. In general:
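If you know the encoding of your input, decode it to Unicode as soon as you receive it: s = input.decode(input_encoding).
If you don't know the encoding, decode it as ASCII and replace whatever doesn't fit: s = input.decode('ascii', 'replace')
When outputting, encode the Unicode string to the encoding the output expects, using the replace option. (always a good idea)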
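Put together, these rules of thumb could look something like the sketch below. The function names decode_input and encode_output are hypothetical, and falling back to ASCII with 'replace' when the encoding is unknown is simply the approach used in this article, not the only possible one.

```python
def decode_input(raw, input_encoding=None):
    """Turn incoming bytes into a Unicode string."""
    if input_encoding is None:
        # Unknown encoding: keep the ASCII part, mark the rest with U+FFFD.
        return raw.decode('ascii', 'replace')
    return raw.decode(input_encoding)


def encode_output(text, output_encoding):
    """Turn a Unicode string into bytes for output, never raising on odd characters."""
    return text.encode(output_encoding, 'replace')


known = decode_input(b'Andr\xe9', 'latin-1')   # u'Andr\xe9'
unknown = decode_input(b'Andr\xe9')            # u'Andr\ufffd'

out_utf8 = encode_output(known, 'utf-8')       # b'Andr\xc3\xa9'
out_ascii = encode_output(known, 'ascii')      # b'Andr?'
```

Note that decode_input() can still raise a UnicodeDecodeError when the encoding you pass in turns out to be wrong for the data; whether that should be an error or be replaced as well depends on your program.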
I hope I got all this right, and that it makes dealing with encodings in Python a little clearer.
Copyright (c) 2009-2011, Ferry Boender
This document may be freely distributed, in part or as a whole, on any medium, without the prior authorization of the author, provided that this Copyright notice remains intact, and there will be no obstruction as to the further distribution of this document. You may not ask a fee for the contents of this document, though a fee to compensate for the distribution of this document is permitted.
Modifications to this document are permitted, provided that the modified document is distributed under the same license as the original document and no copyright notices are removed from this document. All contents written by an author stays copyrighted by that author.
Failure to comply to one or all of the terms of this license automatically revokes your rights granted by this license.
All brand and product names mentioned in this document are trademarks or registered trademarks of their respective holders.