Understanding Unicode (Part 3): Handling Decoding Errors
In a previous article, we discussed how Python handles encoding errors. In this article, we'll explore the errors that can occur during the decoding process, which is when we convert a sequence of bytes back into a string.
Handling Decoding Errors
Errors in decoding can occur when the byte sequence contains an invalid byte that doesn’t correspond to any character in the chosen encoding scheme. For example:
# Encoding a string in UTF-8
>>> b1 = 'cà phê'.encode('utf-8')
>>> b1
# Outputs: b'c\xc3\xa0 ph\xc3\xaa'
# Creating an invalid byte sequence by removing the byte \xa0
>>> b1_invalid = b'c\xc3 ph\xc3\xaa'
>>> b1_invalid.decode('utf-8')
# Causes an error:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1: invalid continuation byte
When the valid UTF-8 sequence \xc3\xa0 is cut down to \xc3 alone, decoding raises a UnicodeDecodeError. In UTF-8, \xc3 is the lead byte of a two-byte sequence and must be followed by a continuation byte in the range \x80 to \xbf; here it is followed by a space (\x20) instead. It's similar to removing a letter from a word in English: "Go" is a valid word, but if you remove the "o", the remaining "G" isn't a meaningful word on its own.
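If raising an exception is not the behavior you want, Python's bytes.decode accepts an errors argument. The sketch below reuses the invalid bytes from above; the variable names (failure, replaced, ignored) are only illustrative, and which policy is appropriate depends on your data.

```python
# The same invalid byte sequence as above: the lead byte 0xc3 is
# followed by a space instead of a continuation byte.
b1_invalid = b'c\xc3 ph\xc3\xaa'

failure = None
try:
    b1_invalid.decode('utf-8')  # default is errors='strict'
except UnicodeDecodeError as exc:
    # The exception records where decoding failed and why.
    failure = (exc.start, exc.end, exc.reason)

# errors='replace' substitutes U+FFFD (the replacement character)
# for the undecodable byte and keeps going.
replaced = b1_invalid.decode('utf-8', errors='replace')

# errors='ignore' silently drops the undecodable byte.
ignored = b1_invalid.decode('utf-8', errors='ignore')

# ascii() escapes non-ASCII characters so the output is unambiguous.
print(failure)
print(ascii(replaced))
print(ascii(ignored))
```

Both 'replace' and 'ignore' lose information, so they are best kept for display or logging rather than for data you intend to write back.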
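However, some 8-bit encoding schemes like cp1252 or iso8859_1 will decode almost any byte sequence they encounter without reporting an error (iso8859_1 maps all 256 byte values to characters, and Python's cp1252 codec leaves only five byte values, such as 0x81, undefined). Decoding with the wrong one of these therefore produces incorrect text rather than an exception: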
# Attempting to decode the same invalid byte sequence with 'cp1252'
>>> b1_invalid.decode('cp1252')
# Outputs: 'cÃ phÃª'
>>> b'\xc3'.decode('cp1252')
# Outputs: 'Ã'
>>> b'\xaa'.decode('cp1252')
# Outputs: 'ª'
This happens because cp1252 maps each byte to a character on its own, without looking at the surrounding bytes. In UTF-8, the two bytes \xc3\xaa together represent the single character ê. Decoded with cp1252, \xc3 becomes Ã and \xaa becomes ª, so the text comes out garbled.
For example, early in my career, I used an older version of SQL Management Studio to display text from a database. The default encoding scheme of this tool was cp1252. So when it displayed text that was actually encoded with UTF-16 in the database, the result looked strange. Here's an example of how it appeared:
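As a rough sketch of this kind of mismatch (not a reproduction of the tool's actual output), the snippet below encodes a short string as UTF-16 and decodes the bytes as cp1252. It assumes little-endian UTF-16 without a byte order mark; the sample text and variable names are illustrative.

```python
text = 'cà phê'

# UTF-16 uses two bytes for each of these characters; in little-endian
# order the second byte is 0x00 for characters below U+0100.
utf16_bytes = text.encode('utf-16-le')

# cp1252 turns every byte into its own character, so a NUL character
# ends up interleaved after each letter.
shown = utf16_bytes.decode('cp1252')

# ascii() escapes NUL and non-ASCII characters so they are visible.
print(ascii(shown))
print(len(text), len(utf16_bytes), len(shown))
```

How a given tool renders the interleaved NUL characters varies: some show blanks or boxes, others stop at the first NUL. Characters outside the Latin-1 range produce two unrelated symbols instead.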
Conclusion
We’ve discussed what happens when you decode text with a scheme that doesn't match its original encoding. We also saw how popular encodings like utf8 and cp1252 handle byte sequences they cannot interpret correctly: utf8 raises an error, while cp1252 usually returns garbled text without complaint. In the next articles, we'll dive deeper into the complexities of handling accented strings, such as comparing and sorting them. Thanks for reading, and I hope you join me in exploring more in the following articles.