Understanding Unicode (Part 2): Handling Encoding Errors in Python
In the previous article, we discussed essential knowledge about Unicode encoding and decoding. This article continues by exploring how Python handles encoding errors.
How bytes are displayed in the Python console
>>> s1 = 'cà phê'
>>> b1 = s1.encode('utf8')
>>> print(b1)
b'c\xc3\xa0 ph\xc3\xaa' # displays c instead of \x63
>>> b2 = s1.encode('utf16')
>>> print(b2)
b'\xff\xfec\x00\xe0\x00 \x00p\x00h\x00\xea\x00'
>>> b3 = s1.encode('cp1252') # the default encoding on Windows systems
>>> print(b3)
b'c\xe0 ph\xea'
In the output above, each character in cà phê is converted to its corresponding bytes using utf8, utf16, and cp1252. For readability, bytes that correspond to printable ASCII characters (e.g., c, p, h) are displayed as those characters rather than as \x escapes.
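As a quick sanity check, here is a minimal sketch (not from the original session) that builds the first UTF-8 bytes by hand and prints them, showing that 0x63 is rendered as the letter c while the non-ASCII bytes keep their \x escapes.
# A minimal sketch: printable ASCII bytes are shown as characters,
# all other bytes are shown as \x escapes.
raw = bytes([0x63, 0xc3, 0xa0])  # the UTF-8 bytes of 'cà'
print(raw)                       # b'c\xc3\xa0' -> 0x63 is displayed as 'c', not '\x63'
print(raw.decode('utf8'))        # cà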
Handling encoding errors
Most non-UTF encoding schemes handle only a subset of Unicode characters. Only the UTF encodings (UTF-8, UTF-16, UTF-32) can handle all Unicode characters. Applying a non-UTF scheme to unsupported characters results in errors, which Python can handle in several ways:
# the default encoding on most Windows systems is cp1252
>>> s1 = 'cà phê việt nam ngon tuyệt cú mèo'
>>> b1 = s1.encode('cp1252')
UnicodeEncodeError: 'charmap' codec can't encode character '\u1ec7' in position 9: character maps to <undefined>
>>> print('\u1ec7') # cp1252 cannot handle the code point '\u1ec7'
ệ
>>> b2 = s1.encode('cp1252', errors='ignore')
>>> b2
b'c\xe0 ph\xea vit nam ngon tuyt c\xfa m\xe8o'
# việt -> vit, tuyệt -> tuyt because ệ is skipped and not converted to a byte
>>> b2.decode('cp1252') # read the data back from the bytes, but the data has been lost
'cà phê vit nam ngon tuyt cú mèo'
# việt -> vit, tuyệt -> tuyt because ệ no longer exists in the byte sequence
>>> b3 = s1.encode('cp1252', errors='replace')
>>> b3
b'c\xe0 ph\xea vi?t nam ngon tuy?t c\xfa m\xe8o'
# việt -> vi?t, tuyệt -> tuy?t because ệ is replaced by ?
>>> b3.decode('cp1252')
'cà phê vi?t nam ngon tuy?t cú mèo'
# the original letter ệ is lost and replaced by ?
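To see the handlers side by side, here is a minimal sketch (not part of the original session) that round-trips the same string through cp1252 with each option; 'strict' is the default and simply raises.
# A minimal sketch comparing the error handlers on the same string.
s = 'cà phê việt nam ngon tuyệt cú mèo'

for handler in ('strict', 'ignore', 'replace'):
    try:
        encoded = s.encode('cp1252', errors=handler)
        # decode again to see what survived the round trip
        print(handler, '->', encoded.decode('cp1252'))
    except UnicodeEncodeError as exc:
        # 'strict' is the default behaviour: it raises instead of guessing
        print(handler, '-> raised UnicodeEncodeError:', exc.reason)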
This replace mechanism is also used by a lot of software. For example, the default encoding in Notepad++ is UTF-8, so we can type any Unicode character and it displays correctly; but if we convert that text to ANSI encoding (`windows-1258`), every unsupported character is replaced by ?.
A third handler is errors='xmlcharrefreplace'. This option is highly recommended when UTF encodings cannot be used. To understand why it is effective, we first need to look at what XML character references are:
XML character references allow for the inclusion of any Unicode character in an XML document using a specific notation. There are two types of XML character references:
1. Numeric Character References:
- Decimal format: &#codepoint; where codepoint is the decimal value of the Unicode character.
- Hexadecimal format: &#xcodepoint; where codepoint is the hexadecimal value of the Unicode character.
2. Named Character References (Entities):
- These include predefined named entities like &amp; for &, &lt; for <, &gt; for >, &quot; for ", and &apos; for '. These are standard definitions in XML.
Examples and Relations:
- The character 'ệ' has the Unicode code point U+1EC7. Its XML numeric character references would be:
- Decimal: &#7879;
- Hexadecimal: &#x1EC7;
Relation Between Unicode and XML Character References:
- Unicode: Assigns a unique code point to each character.
- XML Standard (W3C): Specifies how to represent these Unicode code points in XML documents using character references.
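Both reference forms can be derived in Python from the character's code point; the sketch below (an illustration, not from the original article) does this for ệ using ord() and ordinary string formatting.
# A minimal sketch: derive both numeric character reference forms of ệ.
ch = 'ệ'
codepoint = ord(ch)            # 7879, i.e. U+1EC7
print(f'&#{codepoint};')       # decimal form:     &#7879;
print(f'&#x{codepoint:X};')    # hexadecimal form: &#x1EC7;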
When an unsupported code point is encountered, the encode method uses the XML character reference associated with that code point. For instance, the character ệ cannot be handled by the Windows-1252 encoding, so the encode method replaces it with its XML character reference &#7879;. This transformation ensures that every character in the replacement string &#7879; is ASCII, which can be encoded by any encoding scheme.
Let's see how this happens in code:
>>> b4=s1.encode('cp1252', errors='xmlcharrefreplace')
>>> b4
b'c\xe0 ph\xea vi&#7879;t nam ngon tuy&#7879;t c\xfa m\xe8o'
# việt -> vi&#7879;t; tuyệt -> tuy&#7879;t because ệ is replaced by &#7879;
>>> b4.decode('cp1252')
'cà phê vi&#7879;t nam ngon tuy&#7879;t cú mèo'
# the original letter ệ is replaced by &#7879;
This mechanism allows developers to retrieve the original characters by analyzing the XML character references in the output text and mapping them back to the actual Unicode characters.
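For example, the standard library's html.unescape resolves numeric character references back to the characters they stand for. The sketch below (one possible approach, assuming html.unescape fits your use case) recovers the original string from the decoded output above.
# A minimal sketch: map the XML character references back to Unicode characters.
import html

decoded = 'cà phê vi&#7879;t nam ngon tuy&#7879;t cú mèo'  # output of b4.decode('cp1252') above
print(html.unescape(decoded))  # cà phê việt nam ngon tuyệt cú mèo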
Conclusion
In this article, we explored the various strategies Python offers for handling encoding errors when working with different character sets. From the straightforward approaches of ignoring or replacing unsupported characters to the more sophisticated method of using XML character references, Python's flexibility allows developers to choose the best strategy based on their specific needs. We will dive into the handling of decoding errors in the next article. Thank you for reading.