Understanding Unicode (Part 2): Handling Encoding Errors in Python

In the previous article, we discussed essential knowledge about Unicode encoding and decoding. This article continues by exploring how Python handles encoding errors.

How bytes are displayed in the Python console


>>> s1 = 'cà phê'
>>> b1 = s1.encode('utf8')
>>> print(b1)
b'c\xc3\xa0 ph\xc3\xaa'  # 'c' is displayed instead of \x63
>>> b2 = s1.encode('utf16')
>>> print(b2)
b'\xff\xfec\x00\xe0\x00 \x00p\x00h\x00\xea\x00'
>>> b3 = s1.encode('cp1252')  # a common default legacy encoding on Windows systems
>>> print(b3)
b'c\xe0 ph\xea'

In the output above, each character of cà phê is converted to its corresponding bytes using utf8, utf16, and cp1252. In a bytes literal, bytes whose values are printable ASCII (e.g., c, p, h) are displayed as characters rather than as \x escapes, for readability.
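As a quick check, the snippet below shows how the bytes repr works: a byte whose value is printable ASCII is shown as that character, while every other byte is shown as a \x escape.

>>> bytes([0x63, 0xc3, 0xa0])   # 0x63 is the ASCII code of 'c'
b'c\xc3\xa0'
>>> bytes([0x63, 0xc3, 0xa0]) == 'cà'.encode('utf8')
True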

Handling encoding errors

Most non-UTF encoding schemes handle only a subset of Unicode characters; only the UTF encodings (UTF-8, UTF-16, UTF-32) can represent every Unicode character. Applying a non-UTF scheme to unsupported characters results in an error, which Python can handle in several ways:

  • strict: the default error handler; it raises a UnicodeEncodeError if the string contains unsupported characters.


# the default legacy encoding on most Windows systems is cp1252
>>> s1 = 'cà phê việt nam ngon tuyệt cú mèo'
>>> b1 = s1.encode('cp1252')
UnicodeEncodeError: 'charmap' codec can't encode character '\u1ec7' in position 9: character maps to <undefined>
>>> print('\u1ec7')  # cp1252 cannot handle the code point '\u1ec7'
ệ
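Because strict is an exception-based strategy, you can catch the UnicodeEncodeError and inspect exactly which character failed; its standard attributes object, start, end, and reason describe the offending slice. A minimal sketch:

>>> try:
...     s1.encode('cp1252')
... except UnicodeEncodeError as e:
...     # e.object is the original string, e.start:e.end marks the bad character
...     print(repr(e.object[e.start:e.end]), '->', e.reason)
...
'ệ' -> character maps to <undefined>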
        

  • ignore: Skips unsupported characters, causing data loss without warning.

>>> b2 = s1.encode('cp1252', errors='ignore')
>>> b2
b'c\xe0 ph\xea vit nam ngon tuyt c\xfa m\xe8o'
# việt -> vit, tuyệt -> tuyt because ệ is skipped and not converted to bytes

>>> b2.decode('cp1252')  # read the data back, but it has been lost
'cà phê vit nam ngon tuyt cú mèo'
# việt -> vit, tuyệt -> tuyt because ệ does not exist in the byte sequence

  • replace: Replaces unsupported characters with ?, also leading to data loss.

>>> b3 = s1.encode('cp1252', errors='replace')
>>> b3
b'c\xe0 ph\xea vi?t nam ngon tuy?t c\xfa m\xe8o'
# việt -> vi?t, tuyệt -> tuy?t because ệ is replaced by ?

>>> b3.decode('cp1252')
'cà phê vi?t nam ngon tuy?t cú mèo'
# the original letter ệ is lost and replaced by ?

This mechanism is also implemented in many applications. For example, Notepad++ uses UTF-8 by default, so we can type any Unicode character and it displays correctly; but if we convert that text to ANSI encoding (`windows-1258` on a Vietnamese system), the unsupported characters are replaced by ?.

How Notepad++ handles encoding errors with unsupported characters
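We can reproduce the same behavior in Python. The sketch below assumes Python's built-in cp1258 codec (the Vietnamese ANSI code page); like Notepad++'s ANSI conversion, it has no mapping for the precomposed letter ệ, so errors='replace' turns it into ?:

>>> # cp1258 (assumed here) has no slot for the precomposed 'ệ'
>>> 'cà phê việt nam'.encode('cp1258', errors='replace')
b'c\xe0 ph\xea vi?t nam'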

  • xmlcharrefreplace:

This option is highly recommended when UTF encodings cannot be used. To understand its effectiveness, we first need to delve into what XML character references are:

XML character references allow for the inclusion of any Unicode character in an XML document using a specific notation. There are two types of XML character references:

1. Numeric Character References:

- Decimal format: &#codepoint; where codepoint is the decimal value of the Unicode character.

- Hexadecimal format: &#xcodepoint; where codepoint is the hexadecimal value of the Unicode character.

2. Named Character References (Entities):

- These include predefined named entities like &amp; for &, &lt; for <, &gt; for >, &quot; for ", and &apos; for '. These are standard definitions in XML.

Examples and Relations:

- The character 'ệ' has a Unicode code point of U+1EC7. Its XML numeric character references would be:

- Decimal: &#7879;

- Hexadecimal: &#x1EC7;

Relation Between Unicode and XML Character References:

- Unicode: Assigns a unique code point to each character.

- XML Standard (W3C): Specifies how to represent these Unicode code points in XML documents using character references.
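To make the relationship concrete: the numeric reference is simply the character's Unicode code point written between &# and ; (in decimal, or with an x prefix in hexadecimal). A quick check in the Python console:

>>> ord('ệ'), hex(ord('ệ'))       # the code point, as decimal and hex
(7879, '0x1ec7')
>>> '&#{};'.format(ord('ệ')), '&#x{:X};'.format(ord('ệ'))
('&#7879;', '&#x1EC7;')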

When an unsupported code point is encountered, the encode method substitutes the XML character reference for that code point. For instance, the character ệ, which Windows-1252 cannot handle, is replaced with its XML character reference &#7879;. Every character in the replacement string &#7879; is ASCII, so it can be encoded by any encoding scheme.

The unencodable letter is replaced by its XML numeric character reference

Let's see how this happens in code:

>>> b4 = s1.encode('cp1252', errors='xmlcharrefreplace')
>>> b4
b'c\xe0 ph\xea vi&#7879;t nam ngon tuy&#7879;t c\xfa m\xe8o'
# việt -> vi&#7879;t, tuyệt -> tuy&#7879;t because ệ is replaced by &#7879;

>>> b4.decode('cp1252')
'cà phê vi&#7879;t nam ngon tuy&#7879;t cú mèo'
# the original letter ệ is replaced by &#7879;

This mechanism allows developers to retrieve the original characters by analyzing the XML character references in the output text and mapping them back to the actual Unicode characters.
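In Python this reverse mapping is already available in the standard library: html.unescape converts numeric (and named) character references back into the characters they represent, so the original text can be recovered like this:

>>> import html
>>> html.unescape(b4.decode('cp1252'))   # restore ệ from &#7879;
'cà phê việt nam ngon tuyệt cú mèo'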

Conclusion

In this article, we explored the various strategies Python offers for handling encoding errors when working with different character sets. From the straightforward approaches of ignoring or replacing unsupported characters to the more sophisticated method of using XML character references, Python's flexibility allows developers to choose the best strategy based on their specific needs. We will dive into the handling of decoding errors in the next article. Thank you for reading.

