Understanding Unicode (Part 2): Handling Encoding Errors in Python

In the previous article, we discussed essential knowledge about Unicode encoding and decoding. This article continues by exploring how Python handles encoding errors.

How bytes are displayed in the Python console


>>> s1 = 'cà phê'
>>> b1 = s1.encode('utf8')
>>> print(b1)
b'c\xc3\xa0 ph\xc3\xaa'  # 'c' is displayed instead of \x63
>>> b2 = s1.encode('utf16')
>>> print(b2)
b'\xff\xfec\x00\xe0\x00 \x00p\x00h\x00\xea\x00'
>>> b3 = s1.encode('cp1252')  # a common default legacy encoding on Windows systems
>>> print(b3)
b'c\xe0 ph\xea'

In the output above, each character of cà phê is converted to its corresponding bytes using utf8, utf16, and cp1252. In a bytes literal, bytes whose values are printable ASCII (e.g., c, p, h) are displayed as characters rather than as \x escapes, for readability.
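As a quick check, the snippet below shows how the bytes repr works: a byte whose value is printable ASCII is shown as that character, while every other byte is shown as a \x escape.

>>> bytes([0x63, 0xc3, 0xa0])   # 0x63 is the ASCII code of 'c'
b'c\xc3\xa0'
>>> bytes([0x63, 0xc3, 0xa0]) == 'cà'.encode('utf8')
True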

Handling encoding errors

Most non-UTF encoding schemes handle only a subset of Unicode characters; only the UTF encodings (UTF-8, UTF-16, UTF-32) can represent every Unicode character. Applying a non-UTF scheme to unsupported characters results in an error, which Python can handle in several ways:

  • strict: the default error handler; it raises a UnicodeEncodeError if the string contains unsupported characters.


# the default legacy encoding on most Windows systems is cp1252
>>> s1 = 'cà phê việt nam ngon tuyệt cú mèo'
>>> b1 = s1.encode('cp1252')
UnicodeEncodeError: 'charmap' codec can't encode character '\u1ec7' in position 9: character maps to <undefined>
>>> print('\u1ec7')  # cp1252 cannot handle the code point '\u1ec7'
ệ
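Because strict is an exception-based strategy, you can catch the UnicodeEncodeError and inspect exactly which character failed; its standard attributes object, start, end, and reason describe the offending slice. A minimal sketch:

>>> try:
...     s1.encode('cp1252')
... except UnicodeEncodeError as e:
...     # e.object is the original string, e.start:e.end marks the bad character
...     print(repr(e.object[e.start:e.end]), '->', e.reason)
...
'ệ' -> character maps to <undefined>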
        

  • ignore: Skips unsupported characters, causing data loss without warning.

>>> b2 = s1.encode('cp1252', errors='ignore')
>>> b2
b'c\xe0 ph\xea vit nam ngon tuyt c\xfa m\xe8o'
# việt -> vit, tuyệt -> tuyt because ệ is skipped and not converted to bytes

>>> b2.decode('cp1252')  # read the data back, but it has been lost
'cà phê vit nam ngon tuyt cú mèo'
# việt -> vit, tuyệt -> tuyt because ệ does not exist in the byte sequence

  • replace: Replaces unsupported characters with ?, also leading to data loss.

>>> b3 = s1.encode('cp1252', errors='replace')
>>> b3
b'c\xe0 ph\xea vi?t nam ngon tuy?t c\xfa m\xe8o'
# việt -> vi?t, tuyệt -> tuy?t because ệ is replaced by ?

>>> b3.decode('cp1252')
'cà phê vi?t nam ngon tuy?t cú mèo'
# the original letter ệ is lost and replaced by ?

This mechanism is also implemented in many applications. For example, Notepad++ uses UTF-8 by default, so we can type any Unicode character and it displays correctly; but if we convert that text to ANSI encoding (`windows-1258` on a Vietnamese system), the unsupported characters are replaced by ?.

How Notepad++ handles encoding errors with unsupported characters
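We can reproduce the same behavior in Python. The sketch below assumes Python's built-in cp1258 codec (the Vietnamese ANSI code page); like Notepad++'s ANSI conversion, it has no mapping for the precomposed letter ệ, so errors='replace' turns it into ?:

>>> # cp1258 (assumed here) has no slot for the precomposed 'ệ'
>>> 'cà phê việt nam'.encode('cp1258', errors='replace')
b'c\xe0 ph\xea vi?t nam'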

  • xmlcharrefreplace:

This option is highly recommended when UTF encodings cannot be used. To understand its effectiveness, we first need to delve into what XML character references are:

XML character references allow for the inclusion of any Unicode character in an XML document using a specific notation. There are two types of XML character references:

1. Numeric Character References:

- Decimal format: &#codepoint; where codepoint is the decimal value of the Unicode character.

- Hexadecimal format: &#xcodepoint; where codepoint is the hexadecimal value of the Unicode character.

2. Named Character References (Entities):

- These include predefined named entities like &amp; for &, &lt; for <, &gt; for >, &quot; for ", and &apos; for '. These are standard definitions in XML.

Examples and Relations:

- The character 'ệ' has a Unicode code point of U+1EC7. Its XML numeric character references would be:

- Decimal: &#7879;

- Hexadecimal: &#x1EC7;

Relation Between Unicode and XML Character References:

- Unicode: Assigns a unique code point to each character.

- XML Standard (W3C): Specifies how to represent these Unicode code points in XML documents using character references.
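To make the relationship concrete: the numeric reference is simply the character's Unicode code point written between &# and ; (in decimal, or with an x prefix in hexadecimal). A quick check in the Python console:

>>> ord('ệ'), hex(ord('ệ'))       # the code point, as decimal and hex
(7879, '0x1ec7')
>>> '&#{};'.format(ord('ệ')), '&#x{:X};'.format(ord('ệ'))
('&#7879;', '&#x1EC7;')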

When an unsupported code point is encountered, the encode method substitutes the XML character reference for that code point. For instance, the character ệ, which Windows-1252 cannot handle, is replaced with its XML character reference &#7879;. Every character in the replacement string &#7879; is ASCII, so it can be encoded by any encoding scheme.

The unencodable letter is replaced by its XML numeric character reference

Let's see how this happens in code:

>>> b4 = s1.encode('cp1252', errors='xmlcharrefreplace')
>>> b4
b'c\xe0 ph\xea vi&#7879;t nam ngon tuy&#7879;t c\xfa m\xe8o'
# việt -> vi&#7879;t, tuyệt -> tuy&#7879;t because ệ is replaced by &#7879;

>>> b4.decode('cp1252')
'cà phê vi&#7879;t nam ngon tuy&#7879;t cú mèo'
# the original letter ệ is replaced by &#7879;

This mechanism allows developers to retrieve the original characters by analyzing the XML character references in the output text and mapping them back to the actual Unicode characters.
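In Python this reverse mapping is already available in the standard library: html.unescape converts numeric (and named) character references back into the characters they represent, so the original text can be recovered like this:

>>> import html
>>> html.unescape(b4.decode('cp1252'))   # restore ệ from &#7879;
'cà phê việt nam ngon tuyệt cú mèo'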

Conclusion

In this article, we explored the various strategies Python offers for handling encoding errors when working with different character sets. From the straightforward approaches of ignoring or replacing unsupported characters to the more sophisticated method of using XML character references, Python's flexibility allows developers to choose the best strategy based on their specific needs. We will dive into the handling of decoding errors in the next article. Thank you for reading.

