Understanding Unicode (Part 3): Handle Decoding Error

Vu Truong Huu

Senior .NET Developer at VNPT

Published: June 18, 2024

In a previous article, we discussed how Python handles encoding errors. In this article, we'll explore the errors that can occur during the decoding process, which is when we convert a sequence of bytes back into a string.

Handling Decoding Errors

Decoding errors occur when the byte sequence contains a byte, or a combination of bytes, that is not valid in the chosen encoding scheme. For example:

# Encoding a string in UTF-8
>>> b1 = 'cà phê'.encode('utf-8')
# b1 is now: b'c\xc3\xa0 ph\xc3\xaa'

# Creating an invalid byte sequence by removing the byte \xa0
>>> b1_invalid = b'c\xc3 ph\xc3\xaa'

>>> b1_invalid.decode('utf8')
# Causes an error:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1: invalid continuation byte

In UTF-8, \xc3 is the first byte of a two-byte sequence and must be followed by a continuation byte in the range \x80 to \xbf. When the valid sequence \xc3\xa0 is cut down to \xc3 alone, the next byte (a space, \x20) is not a continuation byte, so a UnicodeDecodeError is raised. It's similar to removing a letter from a word in English: "Go" is a valid word, but if you remove the "o", the remaining "G" isn't a meaningful word on its own.
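The error above can be handled in a few ways. The following is a minimal sketch, not the only approach: it uses the standard `errors` argument of `bytes.decode` together with a plain `try`/`except`. The variable names (`replaced`, `ignored`, `escaped`) are illustrative only.

```python
# Same invalid byte sequence as above: the continuation byte \xa0 is missing.
b1_invalid = b'c\xc3 ph\xc3\xaa'

# Option 1: catch the exception and inspect where decoding failed.
try:
    b1_invalid.decode('utf-8')
except UnicodeDecodeError as exc:
    # start/end are the offsets of the offending bytes in the input.
    print(exc.start, exc.end, exc.reason)  # 1 2 invalid continuation byte

# Option 2: substitute U+FFFD (the replacement character) for bad bytes.
replaced = b1_invalid.decode('utf-8', errors='replace')        # 'c\ufffd phê'

# Option 3: silently drop bad bytes (this loses data).
ignored = b1_invalid.decode('utf-8', errors='ignore')          # 'c phê'

# Option 4: keep the raw byte visible as an escape sequence.
escaped = b1_invalid.decode('utf-8', errors='backslashreplace')  # 'c\\xc3 phê'
```

Which handler fits depends on the use case: `'replace'` and `'backslashreplace'` keep a visible trace of the damage, while `'ignore'` hides it.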

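However, some 8-bit encoding schemes decode almost any byte sequence without raising an error, which leads to incorrectly displayed text. iso8859_1 maps every one of the 256 byte values to a character, and Python's cp1252 codec maps all but five of them (\x81, \x8d, \x8f, \x90, and \x9d are undefined):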

# Attempting to decode the same invalid byte sequence with 'cp1252'

>>> b1_invalid.decode('cp1252')
# Outputs: 'cÃ phÃª'

>>> b'\xc3'.decode('cp1252')
# Outputs: 'Ã'

>>> b'\xaa'.decode('cp1252')
# Outputs: 'ª'

cp1252 is a single-byte encoding: it decodes each byte independently, without considering the bytes around it. In UTF-8, the two bytes \xc3\xaa together represent the character ê. Decoded with cp1252, however, \xc3 becomes Ã and \xaa becomes ª, resulting in garbled text.
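To make the byte-by-byte behaviour concrete, here is a small sketch using only built-in codecs. It decodes valid UTF-8 with the wrong codec, then reverses the mistake. The reversal only works when every byte survived the wrong decoding, which is an assumption that holds for this sample string but not in general. The last part checks which byte values Python's cp1252 codec rejects, in contrast to latin-1.

```python
original = 'cà phê'
utf8_bytes = original.encode('utf-8')  # b'c\xc3\xa0 ph\xc3\xaa'

# Wrong codec: no error is raised, but each byte becomes its own character.
mojibake = utf8_bytes.decode('cp1252')
print(repr(mojibake))  # 'cÃ\xa0 phÃª'

# Undo the mistake: re-encode with the wrong codec, decode with the right one.
repaired = mojibake.encode('cp1252').decode('utf-8')

# latin-1 (iso8859_1) accepts all 256 byte values.
latin1_all = bytes(range(256)).decode('latin-1')

# Python's cp1252 codec leaves a handful of byte values undefined.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode('cp1252')
    except UnicodeDecodeError:
        undefined.append(value)
print([hex(v) for v in undefined])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Because the wrong decoding raises no exception, this kind of mismatch usually shows up only when someone looks at the text.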

For example, early in my career, I used an older version of SQL Management Studio to display text from a database. The default encoding scheme of this tool was cp1252. So when it displayed text that was actually encoded with UTF-16 in the database, the result looked strange. Here's an example of how it appeared:

[Figure: the incorrect decoding scheme cp1252 applied to a string originally encoded as UTF-16]
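The effect can be approximated in Python. This sketch does not reproduce the original screenshot; it assumes little-endian UTF-16 without a byte order mark and reuses the sample string from earlier in the article.

```python
text = 'cà phê'

# Assumption: the stored bytes are UTF-16 little-endian, two bytes per character here.
utf16_bytes = text.encode('utf-16-le')

# Decoding with cp1252 treats every byte as a separate character, so the
# zero high byte of each UTF-16 code unit turns into a NUL character.
wrong = utf16_bytes.decode('cp1252')
print(repr(wrong))  # 'c\x00à\x00 \x00p\x00h\x00ê\x00'

# Decoding with the matching scheme restores the original text.
right = utf16_bytes.decode('utf-16-le')
```

Characters outside the Latin-1 range would come out even more distorted, since neither of their two bytes maps back to the original character.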

Conclusion

We’ve discussed what happens when you decode text with a scheme that doesn't match its original encoding. We also saw how popular encodings like utf8 and cp1252 handle byte sequences. In upcoming articles, we'll dive deeper into the complexities of handling accented strings, such as comparing and sorting them. Thanks for reading, and I hope you join me in exploring more in the following articles.

Tan Tran Minh

Presales Assistant @ SoftwareOne | Modern Work & Security, Data Engineering

9 months ago

Thanks for sharing!

Manh Vu Dinh

Database Administrator at Wecommit Việt Nam

9 months ago

very helpful

Tùng Phạm

Software Engineer, AI and Algorithms Enthusiast.

9 months ago

I once built a feature to convert between utf8 and unicode (Japanese, Korean, Chinese...), but some unicode characters couldn't be encoded to utf8, so the encoders would look for a similar character in the unicode table as a substitute. As a result, decoding didn't give back the original unicode string. All in all, it was exhausting.

Bùi Minh Hoàng

Software Engineer

9 months ago

I'll keep this in mind
