
Understanding Unicode: Safely Comparing Accented Strings in Python to Prevent Common Errors

Vu Truong Huu

Senior .NET Developer at VNPT

Published: June 20, 2024

In this article, we'll explore how to accurately compare two accented strings in Python, avoiding typical mistakes that lead to incorrect results.

Understanding Accented Strings in Unicode

Accented characters, such as à, can be constructed in Unicode in multiple ways. For instance:

à can be represented as a single codepoint: U+00E0
Alternatively, it can be the character a followed by the combining mark U+0300, which is known as the COMBINING GRAVE ACCENT.
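The two constructions above can be inspected directly in Python. The following is a minimal sketch using the standard unicodedata module; the helper name describe is only an illustrative choice, not something from the article.

```python
import unicodedata

precomposed = "\u00e0"   # 'à' as a single code point
combined = "a\u0300"     # 'a' followed by the combining mark

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

print(describe(precomposed))
# [('U+00E0', 'LATIN SMALL LETTER A WITH GRAVE')]
print(describe(combined))
# [('U+0061', 'LATIN SMALL LETTER A'), ('U+0300', 'COMBINING GRAVE ACCENT')]
print(len(precomposed), len(combined))  # 1 2
```

Both values render as the same glyph, but len() already shows that they are different strings.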

In Python, these differences in construction lead to variations in how strings are compared:

>>> s1 = 'c\u00e0 phê'   # 'à' as the single code point U+00E0
>>> s1
'cà phê'
>>> s2 = 'ca\u0300 phê'  # 'à' built from 'a' + '\u0300'
>>> s2
'cà phê'

>>> # Direct comparison without normalization
>>> print(s1 == s2)
False

Although cà phê and ca\u0300 phê look identical on screen, they are different sequences of code points, so == reports them as unequal. To compare them by meaning rather than by their underlying Unicode construction, we need to normalize both strings first.

Implementing Unicode Normalization in Python

Python's unicodedata.normalize supports four normalization forms (NFC, NFD, NFKC and NFKD). The two canonical forms are the ones relevant here:

NFC (Normalization Form C): Decomposes characters and then recomposes them into precomposed code points wherever one exists. For example, a\u0300 becomes à.
NFD (Normalization Form D): Decomposes characters into their base characters followed by any associated combining marks. For example, à becomes a followed by \u0300.
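As a small illustration of the two forms on a single character, the sketch below converts in both directions. It also uses unicodedata.is_normalized, which is not mentioned in the article and requires Python 3.8 or later.

```python
import unicodedata

composed = "\u00e0"      # 'à' as one code point
decomposed = "a\u0300"   # 'a' + COMBINING GRAVE ACCENT

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed, len(nfc))      # True 1
print(nfd == decomposed, len(nfd))    # True 2

# is_normalized (Python 3.8+) checks a form without building a new string
print(unicodedata.is_normalized("NFC", decomposed))  # False
print(unicodedata.is_normalized("NFD", decomposed))  # True
```

Either form works for comparison, as long as both strings are normalized to the same one.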

Here's how you can apply these normalizations:

>>> from unicodedata import normalize

>>> # Normalize both strings to NFC
>>> s1_nfc = normalize('NFC', s1)
>>> s2_nfc = normalize('NFC', s2)
>>> print(len(s1_nfc), len(s2_nfc))
6 6
>>> print(s1_nfc == s2_nfc)
True

>>> # Normalize both strings to NFD
>>> s1_nfd = normalize('NFD', s1)
>>> s2_nfd = normalize('NFD', s2)
>>> print(len(s1_nfd), len(s2_nfd))
8 8
>>> print(s1_nfd == s2_nfd)
True
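In application code, the normalize-then-compare step can be wrapped in a small helper. This is only a sketch: the function name same_text and its default form are assumptions made for illustration, not part of the article or of the standard library.

```python
from unicodedata import normalize

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return normalize(form, a) == normalize(form, b)

s1 = "c\u00e0 ph\u00ea"     # precomposed à and ê
s2 = "ca\u0300 phe\u0302"   # both accents written as combining marks

print(s1 == s2)                  # False
print(same_text(s1, s2))         # True
print(same_text(s1, s2, "NFD"))  # True
print(same_text(s1, "ca phe"))   # False: normalization keeps the accents
```

Note that normalization only unifies different encodings of the same accented text; it does not make accented and unaccented strings equal.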

Conclusion

Strings that look identical on screen can be built from different sequences of code points. By normalizing strings to a common form before comparing them, we ensure the comparison reflects their intended meaning. This is particularly important when handling internationalized text data.

