Understanding Unicode: Safely Comparing Accented Strings in Python to Prevent Common Errors

In this article, we'll explore how to accurately compare two accented strings in Python, avoiding typical mistakes that lead to incorrect results.

Understanding Accented Strings in Unicode

Accented characters, such as à, can be constructed in Unicode in multiple ways. For instance:

  • à can be represented as a single codepoint: U+00E0
  • Alternatively, it can be a combination of the character a (U+0061) followed by the combining mark ◌̀ (U+0300), which is known as the COMBINING GRAVE ACCENT.

[Figure: code point for the single character]
[Figure: code point for the combining mark]
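To make the two representations concrete, the following minimal sketch lists the code points behind each form using the standard unicodedata module. The helper name describe is hypothetical and only used for illustration.

```python
import unicodedata

precomposed = "\u00e0"   # 'à' as a single code point
combined = "a\u0300"     # 'a' followed by COMBINING GRAVE ACCENT


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text.

    Hypothetical helper, not part of any library.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


print(describe(precomposed))
# [('U+00E0', 'LATIN SMALL LETTER A WITH GRAVE')]
print(describe(combined))
# [('U+0061', 'LATIN SMALL LETTER A'), ('U+0300', 'COMBINING GRAVE ACCENT')]
```

Both strings render as à, but the first has length 1 and the second has length 2.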

In Python, strings are compared code point by code point, so these two constructions do not compare equal even though they look the same:

# s1 uses the precomposed 'à' (U+00E0); s2 builds 'à' from 'a' and '\u0300'
>>> s1 = 'c\u00e0 phê'
>>> s1
'cà phê'
>>> s2 = 'ca\u0300 phê'
>>> s2
'cà phê'

# Direct comparison without normalization
>>> print(s1 == s2)
False
        

Although cà phê and ca\u0300 phê look identical when displayed, they are different sequences of code points, so Python treats them as different strings. To compare what the strings mean rather than how they happen to be encoded, we need to normalize them first.
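The difference can be inspected directly. The sketch below assumes, as in the example above, that s1 uses precomposed characters and s2 uses a combining mark for à; the helper code_points is a hypothetical name introduced for illustration.

```python
s1 = "c\u00e0 ph\u00ea"    # assumption: precomposed 'à' and 'ê'
s2 = "ca\u0300 ph\u00ea"   # 'a' + combining grave accent, precomposed 'ê'


def code_points(text):
    """Return the code points of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(len(s1), len(s2))
# 6 7
print(code_points(s1))
# ['U+0063', 'U+00E0', 'U+0020', 'U+0070', 'U+0068', 'U+00EA']
print(code_points(s2))
# ['U+0063', 'U+0061', 'U+0300', 'U+0020', 'U+0070', 'U+0068', 'U+00EA']

# Index of the first position where the two strings differ
first_diff = next(i for i, (x, y) in enumerate(zip(s1, s2)) if x != y)
print(first_diff)
# 1
```

The strings already diverge at index 1, which is why == returns False.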

Implementing Unicode Normalization in Python

Python's unicodedata module supports four normalization forms (NFC, NFD, NFKC, and NFKD). The two canonical forms are the ones relevant here:

  • NFC (Normalization Form C): Decomposes characters and then recomposes them into precomposed characters wherever possible. For example, a\u0300 becomes à.
  • NFD (Normalization Form D): Decomposes characters into their base characters followed by any associated combining marks. For example, à becomes a followed by \u0300.
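The two forms described above can be checked on a single character. This is a small sketch using unicodedata.normalize together with unicodedata.is_normalized, which is assumed to be available (it was added in Python 3.8).

```python
import unicodedata

composed = "\u00e0"     # 'à' as one code point
decomposed = "a\u0300"  # 'a' + COMBINING GRAVE ACCENT

# NFC composes, NFD decomposes
nfc_result = unicodedata.normalize("NFC", decomposed)
nfd_result = unicodedata.normalize("NFD", composed)

print(nfc_result == composed, len(nfc_result))
# True 1
print(nfd_result == decomposed, len(nfd_result))
# True 2

# is_normalized reports whether a string is already in a given form
print(unicodedata.is_normalized("NFC", composed))
# True
print(unicodedata.is_normalized("NFC", decomposed))
# False
print(unicodedata.is_normalized("NFD", decomposed))
# True
```

Either form works for comparison, as long as both strings are normalized to the same one.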

Here's how you can apply these normalizations:

>>> from unicodedata import normalize

# Normalize both strings to NFC
>>> s1_nfc = normalize('NFC', s1)
>>> s2_nfc = normalize('NFC', s2)
>>> print(len(s1_nfc), len(s2_nfc))
6 6
>>> print(s1_nfc == s2_nfc)
True

# Normalize both strings to NFD
>>> s1_nfd = normalize('NFD', s1)
>>> s2_nfd = normalize('NFD', s2)
# Both lengths are 8, because 'à' and 'ê' are each split into a base letter and a mark
>>> print(len(s1_nfd), len(s2_nfd))
8 8
>>> print(s1_nfd == s2_nfd)
True
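In practice, the normalization step is often wrapped in a small helper so it is not forgotten at call sites. The following is a minimal sketch, not a definitive implementation; the function names equal_normalized and dedupe are hypothetical, and NFC is chosen as the default form only by assumption.

```python
from unicodedata import normalize


def equal_normalized(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return normalize(form, a) == normalize(form, b)


def dedupe(strings, form="NFC"):
    """Drop strings that are equal after normalization, keeping first-seen order.

    The returned strings are the normalized versions.
    """
    seen = set()
    result = []
    for s in strings:
        key = normalize(form, s)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


words = ["c\u00e0 ph\u00ea", "ca\u0300 phe\u0302", "tr\u00e0"]

print(equal_normalized(words[0], words[1]))
# True
print(len(dedupe(words)))
# 2
```

The same idea applies to dictionary keys and set members: normalizing on the way in keeps visually identical strings from being stored as separate entries.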
        

Conclusion

Strings that look identical on screen can be built from different sequences of code points, and Python's == compares those sequences, not what is displayed. Normalizing both strings to the same form (NFC or NFD) before comparing them ensures that the comparison reflects their intended meaning. This is particularly important when dealing with internationalized text data, where both forms can appear in the same dataset.

Thanh Nguyen

Software Engineer is human too

5 months ago

Very informative. If I understand correctly, we can convert from Single code point to Combine and vice versa.

Dương Xuan Đà

Java Software Engineer | Oracle Certified Professional

5 months ago

Interesting!

Đinh Quang Tùng

Backend Developer, Let's connect

5 months ago

Very helpful, thank you so much
