Understanding Unicode: Safely Comparing Accented Strings in Python to Prevent Common Errors
In this article, we'll explore how to accurately compare two accented strings in Python, avoiding typical mistakes that lead to incorrect results.
Understanding Accented Strings in Unicode
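Accented characters, such as à, can be constructed in Unicode in multiple ways. For instance, à can be stored as the single precomposed code point U+00E0, or as the letter a (U+0061) followed by the combining grave accent (U+0300).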
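To make the two constructions concrete, the short sketch below lists the code points behind each form of à using the standard unicodedata.name function. The names precomposed, combined, and describe are illustrative and do not come from the original article.

```python
import unicodedata

# Two ways to build 'à' (variable names are illustrative)
precomposed = '\u00e0'   # one code point: the precomposed letter
combined = 'a\u0300'     # two code points: base letter + combining accent

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f'U+{ord(ch):04X}', unicodedata.name(ch)) for ch in text]

print(describe(precomposed))
print(describe(combined))
print(len(precomposed), len(combined))
```

Both values render as the same glyph, but the first has length 1 and the second has length 2.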
In Python, these differences in construction lead to variations in how strings are compared:
# s1 uses the precomposed code point U+00E0 for 'à'
>>> s1 = 'c\u00e0 phê'
>>> print(s1)
# Output: cà phê
# s2 builds 'à' from 'a' followed by the combining grave accent U+0300
>>> s2 = 'ca\u0300 phê'
>>> print(s2)
# Output: cà phê
# Direct comparison without normalization
>>> print(s1 == s2)
# Output: False
Although cà phê and ca\u0300 phê render identically, they are different sequences of code points, so == reports them as unequal. To compare them by meaning rather than by their underlying Unicode construction, we need to normalize both strings first.
Implementing Unicode Normalization in Python
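Python's unicodedata module supports four normalization forms. The two canonical forms matter here: NFC, which composes a base character and its combining marks into a single code point where one exists, and NFD, which decomposes characters into a base character followed by its combining marks.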
Here's how you can apply these normalizations:
>>> from unicodedata import normalize
# Normalize both strings to NFC (composed form)
>>> s1_nfc = normalize('NFC', s1)
>>> s2_nfc = normalize('NFC', s2)
>>> print(len(s1_nfc), len(s2_nfc))
# Output: 6 6
>>> print(s1_nfc == s2_nfc)
# Output: True
# Normalize both strings to NFD (decomposed form)
>>> s1_nfd = normalize('NFD', s1)
>>> s2_nfd = normalize('NFD', s2)
>>> print(len(s1_nfd), len(s2_nfd))
# Output: 8 8 (both 'à' and 'ê' split into a base letter plus a combining mark)
>>> print(s1_nfd == s2_nfd)
# Output: True
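If the same comparison is needed in several places, it can be wrapped in a small helper. The following is a minimal sketch rather than a definitive implementation: equal_normalized is a hypothetical name, and defaulting to NFC is an assumption made for this example, since either canonical form gives the same answer for equality.

```python
from unicodedata import normalize

def equal_normalized(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form.

    `form` is passed straight to unicodedata.normalize; NFC is an
    assumed default for this sketch.
    """
    return normalize(form, a) == normalize(form, b)

s1 = 'c\u00e0 ph\u00ea'   # precomposed 'à' and 'ê'
s2 = 'ca\u0300 ph\u00ea'  # 'a' + combining grave accent, precomposed 'ê'

print(s1 == s2)                              # False
print(equal_normalized(s1, s2))              # True
print(equal_normalized(s1, s2, form='NFD'))  # True
```

Normalizing both arguments inside the helper means callers do not need to know which construction their input uses.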
Conclusion
Strings that look the same on screen can have different underlying code point sequences, and Python's == compares those sequences, not the rendered text. Normalizing both strings to a common form (NFC or NFD) before comparing them makes the comparison reflect their intended meaning. This matters most when handling internationalized text, where input from different sources may use different constructions for the same characters.