
Unicode over String in Python!!

Ankur Pandey

Full-stack Developer @ Sparrow Interactive | BTech, Python, React.js, Django

Published: July 19, 2020

Python is a widely used, interpreted, object-oriented, general-purpose, high-level programming language with dynamic semantics. First released by Guido van Rossum on February 20, 1991, it has since become one of the most popular languages for building products. Applications are often internationalized to display messages and output in a variety of user-selectable languages; the same program might need to output a message in English, French, Japanese, Hebrew, or Russian. So today’s programs need to handle a wide variety of characters, and for that Python’s string type uses the Unicode Standard to represent characters.

Unicode is a universal character encoding standard maintained by the Unicode Consortium, a 501(c)(3) non-profit organization that publishes the Unicode Standard. It aims to list every character used in human languages so that users in every region can work with text in their native language. The Unicode specifications are continually revised and updated to add new languages and symbols. As the Unicode Consortium itself says:

Everyone in the world should be able to use their own language on phones and computers.

Characters that look the same can still be distinct: the Roman numeral one 'Ⅰ' is different from the uppercase letter 'I' of the English alphabet, although they look alike. To tell such characters apart, Unicode assigns each one a code point, and this provides the basis for processing, storing, and interchanging text data in any language in all modern software. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values, of which well over 100 thousand have been assigned so far). On a screen or on paper, a character is represented by a set of graphical elements called a glyph. In typography, a glyph is an elemental symbol within an agreed set of symbols, intended to represent a readable character for writing.
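To make the idea of code points concrete, here is a minimal sketch using Python's built-in `ord()` and `chr()` together with the standard `unicodedata` module. The variable names `latin_i` and `roman_one` are made up for this illustration.

```python
import unicodedata

# Two characters that look alike but have different code points.
latin_i = "I"          # the uppercase English letter
roman_one = "\u2160"   # the Roman numeral one

for ch in (latin_i, roman_one):
    # ord() returns the code point as an integer; unicodedata.name()
    # returns the official Unicode name of the character.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two strings are not equal, even if a font draws them the same way.
print(latin_i == roman_one)

# chr() goes the other way: from a code point back to a character.
print(chr(0x2160) == roman_one)
```

The glyph you see depends on the font, but the comparison depends only on the code points.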

The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding. UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit values are used in the encoding. There are also UTF-16 and UTF-32 encodings.
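As a rough illustration of how the three encodings differ, the sketch below encodes one short string with each of them and compares the byte counts. The sample text is an arbitrary choice for this example, and the `-le` (little-endian) codec names are used so that no byte order mark is added to the output.

```python
# 'A', 'é' (U+00E9) and an emoji (U+1F60E), written with escapes so the
# example does not depend on the encoding of this file.
sample = "A\u00e9\U0001F60E"

sizes = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = sample.encode(encoding)           # str -> bytes
    sizes[encoding] = len(data)
    assert data.decode(encoding) == sample   # bytes -> str round trip
    print(encoding, len(data), data.hex())

print(len(sample))  # number of code points, independent of the encoding
```

UTF-8 uses one to four bytes per code point, UTF-16 uses two or four, and UTF-32 always uses four, which is why the same three characters come out at different lengths.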

# Unicode string (Python 3)

>>> string = '😎Ankur_Pandey😊🥺😉'

# encode it to UTF-8 bytes

>>> string.encode('utf-8')
b'\xf0\x9f\x98\x8eAnkur_Pandey\xf0\x9f\x98\x8a\xf0\x9f\xa5\xba\xf0\x9f\x98\x89'

In Python 2.1, Unicode literals could only be written using the Latin-1 based encoding "unicode-escape". This made the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales, such as many Asian countries. To remove this restriction, PEP 263 introduced a syntax for declaring the encoding of a Python source file. Adding the following line at the top of your .py file lets you write non-ASCII string literals directly in your script. In Python 3 the default source encoding is already UTF-8, so the declaration is needed only for Python 2 or for files saved in another encoding.

# -*- coding: utf-8 -*-

utfstr = u"ボールト"
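A small sketch of what that literal looks like from the inside, assuming Python 3.3 or later, where the `u` prefix is accepted but optional. The names `plain` and `encoded` are made up for this illustration.

```python
# -*- coding: utf-8 -*-

utfstr = u"ボールト"   # literal with the u prefix, as in the article
plain = "ボールト"     # the same literal without the prefix

# In Python 3 both literals produce the same str object type and value.
print(utfstr == plain)

# len() on a str counts code points, not bytes.
print(len(utfstr))

# Encoding to UTF-8 gives the bytes that are actually stored or sent.
encoded = utfstr.encode("utf-8")
print(len(encoded))

# Decoding with the same encoding restores the original string.
print(encoded.decode("utf-8") == utfstr)
```

Each katakana character here takes three bytes in UTF-8, so the byte length is three times the character count.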

So now we can produce our output in any language. Thanks for reading; I hope it added some value for you.
