Unicode over String in Python!!
Ankur Pandey
Full-stack Developer @ Sparrow Interactive | BTech, Python, React.js, Django
Python is a widely used, interpreted, object-oriented, general-purpose, high-level programming language with dynamic semantics. First released by Guido van Rossum on February 20, 1991, it has since become hugely popular, and many teams now build their products with it. Applications are often internationalized to display messages and output in a variety of user-selectable languages; the same program might need to output a message in English, French, Japanese, Hebrew, or Russian. So today's programs need to be able to handle a wide variety of characters, and for that Python's string type (str in Python 3) uses the Unicode Standard to represent characters.
Unicode is the universal character encoding standard, maintained by the Unicode Consortium (a 501(c)(3) non-profit organization that publishes the Unicode Standard). It aims to list every character used in human languages so that users from every region can work with text in their native language. The Unicode specifications are continually revised and updated to add new languages and symbols. As the Unicode Consortium itself says:
Everyone in the world should be able to use their own language on phones and computers.
Characters that look the same can still be different characters: the Roman numeral one 'Ⅰ' is distinct from the uppercase English letter 'I', although they look alike. To tell them apart, Unicode assigns every character a code point, and this provides the basis for processing, storing, and interchanging text in any language in all modern software. A code point is an integer in the range 0 to 0x10FFFF (about 1.1 million possible values, of which roughly 150 thousand have been assigned to characters so far). On a screen or on paper, a character is represented by a set of graphical elements called a glyph. In typography, a glyph is an elemental symbol within an agreed set of symbols, intended to represent a readable character for writing.
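To make the idea of code points concrete, here is a minimal sketch using only the standard library. The variable names are made up for illustration; `ord()`, `chr()`, and `unicodedata.name()` are Python built-ins and standard-library calls.

```python
import unicodedata

letter_i = "I"        # the uppercase English letter
roman_one = "\u2160"  # the Roman numeral one, looks like "I" in many fonts

for ch in (letter_i, roman_one):
    # ord() returns the code point as an integer; chr() is its inverse.
    code_point = ord(ch)
    print(f"U+{code_point:04X}", unicodedata.name(ch))
    assert chr(code_point) == ch

# Same look, different code points, so the strings are not equal.
print(letter_i == roman_one)

# 0x10FFFF is the highest code point chr() accepts.
largest = chr(0x10FFFF)
```

The two characters share a glyph in many fonts, but Python compares them by code point, so they are treated as different characters.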
The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding. UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit values are used in the encoding, so a single character takes between one and four bytes. There are also UTF-16 and UTF-32 encodings.
>>> # Unicode string
>>> string = '😎Ankur_Pandey😊🥺😉'
>>> # encode it to UTF-8 bytes (in Python 3, str.encode() defaults to utf-8)
>>> string.encode('utf-8')
b'\xf0\x9f\x98\x8eAnkur_Pandey\xf0\x9f\x98\x8a\xf0\x9f\xa5\xba\xf0\x9f\x98\x89'
>>>
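As a rough comparison of the three encodings mentioned above, the sketch below encodes one short string in each and prints the byte counts. The sample text is an assumption chosen for illustration; the `-le` codec variants are used so that no byte order mark is added to the output.

```python
text = "\U0001F60EAnkur"  # one emoji followed by five ASCII letters

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

print(len(text))   # 6 code points
print(len(utf8))   # 9 bytes: 4 for the emoji, 1 per ASCII letter
print(len(utf16))  # 14 bytes: 4 for the emoji, 2 per ASCII letter
print(len(utf32))  # 24 bytes: always 4 per code point

# Decoding with the matching encoding gives back the original string.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text
assert utf32.decode("utf-32-le") == text
```

UTF-8 is the most compact here because most of the text is ASCII; for text made up mostly of characters outside ASCII, the balance can shift.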
In Python 2.1, Unicode literals could only be written using the Latin-1 based encoding "unicode-escape". This made the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales, such as many Asian countries. To remove this tie to the "unicode-escape" encoding for Unicode literals, PEP 263 proposed a syntax for declaring the encoding of a Python source file. Adding the following comment on the first or second line of your .py file lets you write non-ASCII strings directly in your script. In Python 3, source files are read as UTF-8 by default, so the declaration is only needed for Python 2 or for files saved in another encoding.
# -*- coding: utf-8 -*-
utfstr = u"ボールト"
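To see how such a declaration is picked up, the following sketch uses `tokenize.detect_encoding()` from the standard library on two in-memory sources. The sample sources and the helper name `source_encoding` are hypothetical and exist only for this illustration.

```python
import io
import tokenize

# Hypothetical source files held in memory as raw bytes.
declared = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
undeclared = "name = 'ボールト'\n".encode("utf-8")

def source_encoding(raw: bytes) -> str:
    """Return the encoding Python would use to decode this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return encoding

print(source_encoding(declared))    # iso-8859-1 (normalized name for latin-1)
print(source_encoding(undeclared))  # utf-8 (the default when nothing is declared)

# Decoding with the detected encoding recovers the intended text.
print(declared.decode(source_encoding(declared)).splitlines()[1])
```

Without the declaration, the first source would not decode as UTF-8, which is the kind of problem the PEP 263 comment is meant to avoid.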
So now we can have our output in any language. Thanks for reading; I hope it added some value for you.