Unicode in Python

Unicode Overview

What is an encoding

An encoding of a character set is a mapping from characters to numbers.

The problem

In the beginning (1963), there was only ASCII. After that, many other character encodings came into use:
Windows-1252
KOI8-R
Windows-1251
many, many others...
And the mess begins: the same byte value stands for a different character in each encoding.
in KOI8-R, the code 209 (0xD1) is 'я'
in Windows-1251, the code 209 (0xD1) is 'С'
in Windows-1252, the code 209 (0xD1) is 'Ñ'
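
The clash above can be reproduced directly in Python. This is a minimal sketch; the codec names `koi8_r`, `cp1251` and `cp1252` are the Python aliases for the three encodings listed, and the variable names are only illustrative.

```python
# One byte with value 209 (0xD1), decoded with three legacy encodings.
raw = bytes([209])

decoded = {name: raw.decode(name) for name in ("koi8_r", "cp1251", "cp1252")}

for name, char in decoded.items():
    # Print the resulting character together with its Unicode code point.
    print(f"{name}: {char} (U+{ord(char):04X})")
# koi8_r: я (U+044F)
# cp1251: С (U+0421)
# cp1252: Ñ (U+00D1)
```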

The Solution

Unicode encompasses virtually all characters used widely in computers today. It is capable of addressing more than 1.1 million code points.
aăя
𝄞

Encoding

A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal).
This sequence needs to be represented in memory as a sequence of bytes (one byte can store values from 0 to (2**8)-1, i.e. 255).
The rules for translating a Unicode string into a sequence of bytes are called an encoding.
UTF-8 is a variable-width character encoding capable of encoding all valid Unicode code points using one to four 8-bit bytes.
It is the most widely used encoding in contemporary software systems.
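
To make "variable width" concrete, the sketch below counts how many bytes UTF-8 needs for a few sample characters. The sample list is an arbitrary choice for illustration, written with escape sequences so the code points are unambiguous.

```python
# Sample characters, from plain ASCII up to a code point outside the BMP.
samples = [
    "a",           # U+0061 LATIN SMALL LETTER A
    "\u0103",      # U+0103 LATIN SMALL LETTER A WITH BREVE
    "\u044f",      # U+044F CYRILLIC SMALL LETTER YA
    "\u2615",      # U+2615 HOT BEVERAGE
    "\U0001D11E",  # U+1D11E MUSICAL SYMBOL G CLEF
]

# Number of bytes each character occupies once encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, widths):
    print(f"U+{ord(ch):04X}: {n} byte(s)")
# U+0061: 1 byte(s)
# U+0103: 2 byte(s)
# U+044F: 2 byte(s)
# U+2615: 3 byte(s)
# U+1D11E: 4 byte(s)
```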

Python’s Unicode Support

Python 3 represents strings (str) as sequences of Unicode code points

The default encoding for Python 3 source code is UTF-8

Python 3 also supports Unicode characters in identifiers


			низ = "This is a normal Python string :ছ 𝄞 ☕"
			print(низ)
			# This is a normal Python string :ছ 𝄞 ☕
		

ord() and chr() functions

ord(char) - returns the integer Unicode code point of the one-character string char.
chr(i) - returns the one-character string whose Unicode code point is the integer i.

			print(ord('я'))
			# 1103
			print(chr(1103))
			# я
		

Unicode symbols in Python strings

You can use Unicode symbols directly in strings, or you can enter them using escape sequences.
Example - various ways to represent the symbol Cyrillic Capital Letter Yat:

			# Unicode symbol in string:
			print("Ѣ")

			# Using the character name:
			print("\N{Cyrillic Capital Letter Yat}")

			# Using a 16-bit hex value code point:
			print("\u0462")

			# Using a 32-bit hex value code point:
			print("\U00000462")
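
As a quick check that these notations all produce the same one-character string, here is a small sketch using the standard `unicodedata` module (the list name `forms` is only illustrative):

```python
import unicodedata

# The same character written with three escape forms and with chr().
forms = [
    "\N{CYRILLIC CAPITAL LETTER YAT}",  # by character name
    "\u0462",                           # \u takes exactly 4 hex digits
    "\U00000462",                       # \U takes exactly 8 hex digits
    chr(0x462),                         # from the integer code point
]

all_equal = all(form == forms[0] for form in forms)
print(all_equal)                    # True
print(len(forms[0]))                # 1
print(unicodedata.name(forms[0]))   # CYRILLIC CAPITAL LETTER YAT
```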
		

Convert String to Bytes

str.encode() - syntax


			str.encode(encoding="utf-8", errors="strict")
		
Return an encoded version of the string as a bytes object
Default encoding is 'utf-8'
List of possible encodings: Standard Encodings
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError.
More on str.encode(): str.encode @python3 docs
More on bytes objects: Bytes objects and Bytearrays
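
The effect of the errors parameter is easiest to see with an encoding that cannot represent the text. The sketch below encodes Cyrillic letters to ASCII; the handlers `ignore`, `replace` and `xmlcharrefreplace` are standard, while the sample text and variable names are only illustrative.

```python
# "123" followed by the Cyrillic letters а, б, в (U+0430..U+0432).
text = "123\u0430\u0431\u0432"

# errors="strict" is the default: unencodable characters raise an exception.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("ascii", errors="ignore")              # drop them
replaced = text.encode("ascii", errors="replace")            # use '?'
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")  # use &#NNNN;

print(strict_failed)  # True
print(ignored)        # b'123'
print(replaced)       # b'123???'
print(xml_refs)       # b'123&#1072;&#1073;&#1074;'
```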

str.encode() - example


			string = "123абв"

			# Encode to bytes; UTF-8 is the default encoding
			str_in_utf = string.encode()

			print("Byte object:", str_in_utf)
			print("Type: ", type(str_in_utf))
			print("Length:", len(str_in_utf))

			# Byte object: b'123\xd0\xb0\xd0\xb1\xd0\xb2'
			# Type:  <class 'bytes'>
			# Length: 9
		

Note that len() of a bytes object returns the number of bytes, not the number of characters encoded!

Convert Bytes to String

bytes.decode() - syntax


			bytes.decode(encoding="utf-8", errors="strict")
		
Return a string decoded from the given bytes
Default encoding is 'utf-8'
List of possible encodings: Standard Encodings
The default for errors is 'strict', meaning that decoding errors raise a UnicodeError (in practice, its subclass UnicodeDecodeError).
More on bytes.decode(): bytes.decode @python3 docs
More on bytes objects: Bytes objects and Bytearrays
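
Decoding can fail when the bytes are not valid in the chosen encoding. A minimal sketch, using a made-up byte string that ends with 0xFF (a byte value that never occurs in valid UTF-8):

```python
data = b"abc\xff"  # the last byte is not valid UTF-8

# errors="strict" is the default: invalid bytes raise an exception.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD (REPLACEMENT CHARACTER) for each bad byte,
# "ignore" silently drops it.
replaced = data.decode("utf-8", errors="replace")
ignored = data.decode("utf-8", errors="ignore")

print(strict_failed)                 # True
print(replaced == "abc\ufffd")       # True
print(ignored)                       # abc
```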

bytes.decode() - example


			str_in_bytes = b'1\xd0\xb02\xd0\xb13\xd0\xb2'

			# Decode to str; UTF-8 is the default encoding
			decoded_str = str_in_bytes.decode()

			print("String object:", decoded_str)
			print("Type: ", type(decoded_str))
			print("Length:", len(decoded_str))

			# String object: 1а2б3в
			# Type:  <class 'str'>
			# Length: 6
		

Note that len() of a str object returns the number of characters (code points), not the number of bytes they were decoded from!
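
Converting text from one encoding to another is simply a decode followed by an encode. The sketch below does this in memory for a single KOI8-R byte; the variable names are illustrative.

```python
koi8_bytes = b"\xd1"  # the letter 'я' as stored in KOI8-R

# Step 1: bytes -> str, using the encoding the bytes were written in.
text = koi8_bytes.decode("koi8_r")

# Step 2: str -> bytes, using the target encoding.
utf8_bytes = text.encode("utf-8")

print(text)                                         # я
print(utf8_bytes)                                   # b'\xd1\x8f'
print(len(text), len(koi8_bytes), len(utf8_bytes))  # 1 1 2
```

The str in the middle is encoding-neutral: only the bytes on either side belong to a particular encoding.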

Resources

Texts

Unicode HOWTO @python docs

Exercises

Task1: guess_the_quotes

The Task

You are given the file quotes.txt, which contains quotes in Cyrillic from a famous writer. However, the file is encoded in KOI8-R.
Write a Python program that converts that file to Unicode, using the UTF-8 encoding.
You will then be able to open and read the text with your favourite editor.

Task2: cp1251_to_utf8

The Task

Imagine you've downloaded a Bulgarian subtitles file, encoded in Windows-1251: Silicon.Valley.sampleBGsubs.srt
But you have to convert it to UTF-8, as your player recognises only Unicode-encoded subtitles.
Write a program, cp1251_to_utf8.py, which receives an input file name as an argument and creates a UTF-8 encoded file with the same name, but with the suffix "_utf8_" added.

Program usage example


			.
			├── cp1251_to_utf8.py
			└── Silicon.Valley.sampleBGsubs.srt
		

			$ python cp1251_to_utf8.py Silicon.Valley.sampleBGsubs.srt
		

			.
			├── cp1251_to_utf8.py
			├── Silicon.Valley.sampleBGsubs.srt
			└── Silicon.Valley.sampleBGsubs_utf8_.srt
		

Make sure that Silicon.Valley.sampleBGsubs_utf8_.srt is properly converted and readable!

Submission

Please prefix your file or archive names with your initials before sending.
For instance: iep_task1.py or iep_tasks.rar
Send files to progressbg.python.course@gmail.com

These slides are based on a customised version of Hakimel's reveal.js framework