Unicode in Python

Unicode Overview

The problem

In the beginning (1963), there was only ASCII. After that, many different character encodings came into use:
Windows-1252
KOI8-R
Windows-1251
many, many others...
And the mess begins: the same byte value stands for a different character in each encoding:
in KOI8-R, byte 209 (0xD1) is 'я'
in Windows-1251, byte 209 (0xD1) is 'С'
in Windows-1252, byte 209 (0xD1) is 'Ñ'
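
The clash can be reproduced in Python 3. The following is a minimal sketch that decodes the single byte 209 with each of the three code pages; `koi8_r`, `cp1251` and `cp1252` are the standard-library codec names, while the variable names are only illustrative.

```python
# One byte, three interpretations: the byte value alone does not say
# which character it stands for; the encoding does.
raw = bytes([209])  # the single byte 0xD1

decoded = {name: raw.decode(name) for name in ("koi8_r", "cp1251", "cp1252")}

for name, char in decoded.items():
    print(f"{name}: {char}")
```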

Unicode - The Solution

Unicode encompasses virtually all characters used widely in computers today. As of version 11.0, Unicode contains a repertoire of over 137,000 characters covering 146 modern and historic scripts, as well as multiple symbol sets. Even Emoji!
List of Unicode characters @wikipedia
aăя
𝄞

Encoding

A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal).
This sequence needs to be represented in memory as a sequence of bytes (1 byte can store values from 0 to 255, i.e. 2**8 - 1).
The rules for translating a Unicode string into a sequence of bytes are called an encoding.
UTF-8 is a variable-width character encoding capable of encoding all valid Unicode code points using one to four 8-bit bytes.
It is the most widely used encoding in contemporary software systems.
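
To make "variable width" concrete, here is a small sketch that counts the UTF-8 bytes needed for a few sample characters. The sample list is an assumption chosen for illustration (the euro sign does not appear elsewhere in these slides).

```python
# Number of bytes UTF-8 needs for characters from different code point ranges.
samples = ["a", "ă", "я", "€", "𝄞"]

byte_lengths = {}
for char in samples:
    encoded = char.encode("utf-8")
    byte_lengths[char] = len(encoded)
    print(f"U+{ord(char):04X} {char!r}: {len(encoded)} byte(s) -> {encoded.hex()}")
```

ASCII characters take one byte each, which is why plain English text looks identical in ASCII and UTF-8.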

Python’s Unicode Support

Python 3 represents strings (str) as sequences of Unicode code points

The default encoding for Python 3 source code is UTF-8

Python 3 also supports Unicode characters in identifiers


			низ = "This is a normal Python string :ছ 𝄞 ☕"
			print(низ)
			# This is a normal Python string :ছ 𝄞 ☕
		

ord() and chr() functions

ord(char) - returns the integer Unicode code point of the one-character string char.
chr(i) - returns the one-character string whose Unicode code point is the integer i.

			print( ord('я') )
			# 1103
			print( chr(1103) )
			# я
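
As a supplementary sketch (not part of the original slides), the same pair of functions can be combined with the standard `unicodedata` module to print a code point in the usual `U+XXXX` notation, and to check the valid range of `chr()`. The variable names are illustrative.

```python
import unicodedata

char = "я"
code_point = ord(char)                 # 1103

# Code points are conventionally written in hexadecimal with a "U+" prefix.
label = f"U+{code_point:04X}"
name = unicodedata.name(char)
print(label, name)

# chr() is the inverse of ord() ...
round_trip = chr(code_point)

# ... and only accepts values in the Unicode range 0 .. 0x10FFFF.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```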
		

Unicode symbols in Python strings

You can use Unicode symbols directly in strings, or you can enter them using escape sequences.
Example - various ways to write the symbol Cyrillic Capital Letter Yat:

			# Unicode symbol in string:
			print("Ѣ")

			# Using the character name:
			print("\N{Cyrillic Capital Letter Yat}")

			# Using a 16-bit hex value code point:
			print("\u0462")

			# Using a 32-bit hex value code point:
			print("\U00000462")
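
A quick way to convince yourself that all four spellings produce the same one-character string is to compare them. This is a minimal sketch; `unicodedata.name()` is used only to show the official character name.

```python
import unicodedata

variants = [
    "Ѣ",                                  # literal character
    "\N{CYRILLIC CAPITAL LETTER YAT}",    # by name
    "\u0462",                             # 4 hex digits
    "\U00000462",                         # 8 hex digits
]

all_equal = len(set(variants)) == 1
official_name = unicodedata.name(variants[0])
print(all_equal, official_name)
```

The 8-digit `\U` form is required for code points above U+FFFF, such as 𝄞 (`"\U0001D11E"`).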
		

Encode-Decode

Encode-Decode Flow

encode() - convert String to Bytes

str.encode() - syntax


			str.encode(encoding="utf-8", errors="strict")
		
Return an encoded version of the string as a bytes object
Default encoding is 'utf-8'
List of possible encodings: Standard Encodings
The default for errors is 'strict', meaning that encoding errors raise a UnicodeEncodeError (a subclass of UnicodeError).
More on str.encode(): str.encode @python3 docs
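
The effect of the errors parameter is easiest to see with a target encoding that cannot represent the text. The sketch below encodes the sample string to ASCII with a few of the standard error handlers; the variable names are illustrative.

```python
text = "123абв"

# errors="strict" (the default): characters outside ASCII raise an exception
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Other handlers trade correctness for robustness
replaced = text.encode("ascii", errors="replace")           # '?' per character
ignored = text.encode("ascii", errors="ignore")             # characters dropped
escaped = text.encode("ascii", errors="backslashreplace")   # \uXXXX escapes

print(strict_failed, replaced, ignored, escaped)
```

Note that 'replace' and 'ignore' lose information, so the original string can no longer be recovered from the bytes.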

str.encode() - example


			string = "123абв"

			str_in_utf = string.encode()

			print("Byte object:", str_in_utf)
			print("Type: ", type(str_in_utf))
			print("Length:", len(str_in_utf))

			# Byte object: b'123\xd0\xb0\xd0\xb1\xd0\xb2'
			# Type:  <class 'bytes'>
			# Length: 9
		

Note that len() of a bytes object returns the number of bytes, not the number of characters encoded!

The bytes Object and bytestring

The bytes object represents an immutable sequence of integers in the range 0 <= x < 256.
A bytestring is a raw sequence of bytes, which can be read from or written to memory directly.
bytestring vs str in Python 3:
a bytestring represents an immutable sequence of bytes, without implying any particular interpretation
str represents an immutable sequence of Unicode code points, without implying any particular binary encoding
More on bytes objects: Bytes objects and Bytearrays
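
The "sequence of integers" view can be checked directly. The following sketch uses the two UTF-8 bytes of 'я' as sample data and `bytearray` as the mutable counterpart; the variable names are illustrative.

```python
data = "я".encode("utf-8")
byte_values = list(data)      # iterating yields integers in range(256)
first = data[0]               # indexing yields an int as well

try:
    data[0] = 0               # item assignment is not supported
    is_immutable = False
except TypeError:
    is_immutable = True

# bytearray is the mutable counterpart of bytes
mutable = bytearray(data)
mutable[0] = 0x41

print(byte_values, first, is_immutable, bytes(mutable))
```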

decode() - convert Bytes to String

bytes.decode() - syntax


			bytes.decode(encoding="utf-8", errors="strict")
		
Return a string decoded from the given bytes
Default encoding is 'utf-8'
List of possible encodings: Standard Encodings
The default for errors is 'strict', meaning that decoding errors raise a UnicodeDecodeError (a subclass of UnicodeError).
More on bytes.decode(): bytes.decode @python3 docs
More on bytes objects: Bytes objects and Bytearrays
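
Decoding can go wrong in two different ways: the bytes may be invalid for the chosen encoding (an exception under 'strict'), or they may be valid but mean something else (silent garbage). A minimal sketch of both cases, with illustrative variable names:

```python
encoded = "123абв".encode("utf-8")

# Case 1: invalid input. Cut the data in the middle of the 2-byte sequence for "в".
truncated = encoded[:-1]
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD (REPLACEMENT CHARACTER) for the bad bytes
lenient = truncated.decode("utf-8", errors="replace")

# Case 2: wrong encoding. The bytes are valid cp1251, so no error is raised,
# but every byte becomes a separate, unintended character.
wrong = encoded.decode("cp1251")

print(strict_failed, lenient, wrong)
```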

bytes.decode() - example


			str_in_bytes =  b'123\xd0\xb0\xd0\xb1\xd0\xb2'
			str_in_utf8 = str_in_bytes.decode()

			print("String object:", str_in_utf8)
			print("Type: ", type(str_in_utf8))
			print("Length:", len(str_in_utf8))

			# String object: 123абв
			# Type:  <class 'str'>
			# Length: 6
		

Note that len() of a string object returns the number of characters, not the number of bytes decoded!

Example use-cases

encode/write examples

The next two examples achieve the same goal: saving a Python string to a text file encoded as 'cp1251'.
Though the write() approach is more concise and clearer, the encode() approach is shown as well, since it is useful in other cases.

Write a Python string into a file, encoded to cp1251 bytes - using str.encode()


			string = "АБВ 123 ИЙК ЩЮЯ"

			# open a file handle for writing in binary mode
			with open("encode_to_cp1251.txt", "wb") as fh:
			  bytes_sequence = string.encode(encoding="cp1251")
			  fh.write(bytes_sequence)

			# now open encode_to_cp1251.txt in your editor, using encoding cp1251 (windows-1251)
		

Usually, for such a task, you would use open() in text mode with the encoding option, as shown on the next slide.

Write a Python string into a file, encoded to cp1251 bytes - using open(filename, mode, encoding)


			string = "АБВ 123 ИЙК ЩЮЯ"

			# open a file for writing in text mode, with encoding="cp1251"
			with open("write_to_cp1251.txt", "w", encoding="cp1251") as fh:
			  fh.write(string)

			# now open write_to_cp1251.txt in your editor, using encoding cp1251 (windows-1251)
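
To check that both approaches really produce the same file content, the two outputs can be compared byte for byte. This is a sketch under stated assumptions: it writes into a temporary directory instead of the working directory, and the file names `binary_mode.txt` and `text_mode.txt` are hypothetical.

```python
import tempfile
from pathlib import Path

text = "АБВ 123 ИЙК ЩЮЯ"

with tempfile.TemporaryDirectory() as tmp:
    binary_path = Path(tmp) / "binary_mode.txt"
    text_path = Path(tmp) / "text_mode.txt"

    # Approach 1: encode manually, then write the bytes in binary mode
    with open(binary_path, "wb") as fh:
        fh.write(text.encode("cp1251"))

    # Approach 2: let open() do the encoding while writing in text mode
    with open(text_path, "w", encoding="cp1251") as fh:
        fh.write(text)

    same_bytes = binary_path.read_bytes() == text_path.read_bytes()
    size = len(binary_path.read_bytes())

print(same_bytes, size)
```

The sample text contains no line breaks. With line breaks, text mode may additionally translate '\n' to the platform's line ending, so the two files could then differ on Windows.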
		

decode/read examples

The next two examples achieve the same goal: reading from a text file encoded as 'cp1251'.
Though the read() approach is more concise and clearer, the decode() approach is shown as well, since it is useful in other cases.

Read a cp1251 encoded file into a Python string - using bytes.decode()


			win1251file = "encode_to_cp1251.txt"

			# open a file handle for reading in binary mode
			with open(win1251file, "rb") as f:
			    bytestring = f.read()

			    decoded_string = bytestring.decode(encoding="cp1251")
			    print(decoded_string)

			    # АБВ 123 ИЙК ЩЮЯ
		

Read a cp1251 encoded file into a Python string - using open(filename, mode, encoding)


			win1251file = "write_to_cp1251.txt"

			# open a file handle for reading in text mode, with encoding="cp1251"
			with open(win1251file, "r", encoding="cp1251") as f:
			    print(f.read())
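
It is also worth seeing what happens when the encoding argument does not match the file. The sketch below creates a small cp1251 file in a temporary directory (the file name `sample_cp1251.txt` is hypothetical) and then reads it back three ways.

```python
import tempfile
from pathlib import Path

original = "АБВ 123 ИЙК ЩЮЯ"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_cp1251.txt"
    path.write_bytes(original.encode("cp1251"))

    # Wrong encoding, strict error handling: the read fails
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        utf8_read_failed = False
    except UnicodeDecodeError:
        utf8_read_failed = True

    # Wrong encoding, errors="replace": no exception, but the text is damaged
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        lossy = f.read()

    # Correct encoding: the original text comes back
    with open(path, "r", encoding="cp1251") as f:
        correct = f.read()

print(utf8_read_failed, lossy, correct)
```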
		

base64.b64encode demo


			import base64

			passwd = 'abracadabra'

			# base64 needs a byte string
			# encoded = base64.b64encode(b'data to be encoded')

			passwd_bytes = passwd.encode()
			passwd_b64 = base64.b64encode(passwd_bytes)

			print(f'passwd_b64: {passwd_b64}')
			# passwd_b64: b'YWJyYWNhZGFicmE='
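
The reverse direction follows the same bytes/str discipline. A minimal sketch, assuming the Base64 value should end up as a str and then be turned back into the original text:

```python
import base64

passwd_b64 = base64.b64encode("abracadabra".encode())

# b64encode() returns bytes; Base64 output is pure ASCII, so it decodes safely
passwd_b64_str = passwd_b64.decode("ascii")

# Reverse: Base64 text -> original bytes -> original string
restored_bytes = base64.b64decode(passwd_b64_str)
restored = restored_bytes.decode("utf-8")

print(passwd_b64_str, restored)
```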
		

Resources

Texts

Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Unicode HOWTO @python docs

Exercises

Task1: guess_the_quotes

The Task

Given is the file quotes.txt, containing quotes in Cyrillic from a famous writer. However, the file is encoded in KOI8-R.
Write a Python program that converts that file to Unicode, using the UTF-8 encoding.
Now you'll be able to open and read the text with your favourite editor.

Task2: cp1251_to_utf8

The Task

Imagine you've downloaded a Bulgarian subtitles file, encoded in Windows-1251: Silicon.Valley.sampleBGsubs.srt
But you have to convert it to UTF-8, as your player recognises only Unicode-encoded subtitles.
Write a program, cp1251_to_utf8.py, which receives an input file name as an argument and creates a UTF-8 encoded file with the same name, but with the suffix "_utf8_" added.

Program usage example


			.
			├── cp1251_to_utf8.py
			└── Silicon.Valley.sampleBGsubs.srt
		

			$ python cp1251_to_utf8.py Silicon.Valley.sampleBGsubs.srt
		

			.
			├── cp1251_to_utf8.py
			├── Silicon.Valley.sampleBGsubs.srt
			└── Silicon.Valley.sampleBGsubs_utf8_.srt
		

Make sure that Silicon.Valley.sampleBGsubs_utf8_.srt is properly converted and readable!

Submission

Please prefix your filenames/archive with your initials before sending.
For instance: iep_task1.py or iep_tasks.rar
Send files to progressbg.python.course@gmail.com

These slides are based on a customised version of Hakimel's reveal.js framework