Unicode in Python

Created for

Iva E. Popova, 2019,

Iva E. Popova on LinkedIn

Unicode Overview

The problem

In the beginning (1963), there was only ASCII. After that, many mutually incompatible character encodings came into use: Windows-1252; KOI8-R; Windows-1251; and many, many others...
And the mess begins: the same byte value stands for a different character in each encoding. In KOI8-R, the code 209 is 'я'; in Windows-1251, the code 209 is 'С'; in Windows-1252, the code 209 is 'Ñ'
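
The ambiguity is easy to reproduce. A minimal sketch, assuming Python's standard codec names `koi8_r`, `cp1251` and `cp1252`; the variable names are only illustrative:

```python
# One and the same byte value, interpreted under three legacy encodings.
raw = bytes([209])  # a single byte with the value 209 (0xD1)

decoded = {name: raw.decode(name) for name in ("koi8_r", "cp1251", "cp1252")}

for name, char in decoded.items():
    print(f"{name}: {char}")
```

Without knowing which encoding produced the byte, there is no way to tell which of the three characters was meant.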

Unicode - The Solution

Unicode encompasses virtually all characters used widely in computers today. As of version 11.0, Unicode contains a repertoire of over 137,000 characters covering 146 modern and historic scripts, as well as multiple symbol sets. Even Emoji!
List of Unicode characters @wikipedia

☕

aăя

ছ𝄞

Encoding

A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal)
This sequence needs to be represented in memory as a sequence of bytes (1 byte can store values from 0 to (2**8)-1, i.e. 0 to 255)
The rules for translating a Unicode string into a sequence of bytes are called an encoding
UTF-8 is a variable-width character encoding capable of encoding all valid code points in Unicode using one to four 8-bit bytes. It is the most popular encoding in contemporary software systems.
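
To make "variable width" concrete, here is a small sketch that counts the UTF-8 bytes for four sample characters taken from different Unicode ranges. The sample list and the `utf8_lengths` name are just illustrative choices:

```python
# Number of UTF-8 bytes needed for characters from different Unicode ranges.
samples = ["a", "я", "ছ", "𝄞"]

utf8_lengths = {}
for char in samples:
    encoded = char.encode("utf-8")
    utf8_lengths[char] = len(encoded)
    print(f"U+{ord(char):04X}: {len(encoded)} byte(s), hex {encoded.hex()}")
```

ASCII characters take one byte, so any pure ASCII text is already valid UTF-8.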

Python’s Unicode Support

Python 3 represents strings as sequences of Unicode code points

The default encoding for Python 3 source code is UTF-8

Python 3 also supports Unicode characters in identifiers


			низ = "This is a normal Python string :ছ 𝄞 ☕"
			print(низ)
			# This is a normal Python string :ছ 𝄞 ☕

`ord()` and `chr()` functions

ord(char) - returns an integer representing the Unicode code point of the given character char
chr(i) - returns the string representing the character whose Unicode code point is the integer i


			print( ord('я') )
			# 1103
			print( chr(1103) )
			# я

Unicode symbols in Python strings

You can use Unicode symbols directly in strings, or you can enter them using escape sequences
Example: various ways to represent the symbol Cyrillic Capital Letter Yat


			# Unicode symbol in string:
			print("Ѣ")

			# Using the character name:
			print("\N{Cyrillic Capital Letter Yat}")

			# Using a 16-bit hex value code point:
			print("\u0462")

			# Using a 32-bit hex value code point:
			print("\U00000462")
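
All four spellings produce the same one-character string. A short sketch that checks this and uses the standard `unicodedata` module to look up the official character name, which is the name accepted inside `\N{...}`:

```python
import unicodedata

# Four spellings of the same one-character string.
literal = "Ѣ"
by_name = "\N{Cyrillic Capital Letter Yat}"
by_u16 = "\u0462"
by_u32 = "\U00000462"

print(literal == by_name == by_u16 == by_u32)

# unicodedata.name() returns the official name of a character.
yat_name = unicodedata.name(by_u16)
print(yat_name)
```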

Encode-Decode

Encode-Decode Flow

encode() - convert String to Bytes

convert String to Bytes

`str.encode()` - syntax


			str.encode(encoding="utf-8", errors="strict")

Return an encoded version of the string as a bytes object
Default encoding is 'utf-8'
List of possible encodings: Standard Encodings
The default for errors is 'strict', meaning that encoding errors raise a UnicodeEncodeError (a subclass of UnicodeError).
More on str.encode(): str.encode @python3 docs
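
The effect of the errors argument is easiest to see with an encoding that cannot represent the text. A minimal sketch using ASCII as the target and three of the standard error handlers besides 'strict'; the variable names are illustrative:

```python
text = "123абв"

# 'strict' (the default) raises when a character has no ASCII representation.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

# Other standard handlers drop or substitute the offending characters.
ignored = text.encode("ascii", errors="ignore")
replaced = text.encode("ascii", errors="replace")
escaped = text.encode("ascii", errors="backslashreplace")

print(ignored, replaced, escaped)
```

Note that 'ignore' and 'replace' lose information: the original string cannot be recovered from the result.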

`str.encode()` - example


			string = "123абв"

			str_in_utf = string.encode()

			print("Byte object:", str_in_utf)
			print("Type: ",type(str_in_utf) )
			print("Length:",len(str_in_utf) )

			#Byte object: b'123\xd0\xb0\xd0\xb1\xd0\xb2'
			#Type:  <class 'bytes'>
			#Length: 9

Note that len() of a bytes object returns the number of bytes, not the number of characters encoded!

The `bytes` Object and `bytestring`

The bytes object represents an immutable sequence of integers in the range 0 <= x < 256.
A bytestring is a raw sequence of bytes, which can be read from or written to memory directly.
bytestring vs str in Python 3: a bytestring represents an immutable sequence of bytes, without implying any particular interpretation; str represents an immutable sequence of Unicode code points, without implying any particular binary encoding
More on bytes objects: Bytes objects and Bytearrays
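
A short sketch of what "immutable sequence of integers" means in practice, with `bytearray` shown as the mutable counterpart. The variable names are illustrative:

```python
data = "я".encode("utf-8")

# Indexing or iterating a bytes object yields integers in range(256).
byte_values = list(data)
print(byte_values)

# bytes is immutable: item assignment raises TypeError.
try:
    data[0] = 0
    mutated = True
except TypeError:
    mutated = False

# bytearray is the mutable counterpart.
buffer = bytearray(data)
buffer[0] = 0x41
print(buffer)
```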

decode() - convert Bytes to String

convert Bytes to String

`bytes.decode()` - syntax


			bytes.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes
Default encoding is 'utf-8'
List of possible encodings: Standard Encodings
The default for errors is 'strict', meaning that decoding errors raise a UnicodeDecodeError (a subclass of UnicodeError).
More on bytes.decode(): bytes.decode @python3 docs
More on bytes objects: Bytes objects and Bytearrays
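
Decoding can go wrong in two different ways: the bytes may be invalid for the chosen encoding, or they may be valid but mean something else. A minimal sketch of both cases, with illustrative variable names:

```python
data = "абв".encode("utf-8")

# Decoding with the wrong encoding does not necessarily fail; it may
# silently produce the wrong characters (often called mojibake).
wrong = data.decode("cp1251")
print(wrong)

# Cutting the data in the middle of a two-byte character makes it invalid UTF-8.
broken = data[:-1]
try:
    broken.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable byte.
repaired = broken.decode("utf-8", errors="replace")
print(repaired)
```

The first case is the more dangerous one, because no exception signals that the result is wrong.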

`bytes.decode()` - example


			str_in_bytes =  b'123\xd0\xb0\xd0\xb1\xd0\xb2'
			str_in_utf8 = str_in_bytes.decode()

			print("String object:", str_in_utf8)
			print("Type: ",type(str_in_utf8) )
			print("Length:",len(str_in_utf8) )

			# String object: 123абв
			# Type:  <class 'str'>
			# Length: 6

Note that len() of a string object returns the number of characters (code points), not the number of bytes decoded!

Example use-cases

encode/write examples

The next two examples achieve the same goal: saving a Python string into a text file, encoded as 'cp1251'
Though the write() method is more concise and clearer, I'm showing the encode() method as well, as it can be useful in other use cases.

Write a Python string into file, encoded to cp1251 bytes - using `str.encode()`


			string = "АБВ 123 ИЙК ЩЮЯ"

			# open a file handle for writing in binary mode
			with open("encode_to_cp1251.txt", "wb") as fh:
			    bytes_sequence = string.encode(encoding="cp1251")
			    fh.write(bytes_sequence)

			# now re-open the file encode_to_cp1251.txt with encoding cp1251 (windows-1251)

Usually, for such a task, you would use open() in text mode with the encoding option, as shown on the next slide

Write a Python string into file, encoded to cp1251 bytes - using `open(filename, mode, encoding)`


			string = "АБВ 123 ИЙК ЩЮЯ"

			# open a file for writing in text mode, with encoding="cp1251"
			with open("write_to_cp1251.txt", "w", encoding="cp1251") as fh:
			    fh.write(string)

			# now re-open the file write_to_cp1251.txt with encoding cp1251 (windows-1251)

decode/read examples

The next two examples achieve the same goal: reading from a text file encoded as 'cp1251'
Though the read() method is more concise and clearer, I'm showing the decode() method as well, as it can be useful in other use cases.

Read a cp1251 encoded file into Python string - using `bytes.decode()`


			win1251file = "encode_to_cp1251.txt"

			# open a file handle for reading in binary mode
			with open(win1251file, "rb") as f:
			    bytestring = f.read()

			    # decode the raw bytes into a str
			    decoded_string = bytestring.decode(encoding="cp1251")
			    print(decoded_string)

			    # АБВ 123 ИЙК ЩЮЯ

Read a cp1251 encoded file into Python string - using `open(filename, mode, encoding)`


			win1251file = "write_to_cp1251.txt"

			# open a file handle for reading in text mode, with encoding="cp1251"
			with open(win1251file, "r", encoding="cp1251") as f:
			    print(f.read())
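
The write and read examples can be combined into one self-contained round trip. This is only a sketch: it uses a temporary directory and a made-up file name (`roundtrip_cp1251.txt`) so that it leaves nothing behind, and it also checks the file size on disk:

```python
import os
import tempfile

string = "АБВ 123 ИЙК ЩЮЯ"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip_cp1251.txt")

    # Text mode: open() encodes the string on write ...
    with open(path, "w", encoding="cp1251") as fh:
        fh.write(string)

    # cp1251 is a single-byte encoding, so the file holds one byte per character.
    size_in_bytes = os.path.getsize(path)

    # ... and decodes the bytes on read.
    with open(path, "r", encoding="cp1251") as fh:
        restored = fh.read()

print(size_in_bytes, len(string), restored == string)
```

Passing encoding explicitly on both sides matters: when it is omitted, open() falls back to a platform-dependent default.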

base64.b64encode demo

base64.b64encode


			import base64

			passwd = 'abracadabra'

			# base64 needs a byte string
			# encoded = base64.b64encode(b'data to be encoded')

			passwd_bytes = passwd.encode()
			passwd_b64 = base64.b64encode(passwd_bytes)

			print(f'passwd_b64: {passwd_b64}')
			# passwd_b64: b'YWJyYWNhZGFicmE='
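
Going back is the mirror image: `base64.b64decode()` returns bytes, which then have to be decoded into a str. A minimal sketch, starting from the value printed above:

```python
import base64

passwd_b64 = b"YWJyYWNhZGFicmE="

# b64decode returns a bytes object; decode() turns it back into a str.
passwd_bytes = base64.b64decode(passwd_b64)
passwd = passwd_bytes.decode()

print(passwd)
# abracadabra
```

Keep in mind that Base64 is an encoding, not encryption: anyone can reverse it, so it does not protect a password.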

Resources

Texts

Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Unicode HOWTO @python docs

Exercises

Task1: guess_the_quotes

The Task

You are given the file quotes.txt, containing quotes in Cyrillic from a famous writer. However, the file is encoded in KOI8-R
Write a Python program that converts that file to Unicode, using the UTF-8 encoding.
Now you'll be able to open and read the text with your favourite editor

Task2: cp1251_to_utf8

The Task

Imagine you've downloaded a Bulgarian subtitles file, encoded in Windows-1251: Silicon.Valley.sampleBGsubs.srt
But you have to convert it to UTF-8, as your player recognises only Unicode-encoded subtitles
Write a program, cp1251_to_utf8.py, which receives an input file name as an argument and creates a UTF-8 encoded file with the same name, but with the suffix "_utf8_" added.

Program usage example


			.
			├── cp1251_to_utf8.py
			└── Silicon.Valley.sampleBGsubs.srt


			$ python cp1251_to_utf8.py Silicon.Valley.sampleBGsubs.srt


			.
			├── cp1251_to_utf8.py
			├── Silicon.Valley.sampleBGsubs.srt
			└── Silicon.Valley.sampleBGsubs_utf8_.srt

Make sure that Silicon.Valley.sampleBGsubs_utf8_.srt is properly converted and readable!

Submission

Please prefix your filenames/archive with your initials before sending. For instance: iep_task1.py or iep_tasks.rar
Send files to progressbg.python.course@gmail.com

These slides are based on a customised version of Hakimel's reveal.js framework
