Unicode in Python

Iva E. Popova, 2016-2025

What does encoding mean?

To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example: to store music, you must first encode it using MP3, WAV, etc.; to store a picture, you must first encode it using PNG, JPEG, etc.; to store text, you must first encode it using ASCII, UTF-8, etc.

Unicode Overview

The problem

In the beginning (1963), there was only ASCII. After that, a whole range of character encodings came into use: Windows-1252; KOI8-R; Windows-1251; and many, many others.
And so the mess began, because the same byte value means different characters in different encodings: in KOI8-R, the code 209 is 'я'; in Windows-1251, the code 209 is 'С'; in Windows-1252, the code 209 is 'Ñ'.
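The clash described above can be reproduced with a short sketch. It decodes the single byte 209 with three of Python's standard codecs; the variable names are only illustrative.

```python
# One byte value, interpreted through three different legacy code pages.
raw = bytes([209])  # the single byte 0xD1

# Decode the same byte with each codec and remember the result.
decoded = {name: raw.decode(name) for name in ("koi8_r", "cp1251", "cp1252")}

for name, char in decoded.items():
    print(f"{name}: {char}")
```

The byte itself never changes. Only the table used to interpret it does.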

Unicode - The Solution

Unicode encompasses virtually all characters used widely in computers today.
As of Unicode version 15.1, there are 149,878 characters with code points, covering 161 modern and historical scripts, as well as multiple symbol sets.
Even emojis 😂!
Reference: List of Unicode characters @wikipedia

☕

a ă я

ছ𝄞

Unicode Encoding

A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal).
In memory, this sequence needs to be represented as a sequence of bytes (1 byte can store values from 0 to (2**8)-1, i.e. 255).
The rules for translating a Unicode string into a sequence of bytes are called an encoding.
There are various Unicode encodings (UTF-8, UTF-16, UTF-32 and others), but the most widely used is UTF-8, a variable-width character encoding capable of encoding all valid code points in Unicode using one to four bytes.
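To make "variable width" concrete, here is a small sketch that encodes four sample characters from different parts of the Unicode range and counts the bytes UTF-8 needs for each. The choice of sample characters is arbitrary.

```python
# Characters chosen so that each needs a different number of UTF-8 bytes.
samples = ["a", "я", "☕", "𝄞"]

widths = {}
for char in samples:
    encoded = char.encode("utf-8")
    widths[char] = len(encoded)
    # Show the code point, the byte count and the bytes in hex.
    print(f"U+{ord(char):04X} {char!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
```

The higher the code point, the more bytes UTF-8 uses, while plain ASCII characters still take a single byte.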

Python’s Unicode Support

In Python 3, str objects are sequences of Unicode code points, and UTF-8 is the default encoding for str.encode() and bytes.decode().
UTF-8 is also the default encoding for Python 3 source code.
Python 3 also supports some Unicode characters in identifiers.
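As a rough illustration of Unicode identifiers, the sketch below uses Cyrillic variable names (hypothetical names, chosen only for the demonstration) and checks with str.isidentifier() which strings Python would accept as a name.

```python
# Non-ASCII letters are allowed in identifiers.
# The names below are hypothetical: "price", "quantity" and "total" in Bulgarian.
цена = 10
количество = 3
общо = цена * количество
print(общо)

# Letters qualify as identifiers; symbols such as ☕ do not.
print("цена".isidentifier())
print("☕".isidentifier())
```

Whether non-ASCII names are a good idea in shared code is a separate style question; the sketch only shows that the language permits them.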

`ord()` and `chr()` functions

ord(char) - returns the integer Unicode code point of the given character char.
chr(i) - returns the string containing the character whose Unicode code point is the integer i.


            print(ord('я'))
            # 1103
            print(chr(1103))
            # я

Unicode symbols in Python strings

You can use Unicode symbols directly in strings, or you can enter them using escape sequences.
Example - various ways to represent the symbol Ѣ:


            # Unicode symbol in string:
            print("Ѣ")

            # Using the character name:
            print("\N{Cyrillic Capital Letter Yat}")

            # Using a 16-bit hex value code point:
            print("\u0462")

            # Using a 32-bit hex value code point:
            print("\U00000462")
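A quick way to convince yourself that the four notations above produce the same string is to compare them. The sketch also uses the standard unicodedata module to look up the official character name.

```python
import unicodedata

# The same character written in four different ways.
forms = ["Ѣ", "\N{Cyrillic Capital Letter Yat}", "\u0462", "\U00000462"]

# Every form compares equal to the first one.
all_equal = all(form == forms[0] for form in forms)
print(all_equal)

# Look up the official name, then go back from the name to the character.
name = unicodedata.name("Ѣ")
print(name)
print(unicodedata.lookup(name) == "Ѣ")
```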

Encode-Decode

Overview

Encoding and decoding refer to converting data between different formats, typically from text to a byte representation and back.
This matters whenever data must be in byte form, for example when reading and writing files or sending data over a network.
Encoding is the process of converting a string (text) into bytes using a specific character encoding, such as UTF-8 or ASCII.
Decoding is the reverse process, where a bytes object is converted back into a string using the same character encoding.

Encode-Decode Flow

A byte stream contains only bytes, but we can interpret (i.e. decode) them as characters using some character table.
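The flow can be sketched as a round trip: encode a string to bytes, decode it back with the same table, and then, for contrast, decode the same bytes with a different table. The variable names are illustrative only.

```python
text = "абв"

# Encode: str -> bytes, using UTF-8 as the character table.
data = text.encode("utf-8")

# Decode with the same table: we get the original text back.
restored = data.decode("utf-8")
print(restored == text)

# Decode with a different table: the bytes are the same,
# but they are interpreted as different characters.
garbled = data.decode("cp1251")
print(garbled)
```

The second decode does not fail. It silently yields the wrong characters, which is why the encoding used for writing must be known when reading.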

encode() - convert String to Bytes

`str.encode()` - syntax


            str.encode(encoding="utf-8", errors="strict")

Returns an encoded version of the string as a bytes object.
The default encoding is 'utf-8'.
List of possible encodings: Standard Encodings
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError.
More on str.encode(): str.encode @python3 docs
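Besides 'strict', the errors parameter accepts other standard handler names. The following sketch encodes a mixed string as ASCII with four of them, so the differences are easy to compare. The results dictionary is just a helper for this illustration.

```python
text = "123абв"

# Try several standard error handlers while encoding to ASCII.
results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    results[handler] = text.encode("ascii", errors=handler)
    print(f"{handler:>18}: {results[handler]}")
```

'ignore' and 'replace' lose information, while the last two keep enough detail to recover the original characters.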

`str.encode()` - example


            string = "123абв"

            bytes_string = string.encode()

            print("Byte object:", bytes_string)
            print("Type:", type(bytes_string))
            print("Length:", len(bytes_string))


            # Byte object: b'123\xd0\xb0\xd0\xb1\xd0\xb2'
            # Type: <class 'bytes'>
            # Length: 9

Note that len() of a bytes object returns the number of bytes, not the number of characters that were encoded!

UnicodeEncodeError - example

Encoding a string as ASCII raises UnicodeEncodeError if the string contains non-ASCII characters.
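A minimal sketch of this failure, reusing the sample string from the earlier encode example. The caught variable is only there so the exception can be inspected after the try block.

```python
text = "123абв"

caught = None
try:
    # 'а', 'б' and 'в' have no ASCII representation, so strict encoding fails.
    text.encode("ascii")
except UnicodeEncodeError as err:
    caught = err
    print("Encoding failed:", err)
    print("First offending position:", err.start)
```

The digits at positions 0 to 2 encode fine; the error points at the first Cyrillic letter.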

decode() - convert Bytes to String

`bytes.decode()` - syntax


            bytes.decode(encoding="utf-8", errors="strict")

Returns a string decoded from the given bytes.
The default encoding is 'utf-8'.
List of possible encodings: Standard Encodings
The default for errors is 'strict', meaning that decoding errors raise a UnicodeError.
More on bytes.decode(): bytes.decode @python3 docs
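To illustrate the errors parameter on the decoding side, this sketch deliberately cuts a UTF-8 byte sequence in the middle of a character. Strict decoding then fails, while errors="replace" substitutes the replacement character U+FFFD. The truncation is an artificial setup for the example.

```python
data = "абв".encode("utf-8")  # 6 bytes, 2 per letter
truncated = data[:-1]          # drop the last byte, cutting 'в' in half

strict_failed = False
try:
    truncated.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError as err:
    strict_failed = True
    print("Decoding failed:", err)

# With errors="replace" the broken tail becomes U+FFFD.
lenient = truncated.decode("utf-8", errors="replace")
print(lenient)
```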

`bytes.decode()` - example


            byte_string = b'\xd0\xb0\xd0\xb1\xd0\xb2'
            string = byte_string.decode()

            print("String object:", string)
            print("Type:", type(string))
            print("String length:", len(string))
            print("Byte_string length:", len(byte_string))

            # String object: абв
            # Type: <class 'str'>
            # String length: 3
            # Byte_string length: 6

The `bytes` object

Bytes Object

A bytes object is a built-in data type in Python representing a sequence of bytes.
Bytes objects are created using the bytes() constructor or a bytes literal prefixed with b, such as b'abc'.
Bytes objects are immutable (cannot be changed after creation).
They are typically used to store raw binary data, such as images, audio files, or network packets.
More on bytes objects: Bytes objects and Bytearrays
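The properties listed above can be tried out with a short sketch. All variable names are illustrative, and bytearray is shown only as the mutable counterpart.

```python
# Three ways to build the same bytes object.
from_literal = b"abc"
from_list = bytes([97, 98, 99])
from_text = bytes("abc", "ascii")
print(from_literal == from_list == from_text)

# Indexing yields an int, slicing yields a new bytes object.
first = from_literal[0]
head = from_literal[0:1]
print(first, head)

# bytes is immutable: item assignment raises TypeError.
immutable = False
try:
    from_literal[0] = 65
except TypeError:
    immutable = True

# bytearray is the mutable counterpart.
mutable = bytearray(from_literal)
mutable[0] = 65
print(mutable)
```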

Encode/decode examples and use-cases

encode/write examples

There are two ways to save a Python string into a text file encoded as 'cp1251'.
Usually, for such a task, you would use open() in text mode with the encoding option.
Alternatively, you can call the encode() method yourself and write the resulting bytes in binary mode.
Note that if you open the resulting txt files in VSCode (which uses UTF-8 by default), you'll see "��". You can tell VSCode (CTRL+SHIFT+P) to "Change File Encoding" => "Reopen with Encoding" and select "Cyrillic Windows-1251" from the encoding list.
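Both ways of writing can be sketched as follows. The file names and the temporary directory are assumptions made for this illustration, so that the sketch does not touch your working directory.

```python
import tempfile
from pathlib import Path

text = "123абв"
out_dir = Path(tempfile.mkdtemp())  # hypothetical scratch location

# 1) Text mode: open() encodes the string for us.
text_path = out_dir / "text_mode.txt"
with open(text_path, "w", encoding="cp1251") as f:
    f.write(text)

# 2) Binary mode: we encode explicitly and write raw bytes.
bytes_path = out_dir / "binary_mode.txt"
with open(bytes_path, "wb") as f:
    f.write(text.encode("cp1251"))

# Both files contain exactly the same bytes.
print(text_path.read_bytes() == bytes_path.read_bytes())
```

Text mode is usually the more convenient choice, since it also handles newline translation. The explicit encode() variant is useful when you already work with binary files.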

decode/read examples

There are two ways to read a text file encoded as 'cp1251'.
Usually, for such a task, you would use open() in text mode with the encoding option.
Alternatively, you can read the file in binary mode and call the decode() method on the bytes.
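The two ways of reading can be sketched like this. The sample file is created first in a temporary directory (an assumption of this example) so that the sketch is self-contained.

```python
import tempfile
from pathlib import Path

# Prepare a cp1251-encoded sample file (hypothetical name and location).
path = Path(tempfile.mkdtemp()) / "cp1251_sample.txt"
path.write_bytes("123абв".encode("cp1251"))

# 1) Text mode: open() decodes the bytes for us.
with open(path, "r", encoding="cp1251") as f:
    from_text_mode = f.read()

# 2) Binary mode: we read raw bytes and decode explicitly.
with open(path, "rb") as f:
    from_binary_mode = f.read().decode("cp1251")

print(from_text_mode == from_binary_mode)
```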

Resources

Texts

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Unicode HOWTO @python docs

Exercises

Task1: guess_the_quotes

The Task

Given is the following file: quotes.txt, containing quotes in Cyrillic from a famous writer. However, the file is encoded in KOI8-R.
Write a program, koi8r_to_utf8.py, which receives an input file name as an argument and creates a UTF-8 encoded file with the same name, but with the suffix "_utf8_" added (quotes_utf8_.txt).
Now you'll be able to open and read the text with your favourite editor.

Program usage example


            .
            ├── koi8r_to_utf8.py
            └── quotes.txt


            $ python koi8r_to_utf8.py quotes.txt


            .
            ├── koi8r_to_utf8.py
            ├── quotes.txt
            └── quotes_utf8_.txt

Make sure that quotes_utf8_.txt is properly converted and readable!

These slides are based on a customised version of Hakimel's reveal.js framework.