Python3 represents strings in Unicode
The default encoding for Python3 source code is UTF-8
Python 3 also supports Unicode characters in identifiers
низ = "This is a normal Python string :ছ 𝄞 ☕"
print(низ)
# This is a normal Python string :ছ 𝄞 ☕
ord() and chr() functions
ord(char) - returns an integer representing the Unicode code point of the given char.
chr(i) - returns the string representing the character whose Unicode code point is the integer i.
print( ord('я') )
# 1103
print( chr(1103) )
# я
# Unicode symbol in string:
print("Ѣ")
# Using the character name:
print("\N{Cyrillic Capital Letter Yat}")
# Using a 16-bit hex value code point:
print("\u0462")
# Using a 32-bit hex value code point:
print("\U00000462")
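To find the name or code point to use in the escapes above, the standard unicodedata module can help. This is a minimal sketch; the variable names are chosen for illustration only.

```python
import unicodedata

char = "Ѣ"

# Official Unicode name, usable inside a "\N{...}" escape
name = unicodedata.name(char)

# Code point formatted the way it is usually written in Unicode charts
code_point = f"U+{ord(char):04X}"

# lookup() goes the other way: from a name back to the character
same_char = unicodedata.lookup(name)

print(name, code_point, same_char == char)
# CYRILLIC CAPITAL LETTER YAT U+0462 True
```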
str.encode() - syntax
str.encode(encoding="utf-8", errors="strict")
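The errors argument decides what happens when a character cannot be represented in the target encoding. The sketch below encodes to ASCII on purpose, so that the Cyrillic letters fail, and compares a few of the standard error handlers.

```python
text = "123абв"

# "replace" substitutes each unencodable character with "?"
replaced = text.encode("ascii", errors="replace")

# "ignore" silently drops unencodable characters
ignored = text.encode("ascii", errors="ignore")

# "backslashreplace" keeps the information as escape sequences
escaped = text.encode("ascii", errors="backslashreplace")

# "strict" (the default) raises UnicodeEncodeError
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced)       # b'123???'
print(ignored)        # b'123'
print(escaped)        # b'123\\u0430\\u0431\\u0432'
print(strict_failed)  # True
```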
str.encode() - example
string = "123абв"
str_in_utf = string.encode()
print("Byte object:", str_in_utf)
print("Type: ",type(str_in_utf) )
print("Length:",len(str_in_utf) )
#Byte object: b'123\xd0\xb0\xd0\xb1\xd0\xb2'
#Type: <class 'bytes'>
#Length: 9
Note that len() of a bytes object returns the number of bytes, not the number of characters encoded!
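How many bytes a string takes depends on the encoding, while the number of characters stays the same. A small sketch comparing three encodings (the choice of encodings is only for illustration):

```python
text = "123абв"

# Byte length of the same string in different encodings
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "cp1251", "utf-16-le")}

print(len(text))  # 6
print(sizes)      # {'utf-8': 9, 'cp1251': 6, 'utf-16-le': 12}
```

In UTF-8 the ASCII digits take one byte each and the Cyrillic letters two; cp1251 uses one byte per character; UTF-16 uses two bytes for every character in this string.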
bytes object and bytestring
A bytes object represents an immutable sequence of integers in the range 0 <= x < 256.
bytestring vs str in Python3
A bytestring represents an immutable sequence of bytes, without implying any particular interpretation.
A str represents an immutable sequence of Unicode code points, without implying any particular binary encoding.
bytes.decode() - syntax
bytes.decode(encoding="utf-8", errors="strict")
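As with encoding, the errors argument controls what happens when the bytes are not valid in the given encoding. In this sketch the input is deliberately truncated in the middle of a two-byte UTF-8 sequence.

```python
# "а" is two bytes in UTF-8; the extra b"\xd0" is the first half of another letter
data = "а".encode("utf-8") + b"\xd0"

# "strict" (the default) raises UnicodeDecodeError
try:
    data.decode()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" inserts U+FFFD (the replacement character) for the broken part
replaced = data.decode(errors="replace")

# "ignore" drops the bytes that cannot be decoded
ignored = data.decode(errors="ignore")

print(strict_failed)       # True
print(replaced == "а\ufffd")  # True
print(ignored)             # а
```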
bytes.decode() - example
str_in_bytes = b'123\xd0\xb0\xd0\xb1\xd0\xb2'
str_in_utf8 = str_in_bytes.decode()
print("String object:", str_in_utf8)
print("Type: ",type(str_in_utf8) )
print("Length:",len(str_in_utf8) )
# String object: 123абв
# Type: <class 'str'>
# Length: 6
Note that len() of a string object returns the number of characters, not the number of bytes decoded!
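The difference between a bytes object (a sequence of integers) and a str (a sequence of Unicode code points) shows up as soon as you index or iterate over them. A minimal sketch:

```python
text = "аб"
data = text.encode("utf-8")

# Iterating over bytes yields integers in the range 0..255
byte_values = list(data)

# Iterating over str yields characters; ord() gives their code points
code_points = [ord(ch) for ch in text]

# Both types are immutable: item assignment raises TypeError
try:
    data[0] = 0
    immutable = False
except TypeError:
    immutable = True

print(byte_values)  # [208, 176, 208, 177]
print(code_points)  # [1072, 1073]
print(data[0])      # 208
print(text[0])      # а
print(immutable)    # True
```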
str.encode()
string = "АБВ 123 ИЙК ЩЮЯ"
# open a file handle for writing in binary mode
with open("encode_to_cp1251.txt", "w+b") as fh:
    bytes_sequence = string.encode(encoding="cp1251")
    fh.write(bytes_sequence)
# now re-open the file: encode_to_cp1251.txt with encoding cp1251(windows-1251)
Usually, for such a task, you would use open() in text mode with the encoding option, as shown on the next slide.
open(filename, mode, encoding)
string = "АБВ 123 ИЙК ЩЮЯ"
# open a file for writing in text mode, with encoding="cp1251"
with open("write_to_cp1251.txt", "w+", encoding="cp1251") as fh:
    fh.write(string)
# now re-open the file: write_to_cp1251.txt with encoding cp1251(windows-1251)
bytes.decode()
win1251file = "encode_to_cp1251.txt"
# open a file handle for reading in binary mode
with open(win1251file, "r+b") as f:
    bytestring = f.read()
decoded_string = bytestring.decode(encoding="cp1251")
print(decoded_string)
# АБВ 123 ИЙК ЩЮЯ
open(filename, mode, encoding)
win1251file = "write_to_cp1251.txt"
# open a file handle for reading in text mode, with encoding="cp1251"
with open(win1251file, "r", encoding="cp1251") as f:
    print(f.read())
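It helps to see what goes wrong when cp1251 data is read with the wrong encoding. The sketch below works on bytes in memory instead of a file, to stay self-contained; the same thing happens when open() is given the wrong encoding argument.

```python
raw = "АБВ".encode("cp1251")   # b'\xc0\xc1\xc2'

# Correct encoding: the original text comes back
right = raw.decode("cp1251")

# Wrong single-byte encoding: no error, but the text is garbage
wrong = raw.decode("latin-1")

# Wrong multi-byte encoding: these bytes are not valid UTF-8 at all
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(right)        # АБВ
print(wrong)        # ÀÁÂ
print(utf8_failed)  # True
```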
import base64
passwd = 'abracadabra'
# base64 needs a byte string
# encoded = base64.b64encode(b'data to be encoded')
passwd_bytes = passwd.encode()
passwd_b64 = base64.b64encode(passwd_bytes)
print(f'passwd_b64: {passwd_b64}')
# passwd_b64: b'YWJyYWNhZGFicmE='
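Going back from Base64 to text follows the same steps in reverse: b64decode() returns bytes, which then have to be decoded into a str. A short sketch using the value printed above:

```python
import base64

passwd_b64 = b'YWJyYWNhZGFicmE='

# b64decode returns a bytes object
passwd_bytes = base64.b64decode(passwd_b64)

# decode the bytes (UTF-8 by default) to get the string back
passwd = passwd_bytes.decode()

print(passwd_bytes)  # b'abracadabra'
print(passwd)        # abracadabra
```

Since anyone can reverse it like this, Base64 is only an encoding, not a way to protect a password.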
Task: write a program cp1251_to_utf8.py, which receives an input file name as an argument and creates a UTF-8 encoded file with the same name, but with the suffix "_utf8_" added.
.
├── cp1251_to_utf8.py
└── Silicon.Valley.sampleBGsubs.srt
$ python cp1251_to_utf8.py Silicon.Valley.sampleBGsubs.srt
.
├── cp1251_to_utf8.py
├── Silicon.Valley.sampleBGsubs.srt
└── Silicon.Valley.sampleBGsubs_utf8_.srt
Make sure that Silicon.Valley.sampleBGsubs_utf8_.srt is properly converted and readable!
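One possible way to check the result automatically is to test whether the whole output file decodes as UTF-8. The helper below is a hypothetical sketch (the name is_valid_utf8 is not part of the task). It only proves that the bytes are valid UTF-8, so opening the file and reading the subtitles is still the final check.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Example call, assuming the converted file exists in the current directory:
# is_valid_utf8("Silicon.Valley.sampleBGsubs_utf8_.srt")
```

A file that is still in cp1251 and contains Cyrillic text will almost always fail this check, because those bytes rarely form valid UTF-8 sequences.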