Regular Expressions in Python

Regular Expressions Basics

Overview

Regular Expression synonyms: Regex, RegEx, RegExp
A Regular Expression is a string pattern that either matches or does not match other strings.
You can think of it as a kind of search mechanism.

      import re

      # the string to search with regex:
      user_email = "prefix@domain.com"

      # the regex to find if user_email contains the '@' symbol:
      regex = re.compile(r'@')

      # do the match test:
      if regex.search(user_email):
        print("Match!")
      else:
        print("No match!")
    

The language

You can think of Regular Expressions as a separate language, with its own rules and specs.
In fact, Regular Expressions come from the regular languages defined by Kleene in the early 1950s.
Nowadays, almost all programming languages implement the concept of Regex.
A regex grammar includes 2 types of symbols:
Regular symbols: they are matched literally against the target string
Meta-characters: they have a special meaning and give regexes their power


        . ^ $ * + ? { } [ ] \ | ( )
      
All characters which are not metacharacters are matched literally.

metacharacters - example


      import re

      phone_numbers = ['+359 88 7123 456', '+359 88 7123456' ]

      # match numbers with format: +359 YY YXXX XXX
      regex = r'\+359\s\d{2}\s\d{4}\s\d{3}'

      for number in phone_numbers:
        if re.match(regex, number):
          print("{} is a valid number format".format(number))
        else:
          print("{} is NOT IN A VALID FORMAT".format(number))
    

      +359 88 7123 456 is a valid number format
      +359 88 7123456 is NOT IN A VALID FORMAT
    

Using regexes in Python - the re module

Overview

The built-in re module in Python provides regular expression matching operations similar to those found in Perl.
Regular expressions are compiled into Regular Expression Objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.
REs in Python are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them.

How to write regex

Regexes in Python are written as strings, which can be passed to the re.compile() method or directly to other matching methods, like re.search() and re.match().
We can use any string literal, including the raw string syntax.

Matching backslash

The raw string syntax is the most concise when we need to match a backslash.
About raw strings, from the Python docs:
When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.
A raw string cannot end in a single backslash.

      print(len(r'\n'))  # 2
      # print(len(r'\'))  # SyntaxError: a raw string cannot end in a single backslash
    

Matching backslash - example


      import re

      text = '\\stop'
      re1 = '\\\\stop'
      re2 = '\\stop'
      re3 = r'\\stop'


      if re.match(re1, text):
        print("re1 matched!")

      if re.match(re2, text):
        # does not match: the regex engine sees \stop, where \s means "a whitespace character",
        # so the backslash must be escaped in the regex as well
        print("re2 matched!")

      if re.match(re3, text):
        print("re3 matched!")
    

the re.compile() method

Compiles a regular expression pattern into a Regular Expression Object, which can then be used for matching and searching through its methods.

      import re

      text = "ABRACADABRA"

      regex = re.compile(r'aca', re.I)

      if regex.search(text):
        print('Match')
    

Regex Match Methods

regex.search(string[, pos[, endpos]])

Scans through the string looking for the first location where this regular expression produces a match.
If a match is found => returns the corresponding match object
If the string does not match the pattern => returns None
Optional parameters:
pos - the index where the search should start
endpos - the index where the search should end (exclusive)

regex.search(string[, pos[, endpos]]) - example


      import re

      text = "123abc456"
      rx = re.compile('abc')

      res = rx.search(text) # will match
      res = rx.search(text,3) # will match, 'a' is on index 3 in text
      res = rx.search(text,4) # would NOT match
    

regex.match(string[, pos[, endpos]])

Matches only at the beginning of the string (or at pos, if given)
Otherwise it acts like the regex.search() method

regex.match(string[, pos[, endpos]]) - example


      import re

      text = "123abc456"
      rx = re.compile('abc')

      res = rx.match(text) # will NOT match, 'abc' is not in the beginning
      res = rx.match(text, 3) # will match, as matching starts from index 3
    

regex.findall(string[, pos[, endpos]])

Returns a list of strings containing all non-overlapping matches of regex in the string
The string is scanned left-to-right, and matches are returned in the order found

regex.findall(string[, pos[, endpos]]) - example


      import re

      text = "123abc456abcabc"
      rx = re.compile('abc')

      res = rx.findall(text) # ['abc', 'abc', 'abc']
      res = re.findall(r'\dabc', text) # ['3abc', '6abc']
    

Other Matching Methods

regex.finditer(string[, pos[, endpos]])
regex.fullmatch(string[, pos[, endpos]])
regex.sub(repl, string, count=0)
regex.subn(repl, string, count=0)
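
A minimal sketch of the four methods listed above. The sample text and the date pattern are assumptions made only for illustration:

```python
import re

text = "2023-01-15 and 2024-12-31"
date_rx = re.compile(r"\d{4}-\d{2}-\d{2}")

# finditer() yields one Match object per non-overlapping match
spans = [m.span() for m in date_rx.finditer(text)]
print(spans)  # [(0, 10), (15, 25)]

# fullmatch() succeeds only if the whole string matches the pattern
print(date_rx.fullmatch("2023-01-15") is not None)  # True
print(date_rx.fullmatch(text) is not None)          # False

# sub() returns the string with the matches replaced; count limits the replacements
print(date_rx.sub("<date>", text))           # <date> and <date>
print(date_rx.sub("<date>", text, count=1))  # <date> and 2024-12-31

# subn() returns a tuple: (new_string, number_of_replacements)
print(date_rx.subn("<date>", text))  # ('<date> and <date>', 2)
```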

re module-level matching methods

The methods described above are methods of Regular Expression Objects.
Python defines the same operations as functions of the re module, like:
re.search(pattern, string, flags=0)
and so on...
The difference is that we must pass the pattern string as the first argument, and the optional flags at the end.
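
As a small sketch, the two calls below should be equivalent. The variable names are chosen here just for the example:

```python
import re

text = "ABRACADABRA"

# Method of a compiled Regular Expression Object
compiled = re.compile(r"aca", re.I)
m1 = compiled.search(text)

# Module-level function: pattern first, flags last
m2 = re.search(r"aca", text, flags=re.I)

print(m1.group(), m2.group())  # ACA ACA
print(m1.span() == m2.span())  # True
```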

When to use Regex Match Methods?

The module-level match functions compile the given regex string and keep the compiled object in a cache, so future calls using the same RE won’t need to parse the pattern again.
But each call still has to look the pattern up in that cache, which is needless work when the function is called in a loop. That's why, in loops, it is better to use a precompiled regex.
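
One way to follow this advice is to compile once, outside the loop. This is only a sketch; the log lines and the names log_lines and error_rx are hypothetical:

```python
import re

# Hypothetical input data, used only for illustration
log_lines = [
    "INFO  service started",
    "ERROR disk full",
    "DEBUG cache hit",
    "ERROR timeout",
]

# Compile once, before the loop
error_rx = re.compile(r"^ERROR\s+(.*)")

errors = []
for line in log_lines:
    m = error_rx.match(line)  # reuses the already compiled object
    if m:
        errors.append(m.group(1))

print(errors)  # ['disk full', 'timeout']
```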

Match Objects

The Match Object

Overview

The match() and search() methods return a Match Object when the pattern matches (and None otherwise).
A Match Object always has a boolean value of True.
It contains useful information about the matched string.

Match Object Methods

Method/Attribute    Purpose
group()             Return the string matched by the RE
groups()            Return a tuple containing all the subgroups of the match
start()             Return the starting position of the match
end()               Return the ending position of the match
span()              Return a tuple containing the (start, end) positions of the match

More methods are described in the Match Objects section of the re module documentation.
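
A short sketch of the position-related methods, assuming a simple literal pattern:

```python
import re

text = "123abc456abc"
m = re.search(r"abc", text)

print(m.group())  # abc
print(m.start())  # 3
print(m.end())    # 6
print(m.span())   # (3, 6)

# start() and end() can be used to slice the original string
print(text[m.start():m.end()])  # abc

# A Match Object is always truthy
print(bool(m))  # True
```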

Match Object Methods - example


      import re

      text = "123abc456abc"
      rx = re.compile(r'(\d+)(abc)')

      res = rx.match(text)
      if res:
        print("res.group():", res.group()) #123abc
        print("res.groups():", res.groups()) #('123', 'abc')
      else:
        print("No match!")
    

We will discuss capturing groups on the next slides

Regex Syntax

Special Characters

Only the following characters have a special meaning in Regex:

^ $ \ . * + ? ( ) [ ] { } |

A backslash can also be combined with some ordinary characters to give them a special meaning (like \d or \s)

If we want to match a special character literally, we have to escape it with a backslash '\'
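
A minimal sketch of the difference between an unescaped and an escaped metacharacter. It also uses re.escape() from the standard library, which escapes the special characters of an arbitrary string; the sample strings are assumptions:

```python
import re

# Unescaped '.' is a metacharacter: it matches any character
any_char = re.findall(r"3.5", "3.5 3x5")
print(any_char)  # ['3.5', '3x5']

# Escaped '\.' matches only a literal dot
literal_dot = re.findall(r"3\.5", "3.5 3x5")
print(literal_dot)  # ['3.5']

# re.escape() builds a pattern that matches the given string literally
text = "price: 3.50 (approx.)"
escaped_rx = re.compile(re.escape("(approx.)"))
found = escaped_rx.search(text).group()
print(found)  # (approx.)
```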

Matching Special Characters - example


      import re

      text = "try to match: 2+3"
      rx = re.compile(r'2\+3')

      res = rx.search(text)
      if res:
        print( res.group())
    

Quantifiers

Overview

Quantifier    Description
r*            r matches 0 or more times
r+            r matches 1 or more times
r?            r matches 0 or 1 time
r{n}          r matches exactly n times
r{n,m}        r matches between n and m times (n and m are non-negative integers, n <= m)

r can be any regex!
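
A sketch that tries each quantifier from the table on a few sample strings. The helper matching() and the sample list are hypothetical, introduced only for this example:

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

def matching(pattern):
    """Return the samples that the pattern matches completely."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"ab*c"))      # ['ac', 'abc', 'abbc', 'abbbc']
print(matching(r"ab+c"))      # ['abc', 'abbc', 'abbbc']
print(matching(r"ab?c"))      # ['ac', 'abc']
print(matching(r"ab{2}c"))    # ['abbc']
print(matching(r"ab{2,3}c"))  # ['abbc', 'abbbc']

# The quantified r can itself be a group, not just a single character
group_repeat = re.fullmatch(r"(?:ab){2}", "abab") is not None
print(group_repeat)  # True
```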

Quantifiers (greedy and non-greedy match)

The quantifiers are greedy, meaning they will match as much of the string as they can:

        import re

        matched = re.search(r'a.*a', 'ala bala')
        print(matched)
        # match='ala bala', but not 'ala'
      

Quantifiers (greedy and non-greedy match)

We can make them non-greedy if we suffix them with '?':

        import re

        matched = re.search(r'a.*?a', 'ala bala')
        print(matched)
        # match='ala'
      

'*' quantifier - example


      import re

      string = 'ala bala'

      matched = re.findall(r'a.*a',string ) # greedy
      print(matched)
      #OUTPUT: ['ala bala']

      matched = re.findall(r'a.*?a',string ) # non-greedy
      print(matched)
      #OUTPUT: ['ala', 'ala']

      matched = re.findall(r'.*?', string) # non-greedy
      print(matched)
      #OUTPUT: ['', '', '', '', '', '', '', '', '']
    

{n,m} quantifier - example


      import re

      matched = re.findall(r'\d{2,4}', '123456789') # greedy
      print(matched)
      # OUTPUT: ['1234', '5678']

      matched = re.findall(r'\d{2,4}?', '123456789') # non-greedy
      print(matched)
      #OUTPUT: ['12', '34', '56', '78']
    

Character Sets

Overview

Square brackets are used to define a character set, like [abc] (which matches 'a' or 'b' or 'c').
The character set itself matches only one symbol!
Symbols inside the brackets are the elements of the set.
Most special characters lose their special meaning inside sets
The hyphen (-), when it is between 2 symbols, has a special meaning inside the character set - it defines a range, like [0-9]. If it is at the end (or at the beginning) of the set, it is treated as a literal hyphen.

Character Sets Description

Character set    Description
[abc]            Matches any one of the symbols listed ('a' or 'b' or 'c')
[a-z]            Matches any symbol from 'a' to 'z' (i.e. any lowercase Latin letter)
[^abc]           Matches any symbol except 'a' or 'b' or 'c' (i.e. the ^ negates the characters in the set)
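
A small sketch of the rules above: metacharacters inside a set, the two roles of the hyphen, and the position of '^'. The sample strings are assumptions:

```python
import re

# Inside a set, '.', '+' and '*' are ordinary characters
specials = re.findall(r"[.+*]", "1+2*3.0")
print(specials)  # ['+', '*', '.']

# A hyphen between two symbols defines a range
in_range = re.findall(r"[a-c]", "a-b-d")
print(in_range)  # ['a', 'b']

# A hyphen at the end of the set is a literal hyphen
with_hyphen = re.findall(r"[ac-]", "a-b-d")
print(with_hyphen)  # ['a', '-', '-']

# '^' negates the set only when it is the first symbol in it
negated = re.findall(r"[^0-9]", "a1b2")
print(negated)  # ['a', 'b']
caret_literal = re.findall(r"[0-9^]", "a1^2")
print(caret_literal)  # ['1', '^', '2']
```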

Character Sets examples


      import re

      # match any one of the vowels
      matched = re.findall(r'[aeiouy]', 'astroid')
      print(matched)
      #OUTPUT: ['a', 'o', 'i']

      # match any consecutive vowels - one or more times
      matched = re.findall(r'[aeiouy]+', 'astroid')
      print(matched)
      #OUTPUT: ['a', 'oi']

      # match bg mobile phone numbers
      matched = re.findall(r'\+3598[7-9][0-9]{7}', '+359888123456')
      print(matched)
      #OUTPUT: ['+359888123456']

      # match a digit from 1 to 5, or a hyphen:
      matched = re.findall(r'[1-5-]', '12-34')
      print(matched)
      #OUTPUT: ['1', '2', '-', '3', '4']
    

Character Sets Negation - examples


      import re

      # match any non-vowel:
      matched = re.findall(r'[^aeiouy]', 'astroid')
      print(matched)
      #OUTPUT: ['s', 't', 'r', 'd']
    

Character classes

Character classes can be regarded as shorthands for some of the most used character sets.
In Python 3, for str patterns, they match Unicode characters by default.
You can use the re.ASCII/(?a) flag to specify that you want only ASCII symbols to be matched.
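
A sketch of the difference the ASCII flag makes for \d. The sample text is an assumption; '\u0662' is the ARABIC-INDIC DIGIT TWO character:

```python
import re

# '1', then ARABIC-INDIC DIGIT TWO (a Unicode decimal digit), then '3'
text = "1\u06623"

# Default for str patterns: Unicode matching
unicode_digits = re.findall(r"\d", text)
print(len(unicode_digits))  # 3

# With re.ASCII only [0-9] is matched
ascii_digits = re.findall(r"\d", text, re.ASCII)
print(ascii_digits)  # ['1', '3']

# The same flag set inline
inline_ascii = re.findall(r"(?a)\d", text)
print(inline_ascii)  # ['1', '3']
```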

Character classes

Char class    Description
.             Matches any character except a newline. You can use the re.DOTALL/(?s) flag to match the newline as well
\w            Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched
\d            Matches any Unicode decimal digit, which includes [0-9], and also many other digit characters. If the ASCII flag is used, only [0-9] is matched
\s            Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters). If the ASCII flag is used, only [ \t\n\r\f\v] is matched

Character classes example


      import re

      # match bg mobile phone numbers
      matched = re.findall(r'\+3598[7-9]\d{7}', '+359888123456')
      print(matched)
      #OUTPUT: ['+359888123456']
    

Character classes example


      import re

      strings = ['petrov42','42petrov','ivan_pterov']
      rx = re.compile(r'[a-z]\w+')

      for string in strings:
        matched = rx.search(string)
        print("{} matched in {}".format(matched.group(), string))
      #OUTPUT:
      #petrov42 matched in petrov42
      #petrov matched in 42petrov
      #ivan_pterov matched in ivan_pterov
    

Character classes example


      import re

      string = """line1
      line2
      line3 line4"""

      matched = re.findall(r'line\d\s', string)
      print(matched)

      #OUTPUT: ['line1\n', 'line2\n', 'line3 ']
  	

Modifiers/Flags

Flags affect how the regular expression is executed.
They are available in the re module under a long name such as re.IGNORECASE or a short, one-letter form such as re.I.
Multiple flags can be specified by bitwise OR-ing them. For example, re.I|re.M sets both the I and M flags.
Flags are set by passing the flags argument to the re.compile() method (or to the module-level functions)
Flags can also be set in the regular expression itself, using the (?aiLmsux) syntax at the beginning of the regex

Modifiers/Flags list

In regex    As param    Description
(?i)        re.I        Case-insensitive matching
(?m)        re.M        Multiline matching: '^' and '$' also match at the beginning and end of each line
(?s)        re.S        Makes '.' match any character at all, including a newline
(?x)        re.X        Allows writing more readable regexes by using whitespace and comments ('#') in the regex
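
A sketch of combining flags, setting them inline, and using the verbose flag. The sample text and the names rx and time_rx are assumptions:

```python
import re

text = "first line\nSecond line\nthird line"

# Combine flags with bitwise OR: case-insensitive + multiline
rx = re.compile(r"^second", re.I | re.M)
combined = rx.search(text).group()
print(combined)  # Second

# The same flags set inline, at the beginning of the regex
inline = re.search(r"(?im)^second", text).group()
print(inline)  # Second

# re.X ignores whitespace in the pattern and allows '#' comments
time_rx = re.compile(r"""
    (\d{2})  # hours
    :
    (\d{2})  # minutes
""", re.X)
parts = time_rx.search("at 09:45").groups()
print(parts)  # ('09', '45')
```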

Modifiers/Flags example


      import re

      text = """123
      ABC
      456"""
      rx = re.compile('(?is)123.abc')

      res = rx.search(text)
      if res:
        print(res.group(0))
      else:
        print("No match!")
    

Anchors and Boundaries

Overview

They specify a position in the string where a match should occur.
They are zero-width, i.e. when matched they do NOT consume characters from the string.

Anchors and Boundaries

Anchor    Description
^         Matches the beginning of the string (or of each line, if the m flag is used)
$         Matches the end of the string (or of each line, if the m flag is used)
\b        Matches on word boundaries, i.e. between word (\w) and non-word (\W) characters. Note that the start and end of the string are treated as non-word characters.
\Z        Matches only at the end of the string.
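
A sketch of how '^' behaves with and without the m flag, and of the difference between '$' and '\Z'. Note one detail not listed in the table: in Python, '$' also matches just before a newline at the very end of the string. The sample text is an assumption:

```python
import re

text = "one\ntwo\n"

# Without re.M, '^' matches only at the beginning of the string
first_only = re.findall(r"^\w+", text)
print(first_only)  # ['one']

# With re.M, '^' also matches at the beginning of each line
each_line = re.findall(r"^\w+", text, re.M)
print(each_line)  # ['one', 'two']

# '$' also matches just before a trailing newline ...
dollar = re.search(r"two$", text) is not None
print(dollar)  # True

# ... while '\Z' matches only at the very end of the string
z_anchor = re.search(r"two\Z", text) is not None
print(z_anchor)  # False
z_with_newline = re.search(r"two\n\Z", text) is not None
print(z_with_newline)  # True
```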

Anchors and Boundaries example


      import re

      strings = [
        '',
        'a',
        '@',
        '@a',
        'aa',
        'a!',
        'a,a',
      ]
      rx = re.compile(r'\b')

      for string in strings:
        res = rx.findall(string)
        print("{} word boundaries counted in {}".format(len(res), string))
      #OUTPUT
      #0 word boundaries counted in
      #2 word boundaries counted in a
      #0 word boundaries counted in @
      #2 word boundaries counted in @a
      #2 word boundaries counted in aa
      #2 word boundaries counted in a!
      #4 word boundaries counted in a,a
    

Anchors and Boundaries example


      import re

      strings = [
        'ana',
        'ana bel',
      ]
      rx = re.compile(r'^a\w+a$')

      for string in strings:
        res = rx.findall(string)
        print("{} matches in {}".format(len(res), string))
      #OUTPUT:
      #1 matches in ana
      #0 matches in ana bel
    

Alternation

With alternation we can match one regex or another!
Alternation    Description
r1|r2          Matches if r1 OR r2 is matched
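
A minimal sketch of alternation. The sample strings are assumptions; the non-capturing group (?:...) is used here only to limit the scope of '|':

```python
import re

text = "cat, dog, bird, catfish"

# Match 'cat' OR 'dog'
either = re.findall(r"cat|dog", text)
print(either)  # ['cat', 'dog', 'cat']

# '|' applies to everything on its left and right,
# so a group is needed to alternate only a part of the regex
grouped = re.findall(r"gr(?:a|e)y", "gray grey griy")
print(grouped)  # ['gray', 'grey']

# Alternatives are tried left to right; the first one that matches wins
first_wins = re.search(r"cat|catfish", "catfish").group()
print(first_wins)  # cat
```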

Grouping and capturing

Grouping and back references

Parentheses, ( and ), play a dual role in regex!
They can be used for grouping regexes. Like:
(r1|r2)r3 => matches r1r3 OR r2r3, but not r1r2r3
Or they can be used to capture (remember) the matched part of the string. Like:
(r1)r2 => matches r1r2 and captures the part of the string that matched r1
If you just want to group regexes, without capturing the match, you should explicitly state that by:
(?:r1|r2) => matches r1 or r2 but does not capture the match
NB! Capturing has a cost in time and memory. If you need the parentheses just for grouping, prefer the ?: prefix.
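
A sketch comparing a capturing and a non-capturing group, followed by a simple back reference (\1), which matches the same text that group 1 captured. The sample strings are assumptions:

```python
import re

text = "strawberries"

# Capturing group: the matched part is remembered
capturing = re.search(r"(straw|rasp)berries", text)
print(capturing.group())   # strawberries
print(capturing.groups())  # ('straw',)
print(capturing.group(1))  # straw

# Non-capturing group: same match, nothing remembered
non_capturing = re.search(r"(?:straw|rasp)berries", text)
print(non_capturing.group())   # strawberries
print(non_capturing.groups())  # ()

# Back reference: find words that are repeated twice in a row
repeated = re.findall(r"\b(\w+) \1\b", "it is is what it it is")
print(repeated)  # ['is', 'it']
```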

Capturing - example


      import re

      user = 'Ivan Ivanov: +359 887123456'

      # a raw string keeps the backslashes; (?x) allows whitespace and comments
      rx = re.compile(r"""(?x)
        ([A-Z]\w+)\s+   # capture first name
        ([A-Z]\w+):\s+  # capture surname
        \+(\d{3})\s     # capture country code
        (\d{6,8})       # capture number (at most 8 digits are captured)
      """)

      res = rx.search(user)
      if res:
        for i, t in enumerate(res.groups()):
          print("Capture {}: {}".format(i, t))

      #OUTPUT:
      #Capture 0: Ivan
      #Capture 1: Ivanov
      #Capture 2: 359
      #Capture 3: 88712345
  	

Grouping regexes example


      import re

      strings = [
        'Icecream with strawberries?',
        'Icecream with blueberries?',
        'Icecream with raspberries?',
        'Icecream with strawraspberries?',
        'Icecream with berries?',
      ]
      rx = re.compile(r'\b(?:straw|rasp)?berries')

      for string in strings:
        res = rx.search(string)
        if res:
          print('{} YES!'.format(string))
        else:
          print('{} NO!'.format(string))
      #OUTPUT:
      #Icecream with strawberries? YES!
      #Icecream with blueberries? NO!
      #Icecream with raspberries? YES!
      #Icecream with strawraspberries? NO!
      #Icecream with berries? YES!
    

Online Regex Testers for Python

regex101.com - Online regex tester and debugger: PHP, PCRE, Python, Golang and

Exercises

Online Challenges

Practice Regex @Hackerrank

These slides are based on a customised version of Hakimel's reveal.js framework