Skip to content

Regular Expressions

Regular Expressions#

Match patterns against text

Reading a file#

Use open()

names_file = open("names.txt", encoding="utf-8")
data = names_file.read()
names_file.close()

a better way:

with open("some_file.txt") as open_file:
    data = open_file.read()

Get the regex library#

import re

Match#

Matches from beginning of string

The r tells python it is a raw string (no escape character)

re.match(r'Love', data)

Match anywhere in string

re.search(r'Kenneth', data)

Findall#

Finds all places where it doesn’t overlap

Escape characters#

  • \w - any unicode word character
  • \W - anything that isn’t unicode
  • \s - whitespace
  • \S - non-whitespace
  • \d - any number 0 - 9
  • \D - anything that isn’t a number
  • \b - word boundary (edges of a word)
  • \B - anything not edges of a word

Parenthesis define a group in regular expressions

You have to escpae them with \(

Frequency#

  • {3} - exactly 3 times
  • (,3) - 0 to 3 times
  • {3,} - 3 or more times
  • {3,5} - 3 to 5 times
  • ? - optional (0 or 1 time)
  • * - occurs at least 0 times
  • + - occurs 1 or more times

eg.

re.search(r'\w+')
re.search(r'\(?\d{3}\)?')

Sets#

  • [aple] - Matches apple
  • [a-z] - any lowercase letter (ranges)
  • [^2] - anything that is not 2

Email address example:

print(re.findall(r'[-\w\d+.]+@[-\w\d.]*, data))

Flags#

  • Ignore case: re.findall(r’[trehous]+\b’, data, re.IGNORECASE)

Shorthand for re.IGNORECASE is re.I

  • Muliple lines (Mulitpline regex)

Use re.VERBOSE or re.X

Add multiple flags with pipe symbol:

re.VERBOSE|re.I

Treat each multiline as a string: re.MULITLINE or re.M

Beginning and End#

Beginning: ^ End: $

Named Groups#

Use (?P<name>{your-expression-here})

Making a dictionary out of a list#

    line = re.search(r'''
        ^(?P<name>[-\w ]*,\s[-\w ]+)\t  # last and first names
        (?P<email>[-\w\d.+]+@[-\w\d.]+)\t # Email
        (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t # Phone
        (?P<job>[\w\s]+,\s[\w\s.]+)\t? # Job and company
        (?P<twitter>@[\w\d]+)?$ # twitter
    ''', data, re.X|re.M)

    print(line)
    print(line.groupdict())

Compile a pattern to an object#

Get it ready for use re.compile()

Remove data as it wont have been run against anything at that stage

Allows returning an iterable of Matches

    line = re.compile(r'''
        ^(?P<name>(?P<first>[-\w ]*),\s(?P<last>[-\w ]+))\t  # last and first names
        (?P<email>[-\w\d.+]+@[-\w\d.]+)\t # Email
        (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t # Phone
        (?P<job>[\w\s]+,\s[\w\s.]+)\t? # Job and company
        (?P<twitter>@[\w\d]+)?$ # twitter
    ''', re.X|re.M)

    print(re.search(line, data).groupdict())

    print(line.search(data).groupdict())

    for match in line.finditer(data):
        print('{first} {last} <{email}>'.format(**match.groupdict()))

String Interpolation into a Regex#

You can format a regular expression string the same way you do any other string interpolation:

pattern = r'^set groups {group} interfaces (?P<line>.[^\n]+)$'.format(
    group=group
)
lines_regex = re.compile(pattern, re.MULTILINE)
matches = lines_regex.findall(file_contents)

Findall with groupdicts#

You can’t get groupdicts if you use findall().

However there is a way to still get the groupdicts using a finditer:

[m.groupdict() for m in regex.finditer(search_string)]

Get anything that is not a newline#

re.findall(r'^(?P<line>.[^\n]+)$, words, re.MULTILINE)

Sources#