Regular Expressions in Python

Posted by Sonya Sawtelle on Thu 01 December 2016

In which I learn how to cook word salad with a side of alphabet soup.

Here I'll outline what exactly a regex is, the general anatomy of a regex and the specific python syntax for constructing them. I'm using the adorably tiny book Regular Expression Pocket Reference as well as the official docs for the re module.

Tools covered:

  • constructing regex with proper syntax in python
  • searching, replacing and splitting by regexes using the re module

General Principles of Regexes

In his pocket reference Tony has done a great job condensing the essence of regex principles and syntax, so let me start by quoting him directly.

[Regexes] are a way to describe text through pattern matching... Regular expression syntax defines a language you use to describe text.

In a regex the particular character pattern you are trying to match is described by some combination of normal characters, which mean exactly what they say (a means the letter "a"), and metacharacters and metasequences, which have special meaning like the quantity or location of characters. Sadly the particular syntax of a regex is specific to each programming language.

You can picture a regex parser examining each character in the input string one-by-one sequentially, trying to fit it into the pattern described in your regex (a really neat discussion of the actual algorithmic behavior of regex engines is here). Tony identifies two main principles of regex operation that help you predict their behavior:

  1. The leftmost match wins. The parser will return you the first section of text it finds which completely matches your described pattern even if there might be another full match later on in the input string.
  2. Most quantifiers are greedy. If a part of your pattern describes an unspecified number of characters, the parser will keep matching input characters to it as long as possible. If the full pattern ultimately fails to be matched, only then will the parser try backtracking and giving up characters to the next part of the regex.
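
Both principles are easy to see with a couple of toy searches. This is just a minimal sketch using the standard re module; the input strings are made up for illustration.

```python
import re

# Principle 1: the leftmost match wins, even though a longer run of "a"s comes later.
leftmost = re.search(r"a+", "xaay aaaa")
print(leftmost.group(), leftmost.span())  # aa (1, 3)

# Principle 2: the greedy .* first swallows the rest of the string, then gives
# characters back one at a time until the final "t" can match.
greedy = re.search(r"c.*t", "cat bat cot")
print(greedy.group())  # cat bat cot
```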

Character Sets and Encodings

"Characters" are not just standard printed symbols like letters, digits and punctuation, but also things like newline and tab characters which create specific whitespace. There are even characters to control output "devices" in various ways, like an "enquiry" character that requests a response from the device. A common division is between "printed" and "control" characters, the former being all visual symbols, not including whitespace.

The smallest element of computer memory is a bit, which has two states $\{0, 1\}$, so that with a single bit you can only encode two different characters, say a and b. With two bits you have four possible states, $\{00, 01, 10, 11\}$ and so you can encode four different characters. Eight bits is called a byte and has 256 possible states.
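
The counting argument above can be checked by brute force. In this small sketch, bit_states is a helper name invented for illustration; it simply enumerates every combination of 0s and 1s.

```python
from itertools import product

def bit_states(n_bits):
    """Return every distinct state of n_bits bits as a string of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

one_bit = bit_states(1)
two_bits = bit_states(2)
print(one_bit)              # ['0', '1']
print(two_bits)             # ['00', '01', '10', '11']
print(len(bit_states(8)))   # 256
```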

ASCII

This is a standard encoding for a specific set of characters. It defines 128 printed and control characters, which needs only 7 bits, though each character is usually stored in 1 byte ("extended" ASCII variants use the eighth bit to encode an additional 128 characters). The letters in ASCII are only English.

Unicode

This is a standard character set with a few encodings like UTF-8. For the characters in the ASCII char set, UTF-8 uses all the same binary encodings as ASCII, but it also augments this set by permitting more than 1 byte to be used for encoding and thus more total characters to be encoded. Unicode includes, for instance, letters from non-English languages.
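
Here is a quick check of that byte-level compatibility, using Python's built-in str.encode() with UTF-8 as the Unicode encoding. The sample characters are my own choice.

```python
# "a" is in the ASCII set, so UTF-8 encodes it as the same single byte.
ascii_a = "a".encode("ascii")
utf8_a = "a".encode("utf-8")
print(ascii_a, utf8_a)        # b'a' b'a'

# Characters outside ASCII take more than one byte in UTF-8.
e_acute = "\u00e9".encode("utf-8")   # the letter e with an acute accent
euro = "\u20ac".encode("utf-8")      # the euro sign
print(e_acute, len(e_acute))  # b'\xc3\xa9' 2
print(euro, len(euro))        # b'\xe2\x82\xac' 3
```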

It's common to label and identify the specific characters of a char set not by our usual integers but instead in terms of base-8 (or base-16 for unicode) which is called "octal" (or "hexadecimal"). Note that this standard numbering gives a natural ordering to the characters, which is useful in defining "slices" of a char set in regexes, like "a through z".
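
To make the numbering concrete, this sketch looks up the code of a character with the built-in ord() and then uses that ordering in a regex slice. The word "badge" is just a sample input.

```python
import re

code = ord("a")
print(code, oct(code), hex(code))  # 97 0o141 0x61
print(ord("z"))                    # 122

# [a-e] means "every character whose code lies between that of a and that of e"
in_slice = re.findall(r"[a-e]", "badge")
print(in_slice)  # ['b', 'a', 'd', 'e']
```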

Anatomy of a Regex

In his book Tony defines some really helpful categories of regex constructs that I'll repeat here with examples in Python regex syntax. Understanding the kinds of constructs that will show up should help us see the forest for the trees, so to speak.

Normal vs. Metacharacters

First recall that within a regex there are normal characters, which mean exactly what they say (a means the letter "a"), and metacharacters, which have special meaning in the regex. In python the special metacharacters are:

. ^ $ * + ? { } [ ] \ | ( )

Denoting Single Characters

There are several ways in which specific individual characters can be referred to within a regex.

  • Normal characters (not metacharacters) can just be written as-is like a means the letter "a".
  • Some control characters have special shorthand like \n for newline.
  • Characters can be denoted by escaped octal numbers like \012 for newline.
  • Characters can be denoted by escaped and lettered hexadecimal numbers like \x0D for a two-digit code (carriage return) and \uFFFF for a four-digit code.
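
As a rough check that these spellings really are interchangeable, the sketch below matches one newline character three different ways, then matches a non-ASCII letter by its four-digit code. The variable names are mine.

```python
import re

newline = "\n"
# Three spellings of the same newline character inside a regex
shorthand = re.fullmatch(r"\n", newline)
octal = re.fullmatch(r"\012", newline)         # octal 012 == decimal 10
hexadecimal = re.fullmatch(r"\x0A", newline)   # hex 0A == decimal 10
print(all(m is not None for m in (shorthand, octal, hexadecimal)))  # True

# Four-digit hexadecimal form for a non-ASCII character (U+00E9, e with acute accent)
four_digit = re.fullmatch(r"\u00E9", "\u00e9")
print(four_digit is not None)  # True
```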

Character Classes

Classes are specific subsets of characters and a regex engine will try to match a single character of the input string to a character from the set. There is a variety of syntax for specifying different sets, and inside a class you can use the dash - to mean a slice.

  • [ ] matches any of the included characters like [a-z] means match any character in the set a through z (all the lowercase letters).
  • [^ ] matches the complement of the specified characters.
  • . matches every character except the newline (by default).
  • \d, \w, and \s match all digits, word characters (alphanumerics plus underscore) and whitespace characters, respectively. The uppercase versions like \D match their complements.

Inside the [ ] subset definition most metacharacters are stripped of their meaning and revert to being regular characters (the exceptions are \, a leading ^, the slice dash - and the closing ]), and special classes will still be recognized. Thus [\d$] means the set including all digits and the dollar sign character, and [.] means match the actual period symbol (which is otherwise a metacharacter).
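
A few one-liners show the class syntax in action. This is only a sketch with made-up inputs; findall() is used because it lists every single-character match.

```python
import re

in_set = re.findall(r"[a-c]", "abcdef")
complement = re.findall(r"[^a-c]", "abcdef")
digits_or_dollar = re.findall(r"[\d$]", "cost: $42")
print(in_set)            # ['a', 'b', 'c']
print(complement)        # ['d', 'e', 'f']
print(digits_or_dollar)  # ['$', '4', '2']

# Outside brackets . is a wildcard; inside brackets it is a literal period
wildcard = re.findall(r".", "a.b")
literal_dot = re.findall(r"[.]", "a.b")
print(wildcard)     # ['a', '.', 'b']
print(literal_dot)  # ['.']
```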

Anchors

Anchors are metacharacters and metasequences that match a specific position in the input string rather than matching characters. They are also called "zero-width assertions" because they don't actually consume a character of the input string when they match.

  • ^ matches the start of the input string
  • $ matches the end of the input string
  • \b matches a "word boundary", which is a place where a word character (alphanumeric or underscore) is next to a non-word character (like punctuation) or to the start or end of the string. The uppercase \B matches any place that's not a word boundary.
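
The anchors above can be tried out on a toy sentence. In this sketch the inputs are invented, and the last line reports where \B allowed a match rather than the text itself.

```python
import re

at_start = re.search(r"^cat", "cat nap")
not_at_start = re.search(r"^cat", "a cat nap")
at_end = re.search(r"nap$", "cat nap")
print(at_start is not None, not_at_start is not None, at_end is not None)  # True False True

# \b keeps "cat" from matching inside "concat"
whole_words = re.findall(r"\bcat\b", "cat concat cat!")
print(whole_words)  # ['cat', 'cat']

# \B does the opposite: only the "cat" buried inside "concat" qualifies
inside = [m.start() for m in re.finditer(r"\Bcat", "cat concat cat!")]
print(inside)  # [7]
```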

Lookarounds are a different kind of zero-width assertion. They match positions where a specified sub-pattern would have matched, but they don't actually consume those characters that match the sub-pattern.

  • Lookahead matches locations where a subpattern is or is not matched in the subsequent text. Like foo(?=bar) matches all "foo"s followed by "bar"s, while foo(?!bar) matches all "foo"s not followed by "bar"s.
  • Lookbehind matches locations where a subpattern is or is not matched in the preceding text. Like (?<=foo)bar matches all "bar"s preceded by "foo"s, while (?<!foo)bar matches all "bar"s not preceded by "foo"s.
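
The four lookaround forms can be sketched on two short strings. Note that the reported spans cover only "foo" or "bar", never the text inside the lookaround, since that text is not consumed.

```python
import re

text = "foobar foobaz"
ahead = re.search(r"foo(?=bar)", text)
not_ahead = re.search(r"foo(?!bar)", text)
print(ahead.group(), ahead.span())  # foo (0, 3)
print(not_ahead.span())             # (7, 10)

text2 = "foobar bazbar"
behind = [m.start() for m in re.finditer(r"(?<=foo)bar", text2)]
not_behind = [m.start() for m in re.finditer(r"(?<!foo)bar", text2)]
print(behind)      # [3]
print(not_behind)  # [10]
```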

Control Statements

These are miscellaneous functionalities that are often handy.

  • ...|... tries the two specified subpatterns in alternation, like a|b will first try to match "a" then if that fails try to match "b".
  • ( ) groups a subpattern so that the entire subpattern can be referred to by a quantifier or alternator, like (a\d)|(b\d) will first try to match "a" followed by a digit and if that fails try "b" followed by a digit.

Note that if you don't use grouping by () with the alternator | then it tries to match everything to the left of the pipe and then everything to the right of the pipe.
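
Here is a small sketch of how grouping changes what the pipe applies to. One assumption to flag: the last pattern uses a non-capturing group, (?: ), which isn't covered in this post; I use it only so that findall() returns whole matches rather than group contents.

```python
import re

# Grouping limits each alternative to "letter followed by a digit"
grouped = re.compile(r"(a\d)|(b\d)")
second_alternative = grouped.search("b7").group()
print(second_alternative)  # b7

# Without parentheses the pipe splits the whole pattern: "gra" OR "ey"
ungrouped = re.findall(r"gra|ey", "gray grey")
print(ungrouped)  # ['gra', 'ey']

# With a group, only the a/e choice alternates
with_group = re.findall(r"gr(?:a|e)y", "gray grey")
print(with_group)  # ['gray', 'grey']
```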

Quantifiers

Quantifiers control how many times the parser tries to match an element, and can be placed after single characters, character classes in brackets and subpatterns in parentheses. Greedy quantifiers will match as many times as allowed while lazy quantifiers will match as few times as allowed.

  • *, +, and ? mean greedily match at least 0 times, at least 1 time, and 0 or 1 times, respectively
  • {x,y} means greedily match at least x times but no more than y times, like a{3,5} means match the character "a" as many times as possible subject to the constraint of at least 3 but not more than 5 times. Write it without spaces: Python reads a{3, 5} as literal text rather than a quantifier.
  • {n} means match exactly n times

Each greedy quantifier has a corresponding lazy quantifier whose syntax is identical just with an additional appended ?, so for example a+? means match the character "a" as few times as possible subject to the constraint of at least one time.
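
Comparing each greedy quantifier with its lazy twin on the same input makes the difference obvious. A minimal sketch with made-up strings:

```python
import re

run = "aaaaaaa"  # seven "a"s
greedy_range = re.search(r"a{3,5}", run).group()
lazy_range = re.search(r"a{3,5}?", run).group()
greedy_plus = re.search(r"a+", run).group()
lazy_plus = re.search(r"a+?", run).group()
print(greedy_range, lazy_range)  # aaaaa aaa
print(greedy_plus, lazy_plus)    # aaaaaaa a

html = "<b>bold</b>"
greedy_tag = re.search(r"<.+>", html).group()
lazy_tag = re.search(r"<.+?>", html).group()
print(greedy_tag)  # <b>bold</b>
print(lazy_tag)    # <b>
```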

re Module for Python Regex

The re module in the python standard library is where all of Python's regex functionality lives. The official docs and this simpler HOW TO are both great resources.

In the re module you create a regex object from a pattern string by using the compile() function. The resulting object then has helper methods to do things like searching for matches or performing substitutions based on the pattern.

It's important to write your regex pattern strings as python raw strings (r"" rather than "") to avoid some craziness when it comes to backslash escaping. I'll just refer to this excellent SO answer.
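
To see the kind of craziness raw strings avoid, take the word boundary \b. In a normal Python string "\b" is the backspace character, so the regex engine never sees the anchor. A short sketch (the sample sentence is mine):

```python
import re

# In a normal string "\b" is the single backspace character;
# in a raw string it is a backslash followed by "b".
print(len("\b"), len(r"\b"))   # 1 2

plain = re.findall("\bcat\b", "a cat sat")       # looks for backspace-cat-backspace
raw = re.findall(r"\bcat\b", "a cat sat")        # word boundary, as intended
doubled = re.findall("\\bcat\\b", "a cat sat")   # also works, but harder to read
print(plain, raw, doubled)  # [] ['cat'] ['cat']
```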

In [1]:
import re

pattern_str = r"[bcf]at+y?"  # raw string (r-prefix), as explained above
regx = re.compile(pattern_str)

How will this regex match on "ba", "bat", "catty", "fattyy" or "faaty"? Let's break it down going from left to right through the regex pattern:

  1. insist on starting with a single "b", "c" or "f"
  2. insist on a single "a"
  3. insist on at least one "t", but as many as possible
  4. insist on 0 or 1 "y"s, but as many as possible

So it won't match "ba" (no "t") or "faaty" (more than one "a"). From "bat" it returns "bat", from "catty" it returns "catty", and from "fattyy" it returns "fatty".
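
A quick way to check predictions like these is to loop over the candidate strings. This sketch recompiles the same pattern and collects the results in a dictionary (the name results is mine).

```python
import re

regx = re.compile(r"[bcf]at+y?")
results = {}
for word in ["ba", "bat", "catty", "fattyy", "faaty"]:
    match = regx.search(word)
    # search() returns None when nothing matches
    results[word] = match.group() if match else None
    print(word, "->", results[word])
# ba -> None
# bat -> bat
# catty -> catty
# fattyy -> fatty
# faaty -> None
```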

This is how we actually use the regex pattern we compiled to check for matches against input strings. Remember, the regex will give us back the leftmost match and then stop.

In [3]:
# Search input string "ba" for a match to the regex - FAIL
result = regx.search("ba")
type(result)
Out[3]:
NoneType
In [4]:
# Search input string "fattyy" for a match to the regex - SUCCESS!
result = regx.search("fattyy")
type(result)
Out[4]:
_sre.SRE_Match

Inspect matches as match objects

In [5]:
# Look at what text the resulting "match" contains
result.group()
Out[5]:
'fatty'
In [6]:
# The start and stop positions of the match within the original input string
result.span()  # "fattyy"[0:5] == "fatty"
Out[6]:
(0, 5)

Get text from subpattern groups

You may think that group is a weird name for a method that shows you the text that matched the pattern. Actually in regexes each subpattern group in parentheses, ( ), has its match text internally captured and saved. If our regex has several subpattern groups defined in it then we can get all the resulting group strings with groups, while the whole matched string we get from group.

In [122]:
# Compile and search a regex that has subpattern groups defined
regx_grp = re.compile(r"(\w*)@(\w*)\.[\w]*")  # has two defined groups
result = regx_grp.search("coolemail@hotdomain.net")


# Get a tuple of the text for matched groups
result.groups()
Out[122]:
('coolemail', 'hotdomain')
In [123]:
# Get the full matched text
result.group()
Out[123]:
'coolemail@hotdomain.net'
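
Individual groups can also be pulled out by number. In this sketch of the same email search, group(0) is the whole match and groups are numbered from 1 by their opening parenthesis.

```python
import re

result = re.search(r"(\w*)@(\w*)\.[\w]*", "coolemail@hotdomain.net")
print(result.group(0))     # coolemail@hotdomain.net
print(result.group(1))     # coolemail
print(result.group(2))     # hotdomain
print(result.group(1, 2))  # ('coolemail', 'hotdomain')
print(result.span(2))      # (10, 19), where group 2 sits in the input
```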

Get ALL the matches!

The finditer() method will return an iterator that pops out a match object for each non-overlapping substring in the input string which matches the regex, while findall() just returns a list of the matched strings!

In [7]:
result = regx.finditer("my fat cat was behaving in a very batty fashion.")
In [8]:
for mtch in result:
    print(mtch.group())  # print the matched substring for each match object
fat
cat
batty
In [10]:
regx.findall("my fat cat was behaving in a very batty fashion.")
Out[10]:
['fat', 'cat', 'batty']
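
One thing to watch for: when the pattern contains subpattern groups, findall() returns the group text instead of the whole match. A small sketch with a made-up sentence:

```python
import re

text = "write to ann@example or bob@test"
no_groups = re.findall(r"\w+@\w+", text)
one_group = re.findall(r"(\w+)@\w+", text)
two_groups = re.findall(r"(\w+)@(\w+)", text)
print(no_groups)   # ['ann@example', 'bob@test']
print(one_group)   # ['ann', 'bob']
print(two_groups)  # [('ann', 'example'), ('bob', 'test')]
```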

Replace substrings with other text

You can replace occurrences of the pattern in the input string with new text of your choosing, and you can specify how many replacements will occur with kwarg count (they will always happen leftmost-first).

In [85]:
inp_str = "my fat cat was behaving in a very batty fashion."
regx.sub(repl="**CENSORED**", string=inp_str, count=2)  # Only do the first two replacements
Out[85]:
'my **CENSORED** **CENSORED** was behaving in a very batty fashion.'

You can even pass it a function, to be run on each found match object, which will dictate the text that match is replaced by.

In [86]:
# Define function which will uppercase the text of the match objects
replace_with = lambda match: match.group().upper()

regx.sub(repl=replace_with, string=inp_str)
Out[86]:
'my FAT CAT was behaving in a very BATTY fashion.'

Split strings based on pattern matches

At every occurrence of a match the matched text is removed from the input string and it is cleaved at that spot.

In [119]:
regx.split(string=inp_str)
Out[119]:
['my ', ' ', ' was behaving in a very ', ' fashion.']

Compile readable regexes in Verbose mode

Regexes suck to try to read; the syntax is just too compact and confusing. If you want to write a regex that allows you to use liberal whitespace and even include comments, then you can pass the re.VERBOSE flag as an input to the compile function and use triple quotes to enclose your multi-line regex string.

In [125]:
regx = re.compile(r"""
 [bcf]  # insist on starting with a single "b", "c" or "f"
 a      # insist on a single "a"
 t+     # insist on at least one "t", but as many as possible
 y?     # insist on 0 or 1 "y"s, but as many as possible
""", re.VERBOSE)

Practice Problems

Email headers have the form "From hurst@missouri.co.jp Fri Aug 23 11:03:04 2002". Write a regex that only matches a string of this form, and has saved subpattern groups for the email address and time. (Of course for such a rigid format str.split() would work just as well.)

In [143]:
inp_str = "From hurst@missouri.co.jp Fri Aug 23 11:03:04 2002"
regx = re.compile(r"^From\s+(\S*)\s+\w+\s+\w+\s+\d+\s+([\d:]*)\s+\d{4}")

regx.search(inp_str).groups()
Out[143]:
('hurst@missouri.co.jp', '11:03:04')
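
For comparison, here is roughly what the str.split() route would look like. It assumes the header always has exactly these seven whitespace-separated fields, which is a much stronger assumption than the regex makes.

```python
inp_str = "From hurst@missouri.co.jp Fri Aug 23 11:03:04 2002"

fields = inp_str.split()   # split on runs of whitespace
address, time = fields[1], fields[5]
print(address, time)  # hurst@missouri.co.jp 11:03:04
```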

Replace any sequence of whitespace characters, of any length, with a single space.

In [201]:
inp_str = "This    is a  crazily  spaced      sentence."
regx = re.compile(r"\s+")

regx.sub(repl=" ", string=inp_str)
Out[201]:
'This is a crazily spaced sentence.'

[From diveintopython] You're working with addresses and you need to replace the word "road" (could be any case) with "Rd." but only when it's at the end of the string (address). Hint: you can compile a case-insensitive regex.

In [215]:
inp_str1 = "540 Hard Road Road"
inp_str2 = "78 RIVER ROAD"

regx = re.compile(r"road$", re.IGNORECASE)
regx.sub(repl="Rd.", string=inp_str1)
Out[215]:
'540 Hard Road Rd.'
In [216]:
regx.sub(repl="Rd.", string=inp_str2)
Out[216]:
'78 RIVER Rd.'

[From diveintopython] You need to parse phone numbers to get the area code, local code, and last four digits. The phone numbers could come in any of the following forms:

  • 800-555-1212
  • 800 555 1212
  • 800.555.1212
  • (800) 555-1212
  • 1-800-555-1212
In [224]:
inp_str1 = "800-555-1212"
inp_str2 = "800.555.1212"
inp_str3 = "(800) 555-1212"
inp_str4 = "1-800-555-1212"
regx = re.compile(r".*(\d{3})[^\w]{1,2}(\d{3})[^\w]{1,2}(\d{4})")

regx.search(inp_str3).groups()
Out[224]:
('800', '555', '1212')
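
The cell above only tries one of the formats, so here is a sketch that runs the same regex over all five listed forms to confirm they parse identically.

```python
import re

regx = re.compile(r".*(\d{3})[^\w]{1,2}(\d{3})[^\w]{1,2}(\d{4})")
numbers = [
    "800-555-1212",
    "800 555 1212",
    "800.555.1212",
    "(800) 555-1212",
    "1-800-555-1212",
]
parsed = [regx.search(number).groups() for number in numbers]
for number, groups in zip(numbers, parsed):
    print(number, "->", groups)  # every line ends with ('800', '555', '1212')
```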

Censor every occurrence of a number in a medical statement like "The patient, aged 62, has BP measuring 130.6 over 25." Use your favorite censorship stand-in. Careful of floats with decimal points!

In [17]:
inp_str = "The patient, aged 62, has BP measuring 130.6 over 25."
regx = re.compile(r"\b\d+[.]*\d*\b")


regx.sub(repl="**CENSORED**", string=inp_str)
Out[17]:
'The patient, aged **CENSORED**, has BP measuring **CENSORED** over **CENSORED**.'

Now instead replace every occurrence of an integer with its binary equivalent - your regex should not match floats! Hint: bin() gives a text representation of the binary equivalent of a base-10 integer.

In [19]:
inp_str = "The patient, aged 62, presented with blood pressure measuring 130.6 over 25."
regx = re.compile(r"\b(?<!\d[.])\d+(?![.]\d)\b")

def replace_with(match):
    txt = match.group()
    return bin(int(txt))

regx.sub(repl=replace_with, string=inp_str)
Out[19]:
'The patient, aged 0b111110, presented with blood pressure measuring 130.6 over 0b11001.'

[From the official docs]. Find and print all adverbs in a sentence - you can assume they will all end with "ly". Hint: make sure your code can handle adverbs immediately followed by punctuation!

In [151]:
inp_str = "He was carefully disguised but captured quickly by police."
regx = re.compile(r"\b\w+ly\b")

regx.findall(inp_str)
Out[151]:
['carefully', 'quickly']

[From the HOWTO]. Match all filenames whose extension is not ".bat" and capture the filename and extension as two subpattern groups.

In [168]:
inp_str1 = "myfile.bmp"
inp_str2 = "myfile2.bat"

regx = re.compile(r"(.+)[.](?!bat)(.+)") # Use a negative look ahead
regx.search(inp_str1).groups()
Out[168]:
('myfile', 'bmp')
In [170]:
type(regx.search(inp_str2))
Out[170]:
NoneType

Split up a sentence by every occurrence of punctuation in it. Hint: punctuation is not a letter, digit, or whitespace

In [21]:
inp_str = "They say 'stop, in the name of love'. I quite agree!"

regx = re.compile(r"[^\w\s]+")  # \w already includes the digits, so no \d needed
regx.split(string=inp_str)
Out[21]:
['They say ', 'stop', ' in the name of love', ' I quite agree', '']

Match the interior content of all the html tags in the string.

In [22]:
inp_str = "<img src=test.png>this is a pic</img> <div>this is a div</div>"
regx = re.compile(r"(?<=>)([^<>]*)(?=</)")  # yeah this breaks for content with greater/less than
regx.findall(inp_str)
Out[22]:
['this is a pic', 'this is a div']

More Resources