In which I learn how to cook word salad with a side of alphabet soup.¶
Here I'll outline what exactly a regex is, the general anatomy of a regex, and the specific Python syntax for constructing them. I'm using the adorably tiny book Regular Expression Pocket Reference as well as the official docs for the re module.
Tools covered:
- constructing regexes with proper syntax in Python
- searching, replacing and splitting by regexes using the re module
General Principles of Regexes¶
In his pocket reference Tony has done a great job condensing the essence of regex principles and syntax, so let me start by quoting him directly.
[Regexes] are a way to describe text through pattern matching... Regular expression syntax defines a language you use to describe text.
In a regex the particular character pattern you are trying to match is described by some combination of normal characters, which mean exactly what they say (a means the letter "a"), and metacharacters and metasequences, which have special meaning, like the quantity or location of characters. Sadly, the exact syntax varies somewhat from one programming language to another.
You can picture a regex parser examining each character in the input string one-by-one sequentially, trying to fit it into the pattern described in your regex (a really neat discussion of the actual algorithmic behavior of regex engines is here). Tony identifies two main principles of regex operation that help you predict their behavior:
- The leftmost match wins. The parser will return you the first section of text it finds which completely matches your described pattern even if there might be another full match later on in the input string.
- Most quantifiers are greedy. If a part of your pattern describes an unspecified number of characters, the parser will keep matching input characters to it as long as possible. If the full pattern ultimately fails to be matched, only then will the parser try backtracking and giving up characters to the next part of the regex.
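Both principles can be seen in a couple of lines with Python's re module. This is only a small sketch; the sample sentence and the two patterns are my own illustration, not taken from the book.

```python
import re

text = "fat cat sat"

# Leftmost match wins: search() stops at the first position where the
# whole pattern fits, even though "sat" would also match later on.
leftmost = re.search(r"[cs]at", text)
print(leftmost.group(), leftmost.span())  # cat (4, 7)

# Greedy: ".*" first swallows the rest of the string, then backtracks
# one character at a time until the final "t" can still be matched.
greedy = re.search(r"f.*t", text)
print(greedy.group())  # fat cat sat
```

Note that the greedy match runs to the last "t" in the string, not the first one after "f".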
Character Sets and Encodings¶
"Characters" are not just standard printed symbols like letters, digits and punctuation, but also things like newline and tab characters which create specific whitespace. There are even characters to control output "devices" in various ways, like an "enquire" character that requests a response from the device. A common division is between "printed" and "control" characters, the former being all visual symbols, not including whitespace.
The smallest element of computer memory is a bit, which has two states $\{0, 1\}$, so that with a single bit you can only encode two different characters, say a and b. With two bits you have four possible states, $\{00, 01, 10, 11\}$ and so you can encode four different characters. Eight bits is called a byte and has 256 possible states.
ASCII¶
This is a standard encoding for a specific set of characters. It defines 128 printed and control characters, which needs only 7 bits, although each character is usually stored in 1 byte ("extended" ASCII variants use the eighth bit to encode an additional 128 characters). The letters in ASCII are only English.
Unicode¶
This is a standard character set that has a few encodings, like UTF-8. For the characters in the ASCII char set, UTF-8 uses all the same binary encodings as ASCII, but it also augments this set by permitting more than 1 byte to be used for encoding and thus more total characters to be encoded. Unicode includes, for instance, letters from non-English languages.
It's common to label and identify the specific characters of a char set not by ordinary decimal integers but in base 8 (or base 16 for Unicode), which is called "octal" (or "hexadecimal"). Note that this standard numbering gives a natural ordering to the characters, which is useful in defining "slices" of a char set in regexes, like "a through z".
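To make the numbering concrete, here is a rough sketch using Python's built-in ord(), oct(), hex() and str.encode(). The characters chosen are just my own examples.

```python
# Every character has an integer code; ASCII characters keep the same
# value in Unicode.
code = ord("a")
print(code, oct(code), hex(code))  # 97 0o141 0x61

# UTF-8 spends one byte on an ASCII character and more on others.
ascii_bytes = "a".encode("utf-8")
accented_bytes = "\u00e9".encode("utf-8")  # "e" with an acute accent
print(len(ascii_bytes), len(accented_bytes))  # 1 2

# The numbering orders the characters, which is what makes a slice
# like "a through z" well defined.
in_order = ord("a") < ord("m") < ord("z")
print(in_order)  # True
```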
Anatomy of a Regex¶
In his book Tony defines some really helpful categories of regex constructs that I'll repeat here with examples in Python regex syntax. Understanding the kinds of constructs that will show up should help us see the forest for the trees, so to speak.
Normal vs. Metacharacters¶
First recall that within a regex there are normal characters, which mean exactly what they say (a means the letter "a"), and metacharacters, which have special meaning in the regex. In Python the special metacharacters are:
. ^ $ * + ? { } [ ] \ | ( )
Denoting Single Characters¶
There are several ways in which specific individual characters can be referred to within a regex.
- Normal characters (not metacharacters) can just be written as-is, like a means the letter "a".
- Some control characters have special shorthand, like \n for newline.
- Characters can be denoted by escaped octal numbers, like \012 for newline.
- Characters can be denoted by escaped hexadecimal numbers, like \x0D for a two-digit code and \uFFFF for a four-digit code.
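As a quick check that these spellings really denote the same character, the sketch below matches a newline against each of them. The euro sign used for the four-digit form is my own choice of example.

```python
import re

newline = "\n"

# Three spellings of the same character (code 10): the shorthand, a
# three-digit octal escape and a two-digit hexadecimal escape.
patterns = [r"\n", r"\012", r"\x0A"]
results = [re.fullmatch(p, newline) is not None for p in patterns]
print(results)  # [True, True, True]

# A four-digit \u escape, here for the euro sign (U+20AC).
euro_match = re.fullmatch(r"\u20AC", "\u20ac")
print(euro_match is not None)  # True
```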
Character Classes¶
Classes are specific subsets of characters, and a regex engine will try to match a single character of the input string to a character from the set. There is a variety of syntax for specifying different sets, and inside a class you can use the dash - to mean a slice.
- [ ] matches any of the included characters, like [a-z] means match any character in the set a through z (all the lowercase letters).
- [^ ] matches the complement of the specified characters.
- . matches every character except the newline.
- \d, \w, and \s match all digits, word characters (alphanumerics plus underscore) and whitespace characters, respectively. Using the uppercase, like \D, means their complement.
Inside the [ ] subset definition most metacharacters are stripped of their meaning and revert to being regular characters. The exceptions are the backslash, a leading ^, a dash between two characters and the closing ], and special classes like \d will still be recognized. Thus [\d$] means the set including all digits and the dollar sign character, and [.] means match the actual period symbol (which is otherwise a metacharacter).
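Here is a small sketch of these classes in action with re.findall(). The sample strings are made up for illustration.

```python
import re

sample = "Cost: $42.50 (approx.)"

lowercase = re.findall(r"[a-z]", "Cat5")          # letters a through z
not_lowercase = re.findall(r"[^a-z]", "Cat5")     # complement of that set
digits_or_dollar = re.findall(r"[\d$]", sample)   # "$" is literal in [ ]
literal_dots = re.findall(r"[.]", sample)         # "." is literal in [ ]
any_char = re.findall(r".", "a\nb")               # "." skips the newline

print(lowercase)         # ['a', 't']
print(not_lowercase)     # ['C', '5']
print(digits_or_dollar)  # ['$', '4', '2', '5', '0']
print(literal_dots)      # ['.', '.']
print(any_char)          # ['a', 'b']
```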
Anchors¶
Anchors are metacharacters and metasequences that match a specific position in the input string rather than matching characters. They are also called "zero-width assertions" because they don't actually consume a character of the input string when they match.
- ^ matches the start of the input string.
- $ matches the end of the input string (or just before a newline at the very end).
- \b matches a "word boundary", which is a place where a word character (alphanumeric or underscore) is next to a non-word character (like punctuation) or to the start or end of the string. Using the uppercase, \B matches any place that's not a word boundary.
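The anchors can be tried out on a short string. In this sketch the text "cat concat cat" is my own example, picked so that "cat" appears at the start, inside a word and at the end.

```python
import re

text = "cat concat cat"

at_start = re.findall(r"^cat", text)         # only the leading "cat"
at_end = re.findall(r"cat$", text)           # only the trailing "cat"
whole_words = re.findall(r"\bcat\b", text)   # "cat" as a standalone word
inside_word = re.findall(r"\Bcat", text)     # "cat" not at a word start

print(at_start, at_end)  # ['cat'] ['cat']
print(whole_words)       # ['cat', 'cat']
print(inside_word)       # ['cat']

# Anchors consume nothing: the span covers only the three letters.
first_span = re.search(r"\bcat\b", text).span()
print(first_span)  # (0, 3)
```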
Lookarounds are a different kind of zero-width assertion. They match positions where a specified sub-pattern would have matched, but they don't actually consume those characters that match the sub-pattern.
- Lookahead matches locations where a subpattern is or is not matched in the subsequent text. Like foo(?=bar) matches all "foo"s followed by "bar", while foo(?!bar) matches all "foo"s not followed by "bar".
- Lookbehind matches locations where a subpattern is or is not matched in the preceding text. Like (?<=foo)bar matches all "bar"s preceded by "foo", while (?<!foo)bar matches all "bar"s not preceded by "foo".
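The four lookarounds can be compared side by side by looking at where each one matches. This is a sketch on a made-up string; the spans show that only "foo" or "bar" itself is consumed, never the text that was looked at.

```python
import re

text = "foobar foobaz bazbar"

# Lookahead: "foo" only when it is (or is not) followed by "bar".
foo_before_bar = [m.span() for m in re.finditer(r"foo(?=bar)", text)]
foo_not_before_bar = [m.span() for m in re.finditer(r"foo(?!bar)", text)]

# Lookbehind: "bar" only when it is (or is not) preceded by "foo".
bar_after_foo = [m.span() for m in re.finditer(r"(?<=foo)bar", text)]
bar_not_after_foo = [m.span() for m in re.finditer(r"(?<!foo)bar", text)]

print(foo_before_bar)      # [(0, 3)]
print(foo_not_before_bar)  # [(7, 10)]
print(bar_after_foo)       # [(3, 6)]
print(bar_not_after_foo)   # [(17, 20)]
```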
Control Statements¶
These are miscellaneous functionalities that are often handy.
- ...|... tries the two specified subpatterns in alternation, like a|b will first try to match "a" and, if that fails, try to match "b".
- ( ) groups a subpattern so that the entire subpattern can be referred to by a quantifier or alternator, like (a\d)|(b\d) will first try to match "a" followed by a digit and, if that fails, try "b" followed by a digit.
Note that if you don't use grouping by () with the alternator | then it tries to match everything to the left of the pipe and then everything to the right of the pipe.
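The effect of grouping on the pipe is easiest to see by running the same alternation with and without parentheses. The "gray/grey" strings below are my own illustration.

```python
import re

# Without parentheses the pipe splits the whole pattern in two:
# "a then a digit" OR "b then a digit".
ungrouped = re.findall(r"a\d|b\d", "a1 b2 c3")
print(ungrouped)  # ['a1', 'b2']

# Grouping limits the alternation to one part of the pattern:
# "gr", then "a" or "e", then "y".
grouped = [m.group() for m in re.finditer(r"gr(a|e)y", "gray grey groy")]
print(grouped)  # ['gray', 'grey']

# Drop the group and the pipe again splits everything: "gra" OR "ey".
split_by_pipe = re.findall(r"gra|ey", "gray grey groy")
print(split_by_pipe)  # ['gra', 'ey']
```

finditer() is used for the grouped pattern because findall() returns only the captured group text when a pattern contains a group.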
Quantifiers¶
Quantifiers control how many times the parser tries to match an element, and can be placed after single characters, character classes in brackets and subpatterns in parentheses. Greedy quantifiers will match as many times as allowed while lazy quantifiers will match as few times as allowed.
- *, +, and ? mean greedily match at least 0 times, at least 1 time, and 0 or 1 times, respectively.
- {x,y} means greedily match at least x times but no more than y times, like a{3,5} means match the character "a" as many times as possible subject to the constraint of at least 3 but not more than 5 times. Write it without a space after the comma, otherwise Python treats the braces as literal text.
- {n} means match exactly n times.
Each greedy quantifier has a corresponding lazy quantifier whose syntax is identical, just with an additional appended ?, so for example a+? means match the character "a" as few times as possible subject to the constraint of at least one time.
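A short sketch comparing the greedy and lazy forms on the same input. The strings are chosen by me for illustration.

```python
import re

text = "aaaaaa"  # six "a"s

greedy_plus = re.search(r"a+", text).group()
lazy_plus = re.search(r"a+?", text).group()
greedy_range = re.search(r"a{3,5}", text).group()
lazy_range = re.search(r"a{3,5}?", text).group()
exact = re.search(r"a{2}", text).group()

print(greedy_plus, lazy_plus)    # aaaaaa a
print(greedy_range, lazy_range)  # aaaaa aaa
print(exact)                     # aa

# The difference matters most when something has to follow the quantifier.
html = "<b>bold</b>"
greedy_tag = re.search(r"<.*>", html).group()
lazy_tag = re.search(r"<.*?>", html).group()
print(greedy_tag)  # <b>bold</b>
print(lazy_tag)    # <b>
```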
re Module for Python Regex¶
The re module in the Python standard library is where all of Python's regex functionality lives; the official docs and this simpler HOW TO are both great resources.
In the re module you create a regex object from a pattern string by using the compile() function. The resulting object then has helper methods to do things like searching for matches or performing substitutions based on the pattern.
It's important to write your regex pattern strings as Python raw strings (r"" rather than "") to avoid some craziness when it comes to backslash escaping. I'll just refer to this excellent SO answer.
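One concrete case of that craziness, as a minimal sketch: in a normal Python string "\b" is the backspace character, so the regex engine never sees the word-boundary sequence. The test sentence here is my own.

```python
import re

normal = "\bcat\b"   # Python turns each "\b" into a backspace (code 8)
raw = r"\bcat\b"     # backslashes survive, so re sees word boundaries
print(len(normal), len(raw))  # 5 7

found_with_normal = re.search(normal, "a cat sat")
found_with_raw = re.search(raw, "a cat sat")
print(found_with_normal)       # None
print(found_with_raw.group())  # cat
```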
import re
pattern_str = r"[bcf]at+y?"  # see above for why the r-prefix
regx = re.compile(pattern_str)
How will this regex match on "ba", "bat", "catty", "fattyy" or "faaty"? Let's break it down going from left to right through the regex pattern:
- insist on starting with a single "b", "c" or "f"
- insist on a single "a"
- insist on at least one "t", but as many as possible
- insist on 0 or 1 "y"s, but as many as possible
So it won't match "ba" (no "t") or "faaty" (more than one "a"). From "bat" it returns "bat", from "catty" it returns "catty", and from "fattyy" it returns "fatty".
Look for a match with search¶
This is how we actually use the regex pattern we compiled to check for matches against input strings. Remember, the regex will give us back the leftmost match and then stop.
# Search input string "ba" for a match to the regex - FAIL
result = regx.search("ba")
type(result)  # NoneType: search() returns None when there is no match
# Search input string "fattyy" for a match to the regex - SUCCESS!
result = regx.search("fattyy")
type(result)  # a match object
Inspect matches as match objects¶
# Look at what text the resulting "match" contains
result.group()
# The start and stop positions of the match within the original input string
result.span()  # (0, 5), since "fattyy"[0:5] == "fatty"
Get text from subpattern groups¶
You may think that group is a weird name for a method that shows you the text that matched the pattern. Actually, in regexes each subpattern group in parentheses, ( ), has its match text internally captured and saved. If our regex has several subpattern groups defined in it then we can get all the resulting group strings with groups(), while the whole matched string we get from group().
# Compile and search a regex that has subpattern groups defined
regx_grp = re.compile(r"(\w*)@(\w*)\.[\w]*") # has two defined groups
result = regx_grp.search("coolemail@hotdomain.net")
# Get a tuple of the text for matched groups
result.groups()
# Get the full matched text
result.group()
Get ALL the matches!¶
The finditer() method will return an iterator that yields a match object for each non-overlapping substring in the input string which matches the regex, while findall() just returns a list of the string matches!
result = regx.finditer("my fat cat was behaving in a very batty fashion.")
for mtch in result:
    print(mtch.group())  # print the matched substring for each match object
regx.findall("my fat cat was behaving in a very batty fashion.")
Replace substrings with other text¶
You can replace occurrences of the pattern in the input string with new text of your choosing, and you can specify how many replacements will occur with the count keyword argument (they will always happen leftmost-first).
inp_str = "my fat cat was behaving in a very batty fashion."
regx.sub(repl="**CENSORED**", string=inp_str, count=2) # Only do the first two replacements
You can even pass it a function, to be run on each found match object, which will dictate the text that match is replaced by.
# Define function which will uppercase the text of the match objects
replace_with = lambda match: match.group().upper()
regx.sub(repl=replace_with, string=inp_str)
Split strings based on pattern matches¶
At every occurrence of a match the matched text is removed from the input string and the string is cleaved at that spot, giving back a list of the pieces.
regx.split(string=inp_str)
Compile readable regexes in Verbose mode¶
Regexes suck to try to read; the syntax is just too compact and confusing. If you want to write a regex that allows you to use liberal whitespace and even include comments, then you can pass the re.VERBOSE flag as an input to the compile function and use triple quotes to enclose your multi-line regex string.
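As a rough check that verbose mode changes only the layout and not the meaning, the sketch below writes the e-mail pattern from earlier both ways and compares the captured groups. The comments inside the pattern are my own reading of it.

```python
import re

compact = re.compile(r"(\w*)@(\w*)\.[\w]*")
verbose = re.compile(r"""
    (\w*)    # group 1: the user name
    @        # a literal "@"
    (\w*)    # group 2: the domain name
    \.       # a literal dot
    [\w]*    # the rest of the domain (not captured)
    """, re.VERBOSE)

address = "coolemail@hotdomain.net"
compact_groups = compact.search(address).groups()
verbose_groups = verbose.search(address).groups()
print(compact_groups == verbose_groups)  # True
print(verbose_groups)  # ('coolemail', 'hotdomain')

# Whitespace in the pattern is ignored, so a literal space has to be
# escaped or placed inside a character class.
spaced = re.compile(r"fat [ ] cat", re.VERBOSE)
spaced_match = spaced.search("my fat cat").group()
print(spaced_match)  # fat cat
```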
regx = re.compile(r"""
[bcf] # insist on starting with a single "b", "c" or "f"
a # insist on a single "a"
t+ # insist on at least one "t", but as many as possible
y? # insist on 0 or 1 "y"s, but as many as possible
""", re.VERBOSE)
inp_str = "From hurst@missouri.co.jp Fri Aug 23 11:03:04 2002"
regx = re.compile(r"^From\s+(\S*)\s+\w+\s+\w+\s+\d+\s+([\d:]*)\s+\d{4}")
regx.search(inp_str).groups()
Replace any sequence of whitespaces of any length with a single space.¶
inp_str = "This   is  a     crazily    spaced      sentence."
regx = re.compile(r"\s+")
regx.sub(repl=" ", string=inp_str)
[From diveintopython] You're working with addresses and you need to replace the word "road" (could be any case) with "Rd." but only when it's at the end of the string (address). Hint: you can compile a case-insensitive regex.¶
inp_str1 = "540 Hard Road Road"
inp_str2 = "78 RIVER ROAD"
regx = re.compile(r"road$", re.IGNORECASE)
regx.sub(repl="Rd.", string=inp_str1)  # '540 Hard Road Rd.'
regx.sub(repl="Rd.", string=inp_str2)  # '78 RIVER Rd.'
[From diveintopython] You need to parse phone numbers to get the area code, local code, and last four digits. The phone numbers could come in any of the following forms:¶
- 800-555-1212
- 800 555 1212
- 800.555.1212
- (800) 555-1212
- 1-800-555-1212
inp_str1 = "800-555-1212"
inp_str2 = "800.555.1212"
inp_str3 = "(800) 555-1212"
inp_str4 = "1-800-555-1212"
regx = re.compile(r".*(\d{3})[^\w]{1,2}(\d{3})[^\w]{1,2}(\d{4})")
regx.search(inp_str3).groups()
Censor every occurrence of a number in a medical statement like "The patient, aged 62, has BP measuring 130.6 over 25." Use your favorite censorship stand-in. Careful of floats with decimal points!¶
inp_str = "The patient, aged 62, has BP measuring 130.6 over 25."
regx = re.compile(r"\b\d+[.]*\d*\b")
regx.sub(repl="**CENSORED**", string=inp_str)
Now instead replace every occurrence of an integer with its binary equivalent - your regex should not match floats! Hint: bin() gives a text representation of the binary equivalent of a base-10 integer.¶
inp_str = "The patient, aged 62, presented with blood pressure measuring 130.6 over 25."
regx = re.compile(r"\b(?<!\d[.])\d+(?![.]\d)\b")
def replace_with(match):
txt = match.group()
return bin(int(txt))
regx.sub(repl=replace_with, string=inp_str)
[From the official docs]. Find and print all adverbs in a sentence - you can assume they will all end with "ly". Hint: make sure your code can handle adverbs immediately followed by punctuation!¶
inp_str = "He was carefully disguised but captured quickly by police."
regx = re.compile(r"\b\w+ly\b")
regx.findall(inp_str)
[From the HOWTO]. Match all filenames whose extension is not ".bat" and capture the filename and extension as two subpattern groups.¶
inp_str1 = "myfile.bmp"
inp_str2 = "myfile2.bat"
regx = re.compile(r"(.+)[.](?!bat)(.+)") # Use a negative look ahead
regx.search(inp_str1).groups()
type(regx.search(inp_str2))
Split up a sentence by every occurrence of punctuation in it. Hint: punctuation is not a letter, digit, or whitespace¶
inp_str = "They say 'stop, in the name of love'. I quite agree!"
regx = re.compile(r"[^\w\s]+")  # \w already covers digits, so \d is not needed
regx.split(string=inp_str)
Match the interior content of all the html tags in the string.¶
inp_str = "<img src=test.png>this is a pic</img> <div>this is a div</div>"
regx = re.compile(r"(?<=>)([^<>]*)(?=</)") # yeah this breaks for content with greater/less than
regx.findall(inp_str)
More Resources¶
- A very thorough guide to regex constructs and syntax
- An intro with lots of great examples