In which the contents of numerous spam folders gradually erode my faith in humanity.¶
Week 7 of Andrew Ng's ML course on Coursera introduces the Support Vector Machine algorithm and challenges us to use it for classifying email as spam or ham. Here I use the SpamAssassin public corpus to build an SVM spam email classifier in order to learn about the relevant Python tools. Part I focuses on the preprocessing of individual emails, while Part II focuses on the actual classifier.
Tools Covered:¶
- re for regular expressions to do Natural Language Processing (NLP)
- stopwords text corpus for removing information-poor words in NLP
- SnowballStemmer for stemming text in NLP
- BeautifulSoup for HTML parsing
# Set up environment
import scipy.io
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
import pickle
import os
import re
from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
import snips as snp # my snippets
snp.prettyplot(matplotlib) # my aesthetic preferences for plotting
%matplotlib inline
cd hw-wk7-spam-preprocessing
Quick Look at the Data¶
I'm going to pull a set of spam and "ham" (non-spam) emails from the SpamAssassin public corpus data sets. Each email is stored as a plain text file containing the email header information and the email body, including HTML markup if applicable.
# Setup for accessing all the spam and ham text files
from os import listdir
from os.path import isfile, join
spampath = join(os.getcwd(), "spam")
spamfiles = [join(spampath, fname) for fname in listdir(spampath)]
hampath = join(os.getcwd(), "easy_ham")
hamfiles = [join(hampath, fname) for fname in listdir(hampath)]
Example Formatted File¶
Here is what an email would look like if viewed with proper formatting, like in your browser.
with open(hamfiles[3]) as myfile:
    for line in myfile.readlines():
        print(line)
Example Raw File¶
Now let's look at the actual raw strings that we'll be doing all our processing on.
with open(hamfiles[3], "r") as myfile:
    lines = myfile.readlines()
print(lines)
Some preliminary thoughts: the first line of every file is the most basic header info about the originating address and the time the email was sent. There follows a section of keyword-value pairs in the form keyword: value\n. Finally, the body of each email is separated from the meta info by two newline characters, \n\n. Note that some of the email bodies contain HTML.
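Just to make that structure concrete, here is a quick throwaway helper (illustrative only, not part of the pipeline below) that reads the keyword: value header lines into a dict by splitting each line on its first colon and stopping at the blank separator:
# Illustrative only: peek at the header keyword-value pairs of one email
def peek_headers(fpath):
    '''Read the keyword: value header lines into a dict (ignores folded continuation lines; duplicates overwrite).'''
    headers = {}
    with open(fpath, "r") as myfile:
        for line in myfile.readlines()[1:]:   # skip the initial "From ..." line
            if line == "\n":                  # blank line marks the start of the body
                break
            if ":" in line:
                key, _, value = line.partition(":")
                headers[key] = value.strip()
    return headers
peek_headers(hamfiles[3])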
Easy Mode: Email Body as a Bag of Words¶
The first thing I'll try is just doing some NLP on only the email bodies, ignoring all the header info. Ultimately we need to represent each email in some numeric feature space in order to feed it into a classifier algorithm: this is where the Bag of Words model comes in. Each email is represented by a vector which quantifies the presence of specific vocab words in that email... that means we need to do some preprocessing.
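To make the Bag of Words idea concrete before diving in, here is a toy illustration (made-up mini-vocabulary and fake "emails", using collections.Counter) of how text turns into count vectors:
# Toy illustration of the Bag of Words model (not part of the pipeline)
from collections import Counter
vocab = ["free", "money", "meeting", "tomorrow"]
toy_emails = ["free money free money", "meeting tomorrow about money"]
for text in toy_emails:
    counts = Counter(text.split(" "))
    print([counts[word] for word in vocab])   # [2, 2, 0, 0] then [0, 1, 1, 1]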
First, we'll write a function that grabs only the body lines of a single email:
def get_body(fpath):
    '''Get email body lines from fpath using the first occurrence of an empty line.'''
    with open(fpath, "r") as myfile:
        try:
            lines = myfile.readlines()
            idx = lines.index("\n")  # only grabs the first instance
            return "".join(lines[idx:])
        except:
            print("Couldn't decode file %s" % (fpath,))
# Test it out
body = get_body(hamfiles[3])
body # This is the actual string we are going to be processing
print(body) # This is what it would look like properly displayed
Preprocessing Plan of Attack (order matters)¶
The order of steps in text processing matters a lot if you are trying to extract other features alongside a simple "Bag of Words" or "Word Salad" model. For instance, if you want to count the number of question marks in the email text then you should probably do it before removing all punctuation, but after replacing all http addresses (which sometimes contain special characters).
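Here is a quick demonstration of why that ordering matters, on a made-up string: a URL can itself contain a question mark, so counting question marks before substituting out URLs would inflate the count.
# Toy example: why URL substitution should come before counting question marks
toy = "Act now!!! Visit http://example.com/offer?id=1 -- are you in?"
print(toy.count("?"))                                          # 2: one inside the URL, one genuine
toy_sub = re.sub(r"(http|https)://[^\s]*", " httpaddr ", toy)
print(toy_sub.count("?"))                                      # 1: only the genuine question mark remains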
Here is a rough outline of all the steps we'll take to get from a messy, marked-up raw text to a delicious word salad:
- Strip any HTML tags and leave only text content (also count HTML tags)
- Lowercase everything
- Strip all email and web addresses (also count them)
- Strip all dollar signs and numbers (also count them)
- Strip away all other punctuation (also count exclamation and question marks)
- Standardize all white space to single space (also count newlines and blank lines)
- Count the total number of words in our word salad
- Strip away all useless "Stopwords" (like "a", "the", "at")
- Stem all the words down to their root to simplify
Now, when I say count, what I really mean is substitute each occurrence with some fixed string: for example, every web address gets replaced with "httpaddr". That way, when we ultimately convert each email to a vector of word counts, we'll get a feature that reflects the occurrence of the word "httpaddr".
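A tiny demonstration of this substitution-as-counting trick on another made-up string: after the replacement, counting the token "httpaddr" recovers exactly the number of URLs that were present.
# Toy example: substitution turns "how many URLs?" into an ordinary word count
toy, nurls = re.subn(r"(http|https)://[^\s]*", " httpaddr ", "see http://a.com and https://b.org/page")
print(nurls)                              # 2 URLs were replaced...
print(toy.split().count("httpaddr"))      # ...and the bag of words now counts them: 2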
Parsing HTML¶
Some of the email bodies contain HTML formatting. The amount of such formatting might be a helpful feature, but the tags themselves we want to strip away, along with some other HTML shorthand. Let's not reinvent the wheel though: a package called Beautiful Soup implements a great HTML parser. The parsed object can be interrogated in various ways, as it contains all the information about the HTML structure of the original document. You should check out the official soup docs.
# Parse the email body into HTML elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(body, 'html.parser')
# Pull out only the non-markup text
body = soup.get_text()
# Count the number of HTML elements and specific link elements
nhtml = len(soup.find_all())
nlinks = len(soup.find_all("a"))
# Sub in special strings for "counting"
body = body + nhtml*" htmltag " + nlinks*" linktag "
Lowercasing¶
We don't expect whether a word is capitalized or not to reflect some deep difference in tone or meaning, so we'll lowercase everything.
# Lowercase everything
body = body.lower()
Finding email and web addresses¶
We'll find and count the appearances of email and web addresses, and then replace each one with a special placeholder string. A very useful tool for all language processing is the regular expression, which is housed in the re module of the Python standard library. For more info you can refer to my brief but hopefully edifying overview of regexes in Python.
# Replace all URLs with special strings
regx = re.compile(r"(http|https)://[^\s]*")
body, nhttps = regx.subn(repl=" httpaddr ", string=body)
# Replace all email addresses with special strings
regx = re.compile(r"\b[^\s]+@[^\s]+[.][^\s]+\b")
body, nemails = regx.subn(repl=" emailaddr ", string=body)
body
Finding numbers, dollar signs, and punctuation¶
We'd like to know the frequency of numbers and of any punctuation which carries a tone, like exclamation marks, question marks, and dollar signs. After replacing the things we care about, we'll remove all other punctuation to get us closer to a pure bag of words.
# Replace all numbers with special strings
regx = re.compile(r"\b[\d.]+\b")
body = regx.sub(repl=" number ", string=body)
# Replace all $, ! and ? with special strings
regx = re.compile(r"[$]")
body = regx.sub(repl=" dollar ", string=body)
regx = re.compile(r"[!]")
body = regx.sub(repl=" exclammark ", string=body)
regx = re.compile(r"[?]")
body = regx.sub(repl=" questmark ", string=body)
# Remove all other punctuation (replace with white space)
regx = re.compile(r"([^\w\s]+)|([_-]+)")
body = regx.sub(repl=" ", string=body)
body
Standardizing White Space and Total Word Count¶
Standardizing white space is an important step, as it makes tokenizing the email into words straightforward. I do this as a last step since some of my preprocessing creates extra whitespace. The number of newlines (\n) and the number of blank lines (\n\n) might be predictive, so we'll replace those with special strings.
# Replace all blank lines and newlines with special strings
# (blank lines first, so the "\n\n" pairs still exist to be found)
regx = re.compile(r"\n\n")
body = regx.sub(repl=" blankline ", string=body)
regx = re.compile(r"\n")
body = regx.sub(repl=" newline ", string=body)
# Make all white space a single space
regx = re.compile(r"\s+")
body = regx.sub(repl=" ", string=body)
# Remove any trailing or leading white space
body = body.strip(" ")
body
This is a true bag of words, so now we can get our total word count to use in normalizing things like number of exclamation marks:
nwords = len(body.split(" "))
nwords
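As an example of that kind of normalization (just an illustrative feature, not used further here), the exclamation-mark rate falls right out of the special tokens we substituted in earlier:
# Fraction of tokens that were exclamation marks (illustrative normalized feature)
nexclam = body.split(" ").count("exclammark")
nexclam / nwords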
Remove Stop Words with nltk¶
Each email is going to have lots of common words which are the "glue" of the English language but don't carry much information. These are called Stop Words, and we will go ahead and strip them out from the start.
The Natural Language Toolkit module (nltk) defines a ton of functionality for processing text, including a corpus of these so-called stop words.
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
len(stopwords.words("english"))
stopwords.words("english")[0:10]
# Remove all useless stopwords
bodywords = body.split(" ")
keepwords = [word for word in bodywords if word not in stopwords.words('english')]
body = " ".join(keepwords)
body
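One small aside on speed: stopwords.words("english") returns a list, and the comprehension above calls it and does a linear scan for every single word. Caching it as a set gives the same result much faster, which matters once we loop over the whole corpus (an optional tweak, shown here as a sketch):
# Same filtering, but with O(1) membership tests against a cached set
stopset = set(stopwords.words("english"))
keepwords = [word for word in bodywords if word not in stopset]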
Stemming with nltk¶
This classifier is trying to determine the intent or tone of an email (spam vs. ham) by virtue of the specific words in that email. But we don't expect that a variation on the same root word, like "battery" versus "batteries", carries much difference in intent or tone. In "stemming" we replace all the variants of each root with the root itself: this reduces the complexity of the email representation without really reducing the information. The nltk module has several options for out-of-the-box stemmers.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("generously")
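And the "battery" versus "batteries" example from above collapses just as we hoped (both should stem to "batteri"):
# Quick check: variants of the same root stem to the same token
print(stemmer.stem("battery"), stemmer.stem("batteries"))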
# Stem all words
words = body.split(" ")
stemwords = [stemmer.stem(wd) for wd in words]
body = " ".join(stemwords)
body
Encapsulate Preprocessing in a Function¶
All of the above steps can be collected into a function that spits out the final processed word salad.
def word_salad(body):
    '''Produce a word salad from email body.'''
    # Parse the HTML and pull out only the non-markup text
    soup = BeautifulSoup(body, 'html.parser')
    body = soup.get_text()
    # Count the number of HTML elements and specific link elements
    nhtml = len(soup.find_all())
    nlinks = len(soup.find_all("a"))
    # Sub in special strings for "counting"
    body = body + nhtml*" htmltag " + nlinks*" linktag "
    # Lowercase everything
    body = body.lower()
    # Replace all URLs with special strings
    regx = re.compile(r"(http|https)://[^\s]*")
    body, nhttps = regx.subn(repl=" httpaddr ", string=body)
    # Replace all email addresses with special strings
    regx = re.compile(r"\b[^\s]+@[^\s]+[.][^\s]+\b")
    body, nemails = regx.subn(repl=" emailaddr ", string=body)
    # Replace all numbers with special strings
    regx = re.compile(r"\b[\d.]+\b")
    body = regx.sub(repl=" number ", string=body)
    # Replace all $, ! and ? with special strings
    regx = re.compile(r"[$]")
    body = regx.sub(repl=" dollar ", string=body)
    regx = re.compile(r"[!]")
    body = regx.sub(repl=" exclammark ", string=body)
    regx = re.compile(r"[?]")
    body = regx.sub(repl=" questmark ", string=body)
    # Remove all other punctuation (replace with white space)
    regx = re.compile(r"([^\w\s]+)|([_-]+)")
    body = regx.sub(repl=" ", string=body)
    # Replace all blank lines and newlines with special strings
    # (blank lines first, so the "\n\n" pairs still exist to be found)
    regx = re.compile(r"\n\n")
    body = regx.sub(repl=" blankline ", string=body)
    regx = re.compile(r"\n")
    body = regx.sub(repl=" newline ", string=body)
    # Make all white space a single space
    regx = re.compile(r"\s+")
    body = regx.sub(repl=" ", string=body)
    # Remove any trailing or leading white space
    body = body.strip(" ")
    # Remove all useless stopwords
    bodywords = body.split(" ")
    keepwords = [word for word in bodywords if word not in stopwords.words('english')]
    # Stem all words
    stemmer = SnowballStemmer("english")
    stemwords = [stemmer.stem(wd) for wd in keepwords]
    body = " ".join(stemwords)
    return body
# Try out our function
body = get_body(spamfiles[179])
processed = word_salad(body)
processed
Building Our Corpus of Emails¶
Whatever algorithm we ultimately use for classification will require numeric feature vectors, so mapping each word salad to such a vector is the next main task. We'll start by building a corpus of the raw email bodies, that is, just a list of email body strings, and we'll build alongside it a list of the processed email body strings for our own inspection. These lists can later be fed into algorithms for vectorization.
emails_raw = ["email"]*len(hamfiles + spamfiles)        # Reserve in memory, faster than append
emails_processed = ["email"]*len(hamfiles + spamfiles)  # Reserve in memory, faster than append
y = [0]*len(hamfiles) + [1]*len(spamfiles)              # Ground truth vector
for idx, fpath in enumerate(hamfiles + spamfiles):
    body = get_body(fpath)            # Extract only the email body text
    emails_raw[idx] = body
    processed = word_salad(body)      # All preprocessing
    emails_processed[idx] = processed
# Pickle these objects for easier access later
with open("easyham_and_spam_corpus_raw_and_processed_and_y.pickle", "wb") as myfile:
    pickle.dump([emails_raw, emails_processed, y], myfile)
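For reference, getting these objects back out in Part II is just the reverse operation (a minimal sketch, assuming the same working directory):
# Reload the pickled corpus later with:
with open("easyham_and_spam_corpus_raw_and_processed_and_y.pickle", "rb") as myfile:
    emails_raw, emails_processed, y = pickle.load(myfile)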
We're now in a position to start mapping emails into a numeric vector space. It turns out there are a lot of ways to do this, and the proper ML approach would be to search over them using cross-validation to identify the best one. That is the subject of Spam Part II, where we'll explore different vectorization schemes and feed the resulting vectors into a Support Vector Machine to classify each email.
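As a teaser for Part II, here is a minimal sketch of that last step, assuming scikit-learn is available (the vectorization scheme and SVM settings there will be chosen by cross-validation rather than hard-coded like this):
# Minimal sketch: bag-of-words vectorization + a linear SVM (details deferred to Part II)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

train_txt, test_txt, y_train, y_test = train_test_split(emails_processed, y, test_size=0.3, random_state=0)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_txt)   # learn the vocabulary from the training emails
X_test = vectorizer.transform(test_txt)         # map the test emails into the same space
clf = LinearSVC()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                # fraction of held-out emails classified correctly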