Spam Part I: NLP with re, nltk and BeautifulSoup

Posted by Sonya Sawtelle on Sun 04 December 2016

In which the contents of numerous spam folders gradually erodes my faith in humanity.

Week 7 of Andrew Ng's ML course on Coursera introduces the Support Vector Machine algorithm and challenges us to use it for classifying email as spam or ham. Here I use the SpamAssassin public corpus to build an SVM spam email classifier in order to learn about the relevant python tools. Part I focuses on the preprocessing of individual emails while Part II focuses on the actual classifier.

Tools Covered:

  • re for regular expressions to do Natural Language Processing (NLP)
  • stopwords text corpus for removing information-poor words in NLP
  • SnowballStemmer for stemming text in NLP
  • BeautifulSoup for HTML parsing
In [1]:
# Set up environment
import scipy.io
import matplotlib.pyplot as plt
import matplotlib 
import pandas as pd
import numpy as np
import pickle
import os
import re

from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

import snips as snp  # my snippets
snp.prettyplot(matplotlib)  # my aesthetic preferences for plotting
%matplotlib inline
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sonya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [2]:
cd hw-wk7-spam-preprocessing
C:\Users\Sonya\Box Sync\Projects\course-machine-learning\hw-wk7-spam-preprocessing

Quick Look at the Data

I'm going to pull a set of spam and "ham" (non-spam) emails from the SpamAssassin public corpus data sets. Each email is stored a a plain text file with the email header information and the email body including HTML markup if applicable.

In [3]:
# Setup for accessing all the spam and ham text files
from os import listdir
from os.path import isfile, join

spampath = join(os.getcwd(), "spam")
spamfiles = [join(spampath, fname) for fname in listdir(spampath)]

hampath = join(os.getcwd(), "easy_ham")
hamfiles = [join(hampath, fname) for fname in listdir(hampath)]

Example Formatted File

Here is what an email would look like if viewed with proper formatting, like in your browser.

In [4]:
with open(hamfiles[3]) as myfile:
    for line in myfile.readlines():
        print(line)
From irregulars-admin@tb.tf  Thu Aug 22 14:23:39 2002

Return-Path: <irregulars-admin@tb.tf>

Delivered-To: zzzz@localhost.netnoteinc.com

Received: from localhost (localhost [127.0.0.1])

	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66

	for <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)

Received: from phobos [127.0.0.1]

	by localhost with IMAP (fetchmail-5.9.0)

	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)

Received: from web.tb.tf (route-64-131-126-36.telocity.com

    [64.131.126.36]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id

    g7MDGOZ07922 for <zzzz-irr@example.com>; Thu, 22 Aug 2002 14:16:24 +0100

Received: from web.tb.tf (localhost.localdomain [127.0.0.1]) by web.tb.tf

    (8.11.6/8.11.6) with ESMTP id g7MDP9I16418; Thu, 22 Aug 2002 09:25:09

    -0400

Received: from red.harvee.home (red [192.168.25.1] (may be forged)) by

    web.tb.tf (8.11.6/8.11.6) with ESMTP id g7MDO4I16408 for

    <irregulars@tb.tf>; Thu, 22 Aug 2002 09:24:04 -0400

Received: from prserv.net (out4.prserv.net [32.97.166.34]) by

    red.harvee.home (8.11.6/8.11.6) with ESMTP id g7MDFBD29237 for

    <irregulars@tb.tf>; Thu, 22 Aug 2002 09:15:12 -0400

Received: from [209.202.248.109]

    (slip-32-103-249-10.ma.us.prserv.net[32.103.249.10]) by prserv.net (out4)

    with ESMTP id <2002082213150220405qu8jce>; Thu, 22 Aug 2002 13:15:07 +0000

MIME-Version: 1.0

X-Sender: @ (Unverified)

Message-Id: <p04330137b98a941c58a8@[209.202.248.109]>

To: undisclosed-recipient: ;

From: Monty Solomon <monty@roscom.com>

Content-Type: text/plain; charset="us-ascii"

Subject: [IRR] Klez: The Virus That  Won't Die

Sender: irregulars-admin@tb.tf

Errors-To: irregulars-admin@tb.tf

X-Beenthere: irregulars@tb.tf

X-Mailman-Version: 2.0.6

Precedence: bulk

List-Help: <mailto:irregulars-request@tb.tf?subject=help>

List-Post: <mailto:irregulars@tb.tf>

List-Subscribe: <http://tb.tf/mailman/listinfo/irregulars>,

    <mailto:irregulars-request@tb.tf?subject=subscribe>

List-Id: New home of the TBTF Irregulars mailing list <irregulars.tb.tf>

List-Unsubscribe: <http://tb.tf/mailman/listinfo/irregulars>,

    <mailto:irregulars-request@tb.tf?subject=unsubscribe>

List-Archive: <http://tb.tf/mailman/private/irregulars/>

Date: Thu, 22 Aug 2002 09:15:25 -0400



Klez: The Virus That Won't Die

 

Already the most prolific virus ever, Klez continues to wreak havoc.



Andrew Brandt

>>From the September 2002 issue of PC World magazine

Posted Thursday, August 01, 2002





The Klez worm is approaching its seventh month of wriggling across 

the Web, making it one of the most persistent viruses ever. And 

experts warn that it may be a harbinger of new viruses that use a 

combination of pernicious approaches to go from PC to PC.



Antivirus software makers Symantec and McAfee both report more than 

2000 new infections daily, with no sign of letup at press time. The 

British security firm MessageLabs estimates that 1 in every 300 

e-mail messages holds a variation of the Klez virus, and says that 

Klez has already surpassed last summer's SirCam as the most prolific 

virus ever.



And some newer Klez variants aren't merely nuisances--they can carry 

other viruses in them that corrupt your data.



...



http://www.pcworld.com/news/article/0,aid,103259,00.asp

_______________________________________________

Irregulars mailing list

Irregulars@tb.tf

http://tb.tf/mailman/listinfo/irregulars



Example Raw File

Now we want to see what the actual strings look like that we will do all our processing on.

In [5]:
with open(hamfiles[3], "r") as myfile:
    lines = myfile.readlines()
print(lines)
['From irregulars-admin@tb.tf  Thu Aug 22 14:23:39 2002\n', 'Return-Path: <irregulars-admin@tb.tf>\n', 'Delivered-To: zzzz@localhost.netnoteinc.com\n', 'Received: from localhost (localhost [127.0.0.1])\n', '\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66\n', '\tfor <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)\n', 'Received: from phobos [127.0.0.1]\n', '\tby localhost with IMAP (fetchmail-5.9.0)\n', '\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)\n', 'Received: from web.tb.tf (route-64-131-126-36.telocity.com\n', '    [64.131.126.36]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id\n', '    g7MDGOZ07922 for <zzzz-irr@example.com>; Thu, 22 Aug 2002 14:16:24 +0100\n', 'Received: from web.tb.tf (localhost.localdomain [127.0.0.1]) by web.tb.tf\n', '    (8.11.6/8.11.6) with ESMTP id g7MDP9I16418; Thu, 22 Aug 2002 09:25:09\n', '    -0400\n', 'Received: from red.harvee.home (red [192.168.25.1] (may be forged)) by\n', '    web.tb.tf (8.11.6/8.11.6) with ESMTP id g7MDO4I16408 for\n', '    <irregulars@tb.tf>; Thu, 22 Aug 2002 09:24:04 -0400\n', 'Received: from prserv.net (out4.prserv.net [32.97.166.34]) by\n', '    red.harvee.home (8.11.6/8.11.6) with ESMTP id g7MDFBD29237 for\n', '    <irregulars@tb.tf>; Thu, 22 Aug 2002 09:15:12 -0400\n', 'Received: from [209.202.248.109]\n', '    (slip-32-103-249-10.ma.us.prserv.net[32.103.249.10]) by prserv.net (out4)\n', '    with ESMTP id <2002082213150220405qu8jce>; Thu, 22 Aug 2002 13:15:07 +0000\n', 'MIME-Version: 1.0\n', 'X-Sender: @ (Unverified)\n', 'Message-Id: <p04330137b98a941c58a8@[209.202.248.109]>\n', 'To: undisclosed-recipient: ;\n', 'From: Monty Solomon <monty@roscom.com>\n', 'Content-Type: text/plain; charset="us-ascii"\n', "Subject: [IRR] Klez: The Virus That  Won't Die\n", 'Sender: irregulars-admin@tb.tf\n', 'Errors-To: irregulars-admin@tb.tf\n', 'X-Beenthere: irregulars@tb.tf\n', 'X-Mailman-Version: 2.0.6\n', 'Precedence: bulk\n', 'List-Help: <mailto:irregulars-request@tb.tf?subject=help>\n', 'List-Post: <mailto:irregulars@tb.tf>\n', 'List-Subscribe: <http://tb.tf/mailman/listinfo/irregulars>,\n', '    <mailto:irregulars-request@tb.tf?subject=subscribe>\n', 'List-Id: New home of the TBTF Irregulars mailing list <irregulars.tb.tf>\n', 'List-Unsubscribe: <http://tb.tf/mailman/listinfo/irregulars>,\n', '    <mailto:irregulars-request@tb.tf?subject=unsubscribe>\n', 'List-Archive: <http://tb.tf/mailman/private/irregulars/>\n', 'Date: Thu, 22 Aug 2002 09:15:25 -0400\n', '\n', "Klez: The Virus That Won't Die\n", ' \n', 'Already the most prolific virus ever, Klez continues to wreak havoc.\n', '\n', 'Andrew Brandt\n', '>>From the September 2002 issue of PC World magazine\n', 'Posted Thursday, August 01, 2002\n', '\n', '\n', 'The Klez worm is approaching its seventh month of wriggling across \n', 'the Web, making it one of the most persistent viruses ever. And \n', 'experts warn that it may be a harbinger of new viruses that use a \n', 'combination of pernicious approaches to go from PC to PC.\n', '\n', 'Antivirus software makers Symantec and McAfee both report more than \n', '2000 new infections daily, with no sign of letup at press time. The \n', 'British security firm MessageLabs estimates that 1 in every 300 \n', 'e-mail messages holds a variation of the Klez virus, and says that \n', "Klez has already surpassed last summer's SirCam as the most prolific \n", 'virus ever.\n', '\n', "And some newer Klez variants aren't merely nuisances--they can carry \n", 'other viruses in them that corrupt your data.\n', '\n', '...\n', '\n', 'http://www.pcworld.com/news/article/0,aid,103259,00.asp\n', '_______________________________________________\n', 'Irregulars mailing list\n', 'Irregulars@tb.tf\n', 'http://tb.tf/mailman/listinfo/irregulars\n', '\n']

Some preliminary thoughts: The first line of every file is the most basic header info about the originating address and time the email was sent. There follows a section of keyword-value pairs in the form keyword: value\n. Finally, the body of each email is separated from the meta info by two newline characters \n\n. Note that some of the email bodies contain HTML.

Easy Mode: Email Body as a Bag of Words

The first thing I'll try is just doing some NLP on only the email bodies, ignoring all the header info. Ultimately we need to represent each email in some numeric feature space in order to feed it into a classifier algorithm: this is where the Bag of Words model comes in. Each email is represented by a vector which quantifies the presence of specific vocab words in that email... that means we need to do some preprocessing.

First write a function that grabs only the body lines of a single email:

In [4]:
def get_body(fpath):
    '''Get email body lines from fpath using first occurence of empty line.'''
    with open(fpath, "r") as myfile:
        try: 
            lines = myfile.readlines()
            idx = lines.index("\n") # only grabs first instance
            return "".join(lines[idx:])
        except: 
            print("Couldn't decode file %s" %(fpath,))
In [5]:
# Test it out 
body= get_body(hamfiles[3])
In [6]:
body  # This is the actual string we are going to be processing
Out[6]:
"\nKlez: The Virus That Won't Die\n \nAlready the most prolific virus ever, Klez continues to wreak havoc.\n\nAndrew Brandt\n>>From the September 2002 issue of PC World magazine\nPosted Thursday, August 01, 2002\n\n\nThe Klez worm is approaching its seventh month of wriggling across \nthe Web, making it one of the most persistent viruses ever. And \nexperts warn that it may be a harbinger of new viruses that use a \ncombination of pernicious approaches to go from PC to PC.\n\nAntivirus software makers Symantec and McAfee both report more than \n2000 new infections daily, with no sign of letup at press time. The \nBritish security firm MessageLabs estimates that 1 in every 300 \ne-mail messages holds a variation of the Klez virus, and says that \nKlez has already surpassed last summer's SirCam as the most prolific \nvirus ever.\n\nAnd some newer Klez variants aren't merely nuisances--they can carry \nother viruses in them that corrupt your data.\n\n...\n\nhttp://www.pcworld.com/news/article/0,aid,103259,00.asp\n_______________________________________________\nIrregulars mailing list\nIrregulars@tb.tf\nhttp://tb.tf/mailman/listinfo/irregulars\n\n"
In [28]:
print(body)  # This is what it would look like properly displayed
Klez: The Virus That Won't Die
 
Already the most prolific virus ever, Klez continues to wreak havoc.

Andrew Brandt
>>From the September 2002 issue of PC World magazine
Posted Thursday, August 01, 2002


The Klez worm is approaching its seventh month of wriggling across 
the Web, making it one of the most persistent viruses ever. And 
experts warn that it may be a harbinger of new viruses that use a 
combination of pernicious approaches to go from PC to PC.

Antivirus software makers Symantec and McAfee both report more than 
2000 new infections daily, with no sign of letup at press time. The 
British security firm MessageLabs estimates that 1 in every 300 
e-mail messages holds a variation of the Klez virus, and says that 
Klez has already surpassed last summer's SirCam as the most prolific 
virus ever.

And some newer Klez variants aren't merely nuisances--they can carry 
other viruses in them that corrupt your data.

...

http://www.pcworld.com/news/article/0,aid,103259,00.asp
_______________________________________________
Irregulars mailing list
Irregulars@tb.tf
http://tb.tf/mailman/listinfo/irregulars


Preprocessing Plan of Attack (order matters)

The order of steps in text processing matters a lot if you are trying to extract other features alongside a simple "Bag of Words" or "Word Salad" model. For instance, if you want to count the number of question marks in the email text then you should probably do it before removing all punctuation, but after replacing all http addresses (which sometimes contain special characters).

Here is a rough outline of all the steps we'll take to get from a messy, marked-up raw text to a delicious word salad:

  • Strip any HTML tags and leave only text content (also count HTML tags)
  • Lowercase everything
  • Strip all email and web addresses (also count them)
  • Strip all dollar signs and numbers (also count them)
  • Strip away all other punctuation (also count exclamation and question marks)
  • Standardize all white space to single space (also count newlines and blank lines)
  • Count the total number of words in our word salad
  • Strip away all useless "Stopwords" (like "a", "the", "at")
  • Stem all the words down to their root to simplify

Now, when I say count what I really mean is substitute each occurrence with some fixed string: like every web address gets replaced with "httpaddr". That way when we ultimately convert each email to a vector of word counts, we'll get a feature that reflects the occurrence of the word "httpaddr".

Parsing HTML

Some of the email bodies contain HTML formatting - the amount of such formatting might be a helpful feature, but the tags themselves we want to strip away, along with some other HTML shorthand. Let's no reinvent the wheel though: a package called beatiful soup implements a great HTML parser. The parsed object can be interrogated in various ways as it contains all the information about the HTML structure of the original document. You should check out the official soup docs.

In [23]:
# Parse the email body into HTML elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(body, 'html.parser')
In [24]:
# Pull out only the non-markup tex
body = soup.get_text()

# Count the number of HTML elements and specific link elements
nhtml = len(soup.find_all())
nlinks = len(soup.find_all("a"))
# Sub in special strings for "counting"
body = body + nhtml*" htmltag " + nlinks*" linktag "

Lowercasing

We don't expect whether a word is capitalized or not to reflect some deep difference in tone or meaning, so we'll lowercase everything.

In [25]:
# Lowercase everything
body = body.lower()

Finding email and web addresses

We'll find and count the appearances of email and web addresses, and then replace each one with blank space. A very useful tool for all language processing is the regular expression, which is housed in the re module of the python standard lib. For more info you can refer to my brief but hopefully edifying overview of regexes in python.

In [26]:
# Replace all URLs with special strings
regx = re.compile(r"(http|https)://[^\s]*")
body, nhttps = regx.subn(repl=" httpaddr ", string=body)

# Replace all email addresses with special strings
regx = re.compile(r"\b[^\s]+@[^\s]+[.][^\s]+\b")
body, nemails = regx.subn(repl=" emailaddr ", string=body)
In [27]:
body
Out[27]:
'\nklez  the virus that won t die\n \nalready the most prolific virus ever  klez continues to wreak havoc \n\nandrew brandt\n from the september  number  issue of pc world magazine\nposted thursday  august  number    number \n\n\nthe klez worm is approaching its seventh month of wriggling across \nthe web  making it one of the most persistent viruses ever  and \nexperts warn that it may be a harbinger of new viruses that use a \ncombination of pernicious approaches to go from pc to pc \n\nantivirus software makers symantec and mcafee both report more than \n number  new infections daily  with no sign of letup at press time  the \nbritish security firm messagelabs estimates that  number  in every  number  \ne mail messages holds a variation of the klez virus  and says that \nklez has already surpassed last summer s sircam as the most prolific \nvirus ever \n\nand some newer klez variants aren t merely nuisances they can carry \nother viruses in them that corrupt your data \n\n \n\n httpaddr \n \nirregulars mailing list\n emailaddr \n httpaddr \n\n'

Finding numbers, dollar signs, and punctuation

We'd like to know the frequency of numbers and any punctuation which carries a tones, like exclamation marks, question marks, and dollar signs. After replacing the things we care about, we'll remove all other punctuation to get us closer to a pure bag of words.

In [28]:
# Replace all numbers with special strings
regx = re.compile(r"\b[\d.]+\b")
body = regx.sub(repl=" number ", string=body)

# Replace all $, ! and ? with special strings
regx = re.compile(r"[$]")
body = regx.sub(repl=" dollar ", string=body)
regx = re.compile(r"[!]")
body = regx.sub(repl=" exclammark ", string=body)
regx = re.compile(r"[?]")
body = regx.sub(repl=" questmark ", string=body)

# Remove all other punctuation (replace with white space)
regx = re.compile(r"([^\w\s]+)|([_-]+)")  
body = regx.sub(repl=" ", string=body)
In [29]:
body
Out[29]:
'\nklez  the virus that won t die\n \nalready the most prolific virus ever  klez continues to wreak havoc \n\nandrew brandt\n from the september  number  issue of pc world magazine\nposted thursday  august  number    number \n\n\nthe klez worm is approaching its seventh month of wriggling across \nthe web  making it one of the most persistent viruses ever  and \nexperts warn that it may be a harbinger of new viruses that use a \ncombination of pernicious approaches to go from pc to pc \n\nantivirus software makers symantec and mcafee both report more than \n number  new infections daily  with no sign of letup at press time  the \nbritish security firm messagelabs estimates that  number  in every  number  \ne mail messages holds a variation of the klez virus  and says that \nklez has already surpassed last summer s sircam as the most prolific \nvirus ever \n\nand some newer klez variants aren t merely nuisances they can carry \nother viruses in them that corrupt your data \n\n \n\n httpaddr \n \nirregulars mailing list\n emailaddr \n httpaddr \n\n'

Standardizing White Space and Total Word Count

Standardizing white space is an important step, as it makes tokenizing the email into words straightforward. I do this as a last step since some of my preprocessing creates extra whitespace. The number of carriage returns (\n) and the number of blank lines (\n\n) might be predictive so we'll replace those with special strings.

In [30]:
# Replace all newlines and blanklines with special strings
regx = re.compile(r"\n")
body = regx.sub(repl=" newline ", string=body)
regx = re.compile(r"\n\n")
body = regx.sub(repl=" blankline ", string=body)

# Make all white space a single space
regx = re.compile(r"\s+")
body = regx.sub(repl=" ", string=body)

# Remove any trailing or leading white space
body = body.strip(" ")
In [31]:
body
Out[31]:
'newline klez the virus that won t die newline newline already the most prolific virus ever klez continues to wreak havoc newline newline andrew brandt newline from the september number issue of pc world magazine newline posted thursday august number number newline newline newline the klez worm is approaching its seventh month of wriggling across newline the web making it one of the most persistent viruses ever and newline experts warn that it may be a harbinger of new viruses that use a newline combination of pernicious approaches to go from pc to pc newline newline antivirus software makers symantec and mcafee both report more than newline number new infections daily with no sign of letup at press time the newline british security firm messagelabs estimates that number in every number newline e mail messages holds a variation of the klez virus and says that newline klez has already surpassed last summer s sircam as the most prolific newline virus ever newline newline and some newer klez variants aren t merely nuisances they can carry newline other viruses in them that corrupt your data newline newline newline newline httpaddr newline newline irregulars mailing list newline emailaddr newline httpaddr newline newline'

This is a true bag of words, so now we can get our total word count to use in normalizing things like number of exclamation marks:

In [32]:
nwords = len(body.split(" "))
nwords
Out[32]:
200

Remove Stop Words with nltk

Each email is going to have lots of common words which are the "glue" of the english language but don't carry much information. These are called Stop Words and we will go ahead and strip them out from the start.

The Natural Language Tool Kit module (ntlk) defines a ton of functionality for processing text, including a corpus of these so-called stop words.

In [33]:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sonya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[33]:
True
In [34]:
len(stopwords.words("english"))
Out[34]:
153
In [35]:
stopwords.words("english")[0:10]
Out[35]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']
In [36]:
# Remove all useless stopwords
bodywords = body.split(" ")
keepwords = [word for word in bodywords if word not in stopwords.words('english')]
body = " ".join(keepwords)
In [37]:
body
Out[37]:
'newline klez virus die newline newline already prolific virus ever klez continues wreak havoc newline newline andrew brandt newline september number issue pc world magazine newline posted thursday august number number newline newline newline klez worm approaching seventh month wriggling across newline web making one persistent viruses ever newline experts warn may harbinger new viruses use newline combination pernicious approaches go pc pc newline newline antivirus software makers symantec mcafee report newline number new infections daily sign letup press time newline british security firm messagelabs estimates number every number newline e mail messages holds variation klez virus says newline klez already surpassed last summer sircam prolific newline virus ever newline newline newer klez variants merely nuisances carry newline viruses corrupt data newline newline newline newline httpaddr newline newline irregulars mailing list newline emailaddr newline httpaddr newline newline'

Stemming with nltk

This classifier is trying to determine the intent or tone of an email (spam vs. ham) by virtue of the specific words in that email. But we don't expect that a variation on the same root word, like "battery" versus "batteries", carries much difference in intent or tone. In "stemming" we replace all the variants of each root with the root itself: this reduces the complexity of the email representation without really reducing the information. The nltk module has several options for out-of-the-box stemmers.

In [38]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

stemmer.stem("generously")
Out[38]:
'generous'
In [39]:
# Stem all words
words = body.split(" ")
stemwords = [stemmer.stem(wd) for wd in words]
body = " ".join(stemwords)
In [40]:
body
Out[40]:
'newlin klez virus die newlin newlin alreadi prolif virus ever klez continu wreak havoc newlin newlin andrew brandt newlin septemb number issu pc world magazin newlin post thursday august number number newlin newlin newlin klez worm approach seventh month wriggl across newlin web make one persist virus ever newlin expert warn may harbing new virus use newlin combin pernici approach go pc pc newlin newlin antivirus softwar maker symantec mcafe report newlin number new infect daili sign letup press time newlin british secur firm messagelab estim number everi number newlin e mail messag hold variat klez virus say newlin klez alreadi surpass last summer sircam prolif newlin virus ever newlin newlin newer klez variant mere nuisanc carri newlin virus corrupt data newlin newlin newlin newlin httpaddr newlin newlin irregular mail list newlin emailaddr newlin httpaddr newlin newlin'

Encapsulate Preprocessing in a Function

All of the above steps can into a function that spits out the final processed word salad.

In [41]:
def word_salad(body):
    '''Produce a word salad from email body.'''    
    # Parse HTML extract content only (but count tags)
    soup = BeautifulSoup(body, 'html.parser')
    body = soup.get_text()
    
    # Pull out only the non-markup tex
    body = soup.get_text()

    # Count the number of HTML elements and specific link elements
    nhtml = len(soup.find_all())
    nlinks = len(soup.find_all("a"))
    # Sub in special strings for "counting"
    body = body + nhtml*" htmltag " + nlinks*" linktag "
    
    # lowercase everything
    body = body.lower()
    
    # Replace all URLs with special strings
    regx = re.compile(r"(http|https)://[^\s]*")
    body, nhttps = regx.subn(repl=" httpaddr ", string=body)

    # Replace all email addresses with special strings
    regx = re.compile(r"\b[^\s]+@[^\s]+[.][^\s]+\b")
    body, nemails = regx.subn(repl=" emailaddr ", string=body)
    
    # Replace all numbers with special strings
    regx = re.compile(r"\b[\d.]+\b")
    body = regx.sub(repl=" number ", string=body)

    # Replace all $, ! and ? with special strings
    regx = re.compile(r"[$]")
    body = regx.sub(repl=" dollar ", string=body)
    regx = re.compile(r"[!]")
    body = regx.sub(repl=" exclammark ", string=body)
    regx = re.compile(r"[?]")
    body = regx.sub(repl=" questmark ", string=body)

    # Remove all other punctuation (replace with white space)
    regx = re.compile(r"([^\w\s]+)|([_-]+)")  
    body = regx.sub(repl=" ", string=body)
    
    # Replace all newlines and blanklines with special strings
    regx = re.compile(r"\n")
    body = regx.sub(repl=" newline ", string=body)
    regx = re.compile(r"\n\n")
    body = regx.sub(repl=" blankline ", string=body)

    # Make all white space a single space
    regx = re.compile(r"\s+")
    body = regx.sub(repl=" ", string=body)

    # Remove any trailing or leading white space
    body = body.strip(" ")
 
    # Remove all useless stopwords
    bodywords = body.split(" ")
    keepwords = [word for word in bodywords if word not in stopwords.words('english')]

    # Stem all words
    stemmer = SnowballStemmer("english")
    stemwords = [stemmer.stem(wd) for wd in keepwords]
    body = " ".join(stemwords)

    return body
In [42]:
# Try out our function
body = get_body(spamfiles[179])
processed = word_salad(body)
processed
Out[42]:
'newlin newlin newlin hello emailaddr newlin newlin seen nbc cbs cnn even oprah exclammark health newlin discoveri actual revers age burn fat newlin without diet exercis exclammark proven discoveri even newlin report new england journal medicin newlin forget age diet forev exclammark guarante exclammark newlin newlin reduc bodi fat build lean muscl without exercis exclammark newlin enhac sexual perform newlin remov wrinkl cellulit newlin lower blood pressur improv cholesterol profil newlin improv sleep vision memori newlin restor hair color growth newlin strengthen immun system newlin increas energi cardiac output newlin turn back bodi biolog time clock number number year newlin number month usag exclammark exclammark exclammark newlin free inform get free newlin number month suppli hgh click newlin receiv email subscrib newlin opt america mail list newlin remov relat maillist newlin newlin click newlin newlin newlin htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag htmltag linktag linktag'

Building Our Corpus of Emails

Whatever algorithm we ultimately use for classification will require numeric feature vectors, so mapping each word salad to such a vector is the next main task. We'll start by building a corpus of the raw email bodies, that is just a list of email body strings, and we'll build alongside it a list of the processed email body strings for our own inspection. These lists can later be fed into algorithms for vectorization.

In [43]:
emails_raw =  ["email"]*len(hamfiles + spamfiles)  # Reserve in memory, faster than append
emails_processed =  ["email"]*len(hamfiles + spamfiles)  # Reserve in memory, faster than append
y = [0]*len(hamfiles) + [1]*len(spamfiles)  # Ground truth vector

for idx, fpath in enumerate(hamfiles + spamfiles):
    body = get_body(fpath)  # Extract only the email body text
    emails_raw[idx] = body
    processed = word_salad(body)  # All preprocessing
    emails_processed[idx] = processed
In [44]:
# Pickle these objects for easier access later
with open("easyham_and_spam_corpus_raw_and_processed_and_y.pickle", "wb") as myfile:
    pickle.dump([emails_raw, emails_processed, y], myfile)

We're now in position to start mapping emails into a numeric vector space. It turns out there are a lot of ways in which to do this and the proper ML approach would be to search over this space using cross-validation to identify the best approach. This is the subject of Spam Part II. We'll explore different vectorization schemes and feed these vectors into a Support Vector Machine to classify each email.