In which I learn how not to set my computer on fire by trying to read a 3GB CSV file.¶
Here I'll talk about some tricks for working with larger-than-memory data sets in Python using our good friend pandas, as well as the standard library module sqlite3 for interfacing with local (on your machine) databases. If you have a laptop with 4 or even 8 GB of RAM you could find yourself in this position with even just a normal Kaggle competition (I certainly did, hence this post).
Tools covered:
- Pandas nrows and skiprows kwargs for read_csv
- Pandas dtypes and astype for "compressing" features
- sqlite relational database software and super basic SQL syntax
- sqlite3 Python bindings for sqlite
- Pandas read_sql_query for pulling from a database to a dataframe
- Pandas to_sql for writing a dataframe to a database
The Example Data Set¶
For the purposes of this post I'm going to use a laughably small .csv just to get my point across. The data is a set of bank customers with some demographic information, stored in data.csv, which has about 15 rows (pretend I said 15 million). Ok, first things first.
import pandas as pd
import numpy as np
import sqlite3
Peek Inside a Large CSV¶
If you are starting with a really huge file, often there is no piece of software that will let you just straight up load it. This is frustrating because obviously you want an idea of what the file contains before you start doing anything with it! Luckily you can use pandas to take a quick look at the beginning or the middle of the CSV.
# Look at the first few rows of the CSV file
pd.read_csv("data.csv", nrows=2).head()
# Peek at the middle of the CSV file
pd.read_csv("data.csv", nrows=2, skiprows=7, header=None).head()
Reading in A Large CSV Chunk-by-Chunk¶
Pandas provides a convenient handle for reading in chunks of a large CSV file one at a time. By setting the chunksize kwarg for read_csv you will get an iterator over these chunks, each one being a dataframe with the same header (column names). This can sometimes let you preprocess each chunk down to a smaller footprint by e.g. dropping columns or changing datatypes.
# Read in the data chunk by chunk
for chunk in pd.read_csv("data.csv", chunksize=4):
    print(chunk["last_name"].iloc[0])  # Each chunk gives you a dataframe with the same header :)
If instead of having one large csv file you have a crap ton of small ones, then you'll want to read each one into a dataframe and combine those frames. In this case, remember that append is an expensive operation for dataframes: it is preferable to build up a long list of your dataframes and then execute concat once, as sketched below. A great discussion of this is here.
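If you want to see that pattern spelled out, here is a minimal sketch (the data_part_*.csv file names are made up for illustration):
# Combine many small CSVs with a single concat at the end
import glob

frames = []
for filename in glob.glob("data_part_*.csv"):  # hypothetical file names
    frames.append(pd.read_csv(filename))
df_combined = pd.concat(frames, ignore_index=True)  # one concat instead of repeated appends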
Being Smart About Data Types¶
When reading in a .csv pandas does a good job inferring appropriate datatypes for each column, but it is not memory-optimized and with a large file this can cost you.
A good approach is to read in a very large but manageable chunk of the data frame, check what dtypes pandas has defaulted to, and then inspect the columns of the dataframe to see if you can improve on the defaults. It's also informative to take a look at the dataframe's memory_usage.
df = pd.read_csv("data.csv", nrows=5)
df.head()
df.dtypes
df.memory_usage()
[Pandas calls strings "object" datatypes, more info on pandas data types is here. And here is the list of allowed numpy data types.]
A great example here is that we believe "active" is going to be just binary 1/0 values, but pandas wants to be safe so it has defaulted to np.int64 instead of the much smaller np.int8. Also, since np.int32 can represent positive and negative integers up to about 2 billion, we don't really need np.int64 for the "age" or even the "id" column.
You can cast columns into a particular datatype during the csv read by passing in a dictionary of {colname: dtype}, or you can process the dataframe chunks after the read with the astype method:
# Cast during the csv read
df = pd.read_csv("data.csv", nrows=5, dtype={"active": np.int8})
# ...or cast after reading
df["age"] = df["age"].astype(np.int16)
df["id"] = df["id"].astype(np.int16)
df.dtypes
df.memory_usage()
There is actually a lot more that can be done in terms of processing data during a pandas read_csv. Check out this SO answer and also the official read_csv docs for more info. You can do things like (see the sketch after this list):
- parse columns to datetime (parse_dates kwarg)
- apply arbitrary conversion functions to each column (converters kwarg)
- specify values that should be recognized as NaN (na_values kwarg)
- specify reading a selection of columns (usecols kwarg)
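Here is a hedged sketch combining a few of these kwargs on our toy file (the converter and the extra na_values strings are just illustrative choices, and the parse_dates comment refers to a hypothetical date column that data.csv doesn't actually have):
# Combine several read_csv kwargs in one call
df_small = pd.read_csv(
    "data.csv",
    usecols=["id", "age", "sex", "active"],           # only read the columns you need
    dtype={"active": np.int8},                        # downcast on the way in
    na_values=["NaN", "missing", ""],                 # strings to treat as NaN
    converters={"sex": lambda s: s.strip().upper()},  # arbitrary per-column function
)
# A date column would be parsed with e.g. pd.read_csv(..., parse_dates=["signup_date"])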
NaNs Spoil Everything¶
A lot of the above settings for read_csv will not work with columns that have NaN. The pandas representation for missing data, np.nan, is actually a float64 data type, so if you have e.g. a binary column of 0s and 1s then even one NaN will be enough to force it to be typed float64 rather than int8. In these cases you can let pandas use a large dtype when reading in a chunk, do your own processing of the column to impute the NaNs, and then use astype to convert to a smaller data type.
The na_values
kwarg during the csv read can convert things like the string "NaN" into a np.nan
object, which can save you from having a mixed-data column of e.g. numeric with string "NaN".
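Putting that together, a minimal sketch of the impute-then-downcast pattern (pretend the "active" column arrived with missing values and was therefore typed float64; imputing with 0 is just an illustrative choice):
# Let "active" come in with a large dtype, impute the NaNs ourselves, then downcast
chunk = pd.read_csv("data.csv", nrows=5, na_values=["NaN"])
chunk["active"] = chunk["active"].fillna(0)        # imputing missing as 0 is a modeling choice
chunk["active"] = chunk["active"].astype(np.int8)  # now the small dtype is safe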
Compressing Categorical Variables¶
In the above example we have the "sex" category, which is ultimately binary but here is represented by strings. To save on RAM you can transform categoricals into an integer mapping and then astype the column to np.int8 or np.int16.
A neat tool for doing categorical encoding is the scikit-learn LabelEncoder class. You "fit" this encoder by giving it data which contains all the possible unique values the category will take on. A word of warning though: LabelEncoder breaks on np.nan, so first impute these to e.g. the string "NaN" or some such.
# Impute the NaNs in the sex column, let's make them their own category:
df.fillna(value={"sex": "NAN"}, inplace=True)
df["sex"].unique()
# Create a label encoder and fit it
from sklearn.preprocessing import LabelEncoder
le_sex = LabelEncoder()
le_sex.fit(df["sex"].unique())
le_sex.classes_ # Check which classes the encoder learned (F, M, and our imputed NAN)
Once the encoder has been fitted you can use it to "transform" this feature's category values into integers.
# Example of the transform
print(le_sex.transform(["M", "M", "NAN", "F"]))
# Transform the sex categories into integers and then change type
df["sex"] = le_sex.transform(df["sex"])
df["sex"] = df["sex"].astype(np.int8)
df["sex"]
Sometimes all of this bending over backwards with chunks and data types still isn't sufficient to let you work with the full data set in RAM. In such cases you can consider using a local SQL database.
How Does a Database Help With Memory?¶
Say you have a .csv file and you want to extract from it only the rows with a user ID having a certain value. The only way to do that is to look through the entire file line by line (read through its bytestream): a .csv file is simply not a data structure you can interact with easily. [I found this SO answer enlightening regarding files just being byte streams.]
If the file is small enough we can read the .csv
into a data structure that is nice to interact with, like a pandas
DataFrame
, which lives in your working memory (RAM). But if the data set is very large then you instead need a data structure that lives on your disk rather than in RAM but is designed to still be easy and fast to interact with. The answer to this riddle is a database.
Intro to Local Databases with sqlite3¶
Sqlite is lean relational database software that runs locally on your PC (rather than on a remote server). Each database is stored as a single file on disk. Sqlite understands most SQL language syntax, although some things are omitted and other things added. The sqlite3 module in the Python Standard Library provides the functionality for working with Sqlite databases within Python.
Preliminarily, a database is a collection of tables, each of which is like a spreadsheet with rows and columns. Fundamentally you must be able to Create (put new data into a table), Read (pull data out of a table), Update (modify existing data in a table), and Delete (drop data from a table).
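Just to put concrete syntax to those four verbs, here is a throwaway sketch using a scratch in-memory database and a made-up "pets" table (we'll set up the real database more carefully below):
# A toy illustration of Create/Read/Update/Delete against a scratch table
import sqlite3

scratch = sqlite3.connect(":memory:")  # throwaway in-memory database
c = scratch.cursor()
c.execute("CREATE TABLE pets (name TEXT, age INTEGER);")
c.execute("INSERT INTO pets VALUES ('Rex', 3);")           # Create
c.execute("SELECT * FROM pets;")                           # Read
print(c.fetchall())
c.execute("UPDATE pets SET age = 4 WHERE name = 'Rex';")   # Update
c.execute("DELETE FROM pets WHERE name = 'Rex';")          # Delete
scratch.close()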
[If you want to physically browse through a sqlite DB there is software for that - DB Browser for Sqlite is a good one.]
Create sqlite Database and Connect to it with Python¶
Within Python, databases are represented through connection objects which you create. To actually interact with the database you need a cursor object, to which we pass SQL commands. This is super easy to set up:
import sqlite3
connex = sqlite3.connect("all_your_base.db") # Opens file if exists, else creates file
cur = connex.cursor() # This object lets us actually send messages to our DB and receive results
Read .csv into Database by Chunks¶
With a big file, don't try to import the whole thing into your DB; use an intermediate tool that chunks the CSV up and then load each chunk into the DB. There are several tools for this, but I'll stick with the pandas df.to_sql method.
Remember how a database is a collection of tables? You must specify both the connection object, which identifies which database you are connected to, and the identifier for the particular table within the database that you want to write to. If your database is empty it will create a table from the name you pass.
for chunk in pd.read_csv("data.csv", chunksize=4):
    chunk.to_sql(name="data", con=connex, if_exists="append", index=False)  # "name" is the name of the table
    print(chunk.iloc[0, 1])
One "gotcha" with to_sql
is that by default it will write the pandas dataframe index to the database as it's own column. Since the index is usually just incidental, remember to set the kwarg index=False
to prevent this.
The Briefest Possible Introduction to SQL¶
You probably want to know something about the SQL language at this point, before we start throwing queries around willy nilly. Since the internet overfloweth with useful SQL tutorials I won't bother with that here - I'll just give the flavor of it.
Here is a SQL command example for creating a table
called "person" which houses general information about people:
sql = "CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, age INTEGER);"
SQL commands are usually pretty easy to interpret. We are creating a table called person which has several columns, each of which has its type specified. Some things of note:
- SQL commands are terminated by a semicolon
- The convention is to UPPERCASE the words which are SQL keywords (as opposed to e.g. a column name)
- The PRIMARY KEY keyword says that this column is the basis of the index for looking things up in this table. That means finding things by "id" will be very fast. You can make additional indices based on different columns also.
Let's look at another example that fetches the first_name and last_name columns from the "person" table of the database. (I mean, we just created it so it's obviously empty, but pretend it's not.)
sql = "SELECT first_name, last_name FROM person;"
One more: let's add a new column to this table to hold what town the person lives in. Notice we can specify a default value to be applied to all existing entries:
sql = "ALTER TABLE person ADD COLUMN town TEXT DEFAULT 'North Pole';"
So now you know SQL, so easy! OK you don't know SQL, but you can probably recognize the cadence of the queries, and you won't be intimidated by the weird mixture of UPPER and lower case...
Actually Doing Stuff With Your Database!¶
Query Your DB with execute¶
When you send a SQL query command via your cursor, the returned results are stored in the cursor object itself and are iterable. The cursor's fetchone method acts like next() to grab the next value, while the fetchall method pulls all the remaining values until the results are exhausted.
sql = "SELECT DISTINCT sex FROM data;" # A SQL command to get unique values from the sex column
cur.execute(sql) # The cursor gives this command to our DB, the results are now available in cur
cur.fetchone() # Pop the first result of the query like `next()`
cur.fetchall() # Pop the remaining results of the query
# The cursor itself is iterable, so you can also loop through the results of execute.
for result in cur.execute(sql):
    print(result)
Pull DataFrames from your DB with pandas¶
Pandas actually has a lot of built-in functionality that makes it easy to shuttle data between DataFrames and SQL databases - you just pass it the connection object and the command string. This is soooo convenient - I heart pandas :)
Here is a simple row-selection operation that reads the results right into a dataframe.
ids = [494724, 102001]
ids = [str(id) for id in ids]
str_matching = "(" + ",".join(ids) + ")" # Construct the string of SQL language
print(str_matching)
sql = "SELECT * FROM data WHERE id IN " + str_matching + ";"
print(sql)
df = pd.read_sql_query(sql, connex)
df.head()
Write DataFrames to your DB with pandas¶
Since you can pull a dataframe out of a DB, you would expect to be able to push one to it also. Let's make some new entries in a new dataframe and then write them to the database.
df_new = pd.DataFrame(columns=df.columns,
                      data={"id": [200100, 109333],
                            "last_name": ["Caster", "Spenny"],
                            "age": [81, 62],
                            "sex": ["F", "F"],
                            "acct_type": [np.nan, "savings"],
                            "active": [1, 1]})
df_new
df_new.to_sql(name="data", con=connex, if_exists="append", index=False)
The kwarg if_exists
tells pandas what to do if the table in the database already exists. We can double check the results of our push by reading out unique last names from the database:
sql = "SELECT DISTINCT last_name FROM data;" # A SQL command to get unique values from the sex column
cur.execute(sql) # The cursor gives this command to our DB, the results are now available in cur
cur.fetchall()
Improve Speed With a New Index¶
If you know you will be pulling records according to the value of a certain column(s) very frequently, make a new index for your database on that column. You can see the time difference even with a very tiny database.
sql2 = "CREATE INDEX id_idx ON data (id);"
cur.execute(sql2)
connex.commit()
Saving Your Changes and Disconnecting!¶
Whenever we would like to save changes that we have made to the database we must commit them, and once we are finished with the database we need to close it:
connex.commit()
connex.close()
We're Being Naughty With Parameter Passing¶
I've just been using vanilla Python string formatting to construct my SQL queries and then passing the string to the database. This is actually kinda frowned upon - you're instead supposed to use a parameter-substitution convention where the variable values to be filled in are marked with ? placeholders. Ideally you would do something like:
pd.read_sql_query("SELECT * FROM data WHERE ncodpers IN (?, ?);", con=connex, params=(100, 101))
This would fill in the question marks with whatever you provided for the params kwarg. This didn't work for me with the sqlite driver for some reason, so instead I had to do all my string formatting outside the read_sql_query call. Just thought you should know.
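For what it's worth, the qmark style does work when you execute through the sqlite3 cursor directly, so that's another way to stay safe. A small sketch, assuming the connection and cursor from above are still open (re-run sqlite3.connect and .cursor() if not):
# Parameter substitution through the raw sqlite3 cursor
sql = "SELECT * FROM data WHERE id IN (?, ?);"
cur.execute(sql, (200100, 109333))  # the tuple fills in the ? placeholders
print(cur.fetchall())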
Helpful Links¶
- http://www.datacarpentry.org/python-ecology-lesson/08-working-with-sql
- https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/
- http://sebastianraschka.com/Articles/2013_sqlite_database.html#results-and-conclusions
- http://sebastianraschka.com/Articles/2014_sqlite_in_python_tutorial.html#creating-unique-indexes