Large Data Files with Pandas and SQLite

Posted by Sonya Sawtelle on Sun 13 November 2016

In which I learn how not to set my computer on fire by trying to read a 3GB CSV file.

Here I'll talk about some tricks for working with larger-than-memory data sets in python using our good friend pandas as well as a standard lib module sqlite3 for interfacing local (on your machine) databases. If you have a laptop with 4 or even 8 GB of RAM you could find yourself in this position with even just a normal Kaggle competition (I certainly did, hence this post).

Tools covered:

  • Pandas nrows and skiprows kwargs for read_csv
  • Pandas dtypes and astype for "compressing" features
  • sqlite relational database software and super basic SQL syntax
  • sqlite3 Python bindings for sqlite
  • Pandas read_sql_query for pulling from a database to a dataframe
  • Pandas to_sql for writing a dataframe to a database

The Example Data Set

For the purposes of this post I'm going to use a laughably small .csv just to to get my point across. The data is a set of bank customers with some demographic information, stored in data.csv which has about 15 rows (pretend I said 15 million). Ok, first things first.

In [1]:
import pandas as pd
import numpy as np
import sqlite3

Peek Inside a Large CSV

If you are starting with a really huge file often there is no piece of software that will let you just straight up load it- this is frustrating because obviously before you start to do anything with the file you want an idea of what it contains! Luckily you can use pandas to take a quick look at the beginning or inside of the CSV.

In [2]:
# Look at the first few rows of the CSV file
pd.read_csv("data.csv", nrows=2).head()
Out[2]:
id last_name age sex acct_type active
0 310001 Primmer 39 F savings 0
1 442877 Conner 31 M checking 0
In [3]:
# Peek at the middle of the CSV file
pd.read_csv("data.csv", nrows=2, skiprows=7, header=None).head()
Out[3]:
0 1 2 3 4 5
0 102001 Proust 72 M savings 1
1 279300 Piper 31 M checking 1

Reading in A Large CSV Chunk-by-Chunk

Pandas provides a convenient handle for reading in chunks of a large CSV file one at time. By setting the chunksize kwarg for read_csv you will get a generator for these chunks, each one being a dataframe with the same header (column names). This can sometimes let you preprocess each chunk down to a smaller footprint by e.g. dropping columns or changing datatypes.

In [4]:
# Read in the data chunk by chunk
for chunk in pd.read_csv("data.csv", chunksize=4):
    print(chunk.loc[0, "last_name"])  # Each chunk gives you a dataframe with the same header :)
Primmer
Sater
Smith

If instead of having one large csv file you have a crap ton of small ones then you'll want to read each one into a dataframe and combine those frames. In this case, remember that append is an expensive operation for data frames and it is preferable to create a long list of your dataframes and then execute concat once. A great discussion of this is here.

Being Smart About Data Types

When reading in a .csv pandas does a good job inferring appropriate datatypes for each column, but it is not memory-optimized and with a large file this can cost you.

A good approach is to read in a very large but manageable chunk of the data frame, check what dtypes pandas has defaulted to, and then inspect the columns of the dataframe to see if you can improve on the defaults. It's also informative to take a look at the dataframes memory_usage.

In [5]:
df = pd.read_csv("data.csv", nrows=5)
df.head()
Out[5]:
id last_name age sex acct_type active
0 310001 Primmer 39 F savings 0
1 442877 Conner 31 M checking 0
2 541890 Desalle 19 NaN checking 1
3 333365 Paulo 39 F savings 1
4 442877 Sater 22 M checking 1
In [6]:
df.dtypes
Out[6]:
id            int64
last_name    object
age           int64
sex          object
acct_type    object
active        int64
dtype: object
In [7]:
df.memory_usage()
Out[7]:
Index        80
id           40
last_name    40
age          40
sex          40
acct_type    40
active       40
dtype: int64

[Pandas calls strings "object" datatypes, more info on pandas data types is here. And here is the list of allowed numpy data types.]

A great example here is that we believe "active" is going to be just binary 1/0 values, but pandas wants to be safe so it has used np.int32 instead of the smaller np.int8. Also, since np.int32 can represent positive and negative integers up to 2 billion, we don't really need np.int64 for the "age", or even the "id" column.

You can cast columns into a paricular datatype during the csv read by passing in a dictionary of {colname: dtype}, or you can process the dataframe chunks after the read with the astype method:

In [12]:
# Cast during the csv read
df = pd.read_csv("data.csv", nrows=5, dtype={"active": np.int8})  

# ...or cast after reading 
df["age"] = df["age"].astype(np.int16)
df["id"] = df["id"].astype(np.int16)
In [9]:
df.dtypes
Out[9]:
id            int16
last_name    object
age           int16
sex          object
acct_type    object
active         int8
dtype: object
In [10]:
df.memory_usage()
Out[10]:
Index        80
id           10
last_name    40
age          10
sex          40
acct_type    40
active        5
dtype: int64

There is a actually a lot more than can be done in terms of processing data during pandas read_csv. Check out this SO answer and also the official read_csv docs for more info. You can do things like:

  • parse columns to date time (parse_dates kwarg)
  • apply arbitrary conversion functions to each column (converters kwarg)
  • specify values that should be recognized as NaN (na_values kwarg)
  • specify reading a selection of columns (usecols kwarg)

NaNs Spoil Everything

A lot of the above settings for read_csv will not work with columns that have NaN. The pandas representation for missing data, np.nan, is actually a float64 data type, so if you have e.g. a binary column of 0s and 1s then even one NaN will be enough to force it to be typed float64 rather than int8. In these cases you can let pandas use a large dtype when reading in a chunk and then do your own processing of the column to impute NaNs, followed by astype to convert to get a smaller data type.

The na_values kwarg during the csv read can convert things like the string "NaN" into a np.nan object, which can save you from having a mixed-data column of e.g. numeric with string "NaN".

Compressing Categorical Variables

In the above example we have the "sex" category which ultimately is binary, but here is represented by strings. To save on RAM you can transform categoricals into an integer mapping, and then astype the column to np.int8 or np.int16.

A neat tool for doing categorical encoding is the scikit learn LabelEncoder class. You "fit" this encoder by giving it data which contains all the possible unique values that category will take on. A word of warning though, LabelEncoder breaks on np.nan so first impute these to e.g. the string "NaN" or some such.

In [7]:
# Impute the NaNs in the sex column, let's make them their own category:
df.fillna(value={"sex": "NAN"}, inplace=True)
df["sex"].unique()
Out[7]:
array(['F', 'M', 'NAN'], dtype=object)
In [8]:
# Create a label encoder and fit it
from sklearn.preprocessing import LabelEncoder
le_sex = LabelEncoder()
le_sex.fit(df["sex"].unique())

le_sex.classes_  # The fit results in two classes - M and F
Out[8]:
array(['F', 'M', 'NAN'], dtype=object)

Once the encoder has been fitted you can use it to "transform" this feature's category values into integers.

In [11]:
# Example of the transform
print(le_sex.transform(["M", "M", "NAN", "F"]))
[1 1 2 0]
In [136]:
# Transform the sex categories into integers and then change type
df["sex"] = le_sex.transform(df["sex"])
df["sex"] = df["sex"].astype(np.int8)
df["sex"]
Out[136]:
0    0
1    1
2    2
3    0
4    1
Name: sex, dtype: int8

Sometimes all of this bending over backwords with chunks and data type still isn't sufficient to let you work with the full data set in RAM. In such cases you can consider using a local SQL database.

How Does a Database Help With Memory?

Say you have a .csv file and you want to extract from it only the rows with user ID having a certain value. The only way to do that is to look through the entire file line by line (read through it's bytestream): a .csv. file is simply not a data structure you can interact with easily. [I found this SO answer enlightening regarding files just being byte streams.]

If the file is small enough we can read the .csv into a data structure that is nice to interact with, like a pandas DataFrame, which lives in your working memory (RAM). But if the data set is very large then you instead need a data structure that lives on your disk rather than in RAM but is designed to still be easy and fast to interact with. The answer to this riddle is a database.

Intro to Local Databases with sqlite3

Sqlite is a lean relational database software that runs locally on your PC (rather than a remote server). Each database is stored as a single file on disk. Sqlite understands most SQL language syntax although some things are omitted and other things added. The sqlite3 module for in the Python Standard Library provides the functionality for working with Sqlite databases within Python.

Preliminarily, a database is a collection of tables each of which is like a spreadsheet with rows and columns. Fundamentally you must be able to Create (put new data into a table), Read (pull data out of a table), Update (modify existing data in a table), and Delete (drop data from a table).

[If you want to physically browse through a sqlite DB there is software for that - DB Browser for Sqlite is a good one.]

Create sqlite Database and Connect to it with Python

Within Python databases will be represented through connection type objects which you create. To actually interact with the database requires a cursor object to which we can pass SQL commands. This is super easy to set up:

In [15]:
import sqlite3
connex = sqlite3.connect("all_your_base.db")  # Opens file if exists, else creates file
cur = connex.cursor()  # This object lets us actually send messages to our DB and receive results

Read .csv into Database by Chunks

With a big file don't try to import the whole thing into your DB, use an intermediate tool that chunks the CSV up and then load each chunk into the DB. There are several tools for this, but I'll stick with pandas df.to_sql method.

Remember how a database is a collection of tables? You must specify both the connection object which identifies which database you are connected to, but also the identifier for the particular table within the database that you want to write to. If your database is empty it will create a table from the name you pass.

In [16]:
for chunk in pd.read_csv("data.csv", chunksize=4):
    chunk.to_sql(name="data", con=connex, if_exists="append", index=False)  #"name" is name of table 
    print(chunk.iloc[0, 1])
Primmer
Sater
Smith

One "gotcha" with to_sql is that by default it will write the pandas dataframe index to the database as it's own column. Since the index is usually just incidental, remember to set the kwarg index=False to prevent this.

The Briefest Possible Introduction to SQL

You probably want to know something about the SQL language at this point, before we start throwing queries around willy nilly. Since the internet overfloweth with useful SQL tutorials I won't bother with that here - I'll just give the flavor of it.

Here is a SQL command example for creating a table called "person" which houses general information about people:

In [3]:
sql = "CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, age INTEGER);"

SQL commands are usually pretty easy to interpet. We are creating a table called person which has several columns, each of which has it's type specified. Some things of note:

  • SQL commands are terminated by a semicolon
  • The convention is to UPPERCASE the words which are SQL keywords (as opposed to e.g. a column name)
  • The PRIMARY KEY keyword says that this column is the basis of the index for looking things up in this table. That means finding things by "id" will be very fast. You can make additional indices based on different columns also.

Let's look at another example that fetches the first_name and last_name columns from the "person" table of the database. (I mean, we just created it so it's obviously empty, but pretend it's not.)

In [2]:
sql = "SELECT first_name, last_name FROM person;"

One more: let's add a new column to this table to hold what town the person lives in, notice we can specify a default value to be applied to all existing entries:

In [122]:
sql = "ALTER TABLE person ADD COLUMN town TEXT DEFAULT 'North Pole';"
Out[122]:
<sqlite3.Cursor at 0x1a9ee483d50>

So now you know SQL, so easy! OK you don't know SQL, but you can probably recognize the cadence of the queries, and you won't be intimidated by the weird mixture of UPPER and lower case...

Actually Doing Stuff With Your Database!

Query Your DB with execute

When you send a SQL query command via your cursor the returned results are stored in the cursor object itself and are iterable. The cursor's fetchone method acts like next() to grab the next value, while the fetchall method pulls values from the generator until it is exhausted.

In [17]:
sql = "SELECT DISTINCT sex FROM data;"  # A SQL command to get unique values from the sex column 
cur.execute(sql)  # The cursor gives this command to our DB, the results are now available in cur
Out[17]:
<sqlite3.Cursor at 0x1adb2ffd110>
In [18]:
cur.fetchone()  # Pop the first result of the query like `next()`
Out[18]:
('F',)
In [19]:
cur.fetchall()  # Pop the remaining results of the query
Out[19]:
[('M',), (None,)]
In [20]:
# Execute returns a generator you can loop through.
for result in cur.execute(sql):
    print(result)
('F',)
('M',)
(None,)

Pull DataFrames from your DB with pandas

Pandas actually has a lot of built in functionality that makes it easy to shuttle data between DataFrames and SQL databases - you just pass it the cursor object and the command string. This is soooo convenient - I heart pandas :)

Here is a simple row-selection operation that reads the results right into a dataframe.

In [8]:
ids = [494724, 102001]
ids = [str(id) for id in ids] 
str_matching = "(" + ",".join(ids) + ")"  # Construct the string of SQL language
print(str_matching)
(494724,102001)
In [9]:
sql = "SELECT * FROM data WHERE id IN " + str_matching + ";"
print(sql)
SELECT * FROM data WHERE id IN (494724,102001);
In [10]:
df = pd.read_sql_query(sql, connex)
df.head()
Out[10]:
id last_name age sex acct_type active
0 494724 Mellon 30 F checking 0
1 102001 Proust 72 M savings 1

Write DataFrames to your DB with pandas

Since you can pull a dataframe out of a DB, you would expect to be able to push one to it also. Let's make some new entries in a new dataframe and then write them to the database.

In [17]:
df_new = pd.DataFrame(columns=df.columns,
                      data={"id": [200100, 109333], 
                      "last_name": ["Caster", "Spenny"], 
                      "age": [81, 62],
                      "sex": ["F", "F"],
                      "acct_type": [np.nan, "savings"],
                      "active": [1, 1]})
In [18]:
df_new
Out[18]:
id last_name age sex acct_type active
0 200100 Caster 81 F NaN 1
1 109333 Spenny 62 F savings 1
In [23]:
df_new.to_sql(name="data", con=connex, if_exists="append", index=False) 

The kwarg if_exists tells pandas what to do if the table in the database already exists. We can double check the results of our push by reading out unique last names from the database:

In [24]:
sql = "SELECT DISTINCT last_name FROM data;"  # A SQL command to get unique values from the sex column 
cur.execute(sql)  # The cursor gives this command to our DB, the results are now available in cur
cur.fetchall()
Out[24]:
[('Primmer',),
 ('Conner',),
 ('Desalle',),
 ('Paulo',),
 ('Sater',),
 ('Mellon',),
 ('Proust',),
 ('Piper',),
 ('Smith',),
 ('Brimbo',),
 ('Chamberlin',),
 ('Yollinski',),
 ('Caster',),
 ('Spenny',)]

Improve Speed With a New Index

If you know you will be pulling records according to the value of a certain column(s) very frequently, make a new index for your database on that column. You can see the time difference even with a very tiny database.

In [27]:
sql2 = "CREATE INDEX id_idx ON data (id);"
cur.execute(sql2)
connex.commit()

Saving Your Changes and Disconnecting!

Whenever we would like to save changes that we have made to the database we must commit them, and once we are finished with the database we need to close it:

In [133]:
connex.commit()
connex.close()

We're Being Naughty With Parameter Passing

I've just been using vanilla python string formatting to construct my SQL queries, and then passing the string to the database. This is actually kinda frowned upon - you're instead supposed to use a special string formatting convention where variable values to be filled in are placeheld by ?. Ideally you would do something like:

In [ ]:
pd.read_sql_query("SELECT * FROM data WHERE ncodpers IN (?, ?);", con=connex, params=(100, 101))

This would fill in the question marks with whatever you provided for the params kwarg. This doesn't work with the sqlite driver for some reason so instead you have to do all your string formatting outside the read_sql_query call. Just thought you should know.