Treat Your Data Science Projects Like Open Source Packages

Posted by Sonya Sawtelle on Fri 17 February 2017

In which I play with cookie cutters and subsequently crave actual cookies.

I recently started freelance data sciencing on Upwork, which is awesome! My second contract there involved writing some Python for a client who wanted to be able to execute various analyses and visualizations from his data on securities trading. Since I was essentially responsible for delivering a piece of software to this dude, I spent a little bit of time reading up on how to structure a project like this. What I discovered was a pretty solid consensus on how all publicly distributed Python packages (modules) should be structured.

So I conformed the client's code to this standard Python package structure and documented the hell out of it. Then something amazing happened... I felt like I had actually created a persistent, useful tool and code base for myself and this guy, rather than a collection of haphazard scripts with a workflow that nobody would be able to easily reproduce one month hence. So here is the revolutionary idea: treat your data science projects like distributed Python packages - structure them in a standard way and document the crap out of them.

Structuring a Python Project

So how does the standard layout for a publicly distributed python package look? The simplest version looks like this:

In [59]:
!tree /F /A
Folder PATH listing for volume OS
Volume serial number is 00000085 16EB:8E0B
C:.
|   coolprojectname.yml
|   LICENSE.txt
|   README.md
|   requirements.txt
|   
+---coolprojectname
|   \---test
\---docs

As you can see, your project directory should have a handful of top level files like README.md, LICENSE, and requirements.txt. Then it should have at minimum the following three folders:

  • docs to store documentation
  • coolprojectname to store your actual python package (the name of this folder is the name of the package)
  • coolprojectname/test to store scripts to run tests on your package code

The files at the top level that let anyone exactly duplicate the virtual environment needed for the project to run are requirements.txt (used by pip to recreate an environment) and a .yml file (used by conda to recreate an environment). (If that was confusing to you then you should read the section below on environments.)
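
To give a concrete sense of what these files hold, here is a heavily abridged sketch of each (the package names and versions are just placeholders, not a recommendation):

coolprojectname.yml (conda):
name: coolprojectname
channels:
  - defaults
dependencies:
  - python=2.7
  - pandas=0.19.2
  - pip:
    - somepackage==1.0.0

requirements.txt (pip):
pandas==0.19.2
somepackage==1.0.0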


Structuring a Python Data Science Project

Turns out some really smart people have thought a lot about this task of standardized project structure. cookiecutter is a command line tool that instantiates all the standard folders and files for a new Python project from a template. It turns out there is an awesome template built on it, cookiecutter-data-science, that is specific to data science! The official cookiecutter-data-science docs are actually excellent (and short) so I recommend you read them cover-to-cover.

Make sure you have git.exe in your path (if, like me, you've just been using the Git Bash command line utility then you probably don't have it in your path yet). Now, from the system shell, do:

In [ ]:
conda config --add channels conda-forge  # cookiecutter is distributed on the conda-forge channel
conda install cookiecutter
cookiecutter https://github.com/drivendata/cookiecutter-data-science  # Instantiate the data science template in the current directory

You'll be prompted to input various things like the project name (directory name) and the repo name (the name of the GitHub repo). After that your new directory awaits, chock full of useful standardized stuff.
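
For reference, the directory you get looks roughly like this (abridged from memory - the template may have evolved, so see the official docs for the authoritative layout):

coolprojectname
|   LICENSE
|   Makefile
|   README.md
|   requirements.txt
|   tox.ini
|
+---data
|   +---external
|   +---interim
|   +---processed
|   \---raw
+---docs
+---models
+---notebooks
+---references
+---reports
|   \---figures
\---src
    +---data
    +---features
    +---models
    \---visualization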

There is a .gitignore which excludes the data folder by default (since data files are usually very big).

There is also a tox.ini file. From the official docs: "Tox is a generic virtualenv management and test command line tool you can use for checking your package installs correctly with different Python versions and interpreters, running your tests in each of the environments and configuring your test tool of choice." I have no experience with tox, and since it feels like an unnecessary complication for now, I'm not going to be using it.

The docs folder is actually a sphinx project, so the documentation can be auto-generated. sphinx is a tool that parses your project code and builds documentation from your docstrings!
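
I won't pretend to be a sphinx expert, but assuming the docs folder is a standard sphinx skeleton (which is what the template gives you), building the HTML docs from your docstrings looks something like this:

In [ ]:
cd docs
sphinx-apidoc -o . ../src  # Point this at your package folder (src in the template); generates .rst stubs from docstrings
make html  # Or "make.bat html" on Windows; the output lands in _build/html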

Use an Environment, I Beg You

Please, for the love of god, use a virtual environment for any substantial project involving code you might ever want to reuse. I can reassure you that although they sound super technical, they are actually dead simple to use. A virtual environment is like a self-contained, preserved install of a specific version of Python and packages. It lets you completely reproduce the working environment for your code at any time - you'll never again break ALL the things by updating!

In a more boring and realistic sense, a virtual environment is just an isolated folder holding specific versions of python and packages. You make a virtual environment by indicating which version of python should be copied into your new directory (or specify the path from which to grab python) and it will also install the pip package into the environment by default. After this you can enter the virtual environment and use pip (or conda) like normal to add packages, but they will be installed into your new virtual environment directory. Being inside a virtual environment is just like telling your OS that your python executable and your PATH (where it looks for modules) have moved to the new virtual environment directory.

The standard way of managing virtual environments in Python is with the package manager pip and a 3rd party tool, virtualenv (in Python 3 there is built-in support via the venv module, so you don't even need virtualenv). But conda, the package management solution that ships with the Anaconda distribution of Python, has its own (better) approach to virtual environments.
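
For comparison, here is a minimal sketch of the pip route, assuming Python 3 and a Windows shell (the environment name is just an example):

In [ ]:
python -m venv coolprojectname-env  # Create the env using the built-in venv module
coolprojectname-env\Scripts\activate  # Enter the env (on Mac/Linux: source coolprojectname-env/bin/activate)
pip install pandas  # Installs into the env, not your system python
pip freeze > requirements.txt  # Snapshot the env so anyone can rebuild it with "pip install -r requirements.txt"
deactivate  # Leave the env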

Here is a teaser of how simple conda environments are to use:

In [ ]:
conda create --name coolprojectname python=2.7  # Create a new env with only a fresh install of python 2.7 (no packages)
activate coolprojectname  # Enter the env
conda install pandas  # Install pandas, best module ever, into your new env

conda env export > coolprojectname.yml  # A file that conda can use to recreate the env
pip freeze > requirements.txt  # A file that pip can use to recreate the env
deactivate  # So long for now!

Conda by default puts your virtual environment folders inside /Anaconda3/envs, so you should be able to go see a new folder there called coolprojectname. To make a conda environment with Python 2.7 and all the standard Anaconda packages included, you do conda create -n myapp python=2.7 anaconda (notice the anaconda at the end).
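
If you want to double-check where an environment lives (or just see which ones you have), conda will tell you:

In [ ]:
conda env list  # Lists every environment and its location; the active one is marked with an asterisk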

Documenting the Code

In my opinion there are four essential pieces to this. Starting from the highest level and zooming in to the details, they are:

  1. A detailed README with the standard sections.
  2. A good, long docstring at the top of each .py file.
  3. Good, long, RST-style docstrings at the top of each function (basic helper functions can get away with a single-line docstring).
  4. Good, short "block" and "inline" comments throughout the code.

The README.md is the most critical file in my opinion - the one that everyone looks for when faced with a new unknown repo. It should start with a description of the project and a link to additional documentation (for example, hosted at readthedocs). It should also include a "Quickstart" section on how to install and start using the project. If the project has non-Python dependencies, these should also be stated in the README.
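
As a rough sketch (the section names are just my own habit, not an official standard), a minimal README might look like:

# coolprojectname
One-paragraph description of what the project does, plus a link to the full documentation.

## Quickstart
How to recreate the environment (conda env create -f coolprojectname.yml, or pip install -r requirements.txt) and a minimal usage example.

## Requirements
Python version, key packages, and any non-Python dependencies.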

The definitive sources on commenting and docstrings are the PEP8 style guide and the PEP257 semantics guide, which tell you what a docstring should say and how it should say it (but are markup-syntax-agnostic). If you haven't already, go read at least the sections on commenting. Here is an example of the kind of commenting to aim for:

In [1]:
def plot_averages(df, startdate=None, stopdate=None, trailing=200):
    """
    Compute and plot a moving average of the daily sum of MKT_VAL and CURR_FACE together with their daily totals. 
    Do this for all transactions and also by ABS_TYPE groups. Return the matplotlib ax object of the plot.
    
    :param df: the pandas dataframe holding the data, indexed by timestamp
    :param startdate: first date of the range of interest (inclusive). Expects format mo/day/year
    :param stopdate: last date of the range of interest (inclusive). Expects format mo/day/year
    :param trailing: the size of the window for the moving average
    :return: axes object for the plot
    """
    
    # Block comment, applies to the lines of code right below it.
    pass
    pass  # Inline comment. Use sparingly.
    pass

I use the excellent PyCharm IDE, and after upgrading to the most recent version (2017.1) the default behavior is that after you begin a docstring with the """ and then press enter, it auto-generates a template for all the ":param:" and ":return:" lines based on the function signature!

A big driver of why you should use a specific docstring convention is that some great tools exist for auto-generating documentation of your project based on parsing the module and function docstrings!

The Final Workflow

Move to where you want your project to live, instantiate the cookiecutter template, and provide the project name etc. when prompted.

In [ ]:
cookiecutter https://github.com/drivendata/cookiecutter-data-science

Create a virtual environment with the same name as your project and freeze the requirements. You probably have a core set of package dependencies, so you could instead make that environment once, name it something like scipybase, and then clone it for each project. The disadvantage is that if you want to always be using the most current versions of packages, you have to update them in the cloned environment. Note that if you like using ipython as your development shell then you need to install it here! It also helps to just go ahead and install sphinx so that you can generate your docs without leaving the environment.

In [ ]:
conda create --name coolprojectname --clone scipybase  # Clone your base env into a new project env
activate coolprojectname  # Enter the env
conda update --all
conda install ipython sphinx  # If you want to develop and build docs from inside the env
conda env export > coolprojectname.yml  # A file that conda can use to recreate the env
pip freeze > requirements.txt  # A file that pip can use to recreate the env
deactivate

Initialize the project directory as a local repo, and link it to a github repo. You can actually do this from your local command line (below, replace USER with your username and REPO with the name you want for the repo).

In [ ]:
curl -u 'USER' https://api.github.com/user/repos -d '{"name":"REPO"}'  # Create the GitHub repo (replace USER and REPO!)
git init  # Initialize the project directory as a local repo
git add .
git commit -m "Initial project structure"
git remote add origin git@github.com:USER/REPO.git
git push origin master

If you use a heavyweight IDE like I do (PyCharm) then go ahead and initialize a PyCharm project (or whatever your IDE calls it) from the top level of your cookiecutter project. It's going to add its own control files here, but if they are hidden files (starting with ".") then they'll be gitignored by default (otherwise, consider adding them to .gitignore).

Now you can actually proceed with the important stuff! Data science workflows are notoriously hard to pigeonhole, but if you want to see people try to describe them, just search Google Images for "data science workflow".

Finally, when refactoring Jupyter notebooks into source code scripts, you might find the following command helpful:

In [ ]:
jupyter nbconvert --to script NOTEBOOKNAME.ipynb  # Converts the notebook's code cells into a plain .py script (ipython nbconvert also works but is deprecated)

My plan right now is to use this workflow and template for my next freelance project or kaggle competition (whichever comes first). At that point I'll update this post with tips and challenges.