In which I play with cookie cutters and subsequently crave actual cookies.¶
I recently started freelance data sciencing on Upwork, which is awesome! My second contract there involved writing some Python for a client who wanted to be able to execute various analyses and visualizations from his data on securities trading. Since I was essentially responsible for delivering a piece of software to this dude, I spent a little bit of time reading up on how to structure a project like this. What I discovered was a pretty solid consensus on how all publicly distributed Python packages (modules) should be structured.
So I conformed the client's code to this standard Python package structure and documented the hell out of it. Then something amazing happened... I felt like I had actually created a persistent, useful tool and code base for myself and this guy, rather than a collection of haphazard scripts with a workflow that nobody would be able to easily reproduce one month hence. So here is the revolutionary idea: treat your data science projects like distributed Python packages - structure them a standard way and document the crap out of them.
Structuring a Python Project¶
So how does the standard layout for a publicly distributed python package look? The simplest version looks like this:
!tree /F /A
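(The actual tree output isn't reproduced here, so here's a rough sketch of the layout it would show, with coolprojectname standing in for your package name:)
coolprojectname/
    README.md
    LICENSE
    requirements.txt
    docs/
    coolprojectname/
        test/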
As you can see, your project directory should have a handful of top-level files like README.md, LICENSE, and requirements.txt. Then it should have at minimum the following three folders:
- docs to store documentation
- coolprojectname to store your actual python package (the name of this folder is the name of the package)
- coolprojectname/test to store scripts to run tests on your package code
The files at the top level that let anyone exactly duplicate the virtual environment needed for the project to run are a requirements.txt file (used by pip to recreate an environment) and a .yml file (used by conda to recreate an environment). (If that was confusing to you then you should read the section below on environments.)
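To make that concrete, the contents of those two files look roughly like this (package names and versions are purely illustrative):
# requirements.txt (read by pip)
numpy==1.12.1
pandas==0.19.2

# coolprojectname.yml (read by conda)
name: coolprojectname
dependencies:
  - python=2.7
  - numpy=1.12.1
  - pandas=0.19.2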
Project Structure Resources¶
Structuring a Python Data Science Project¶
Turns out some really smart people have thought a lot about this task of standardized project structure. The cookiecutter tool is a command line tool that instantiates all the standard folders and files for a new python project. It turns out there is an awesome fork of this project, cookiecutter-data-science, that is specific to data science! The official cookiecutter-data-science docs are actually excellent (and short) so I recommend you read them cover-to-cover.
Make sure you have git.exe in your path (if, like me, you've just been using the Git Bash command line utility then you probably don't have it in your path yet). Now, from a system shell do:
conda config --add channels conda-forge
conda install cookiecutter
cookiecutter https://github.com/drivendata/cookiecutter-data-science
You'll be prompted to input various things like the project name (directory name) and the repo name (the name of the github repo). After that, your new directory awaits, chock full of useful standardized stuff.
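For reference, the generated project looks roughly like this (abridged, from my memory of the template at the time - the cookiecutter-data-science docs have the authoritative layout):
coolprojectname/
    LICENSE
    Makefile
    README.md
    requirements.txt
    tox.ini
    data/               (external, interim, processed, raw)
    docs/
    models/
    notebooks/
    references/
    reports/figures/
    src/                (data, features, models, visualization)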
There is a .gitignore which excludes the data by default (since those are usually very big files).
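The relevant part of that .gitignore is just a short block along these lines (paraphrased from memory - check the generated file for the exact contents):
# exclude data from source control by default
/data/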
There is also a tox.ini file. From the official docs: "Tox is a generic virtualenv management and test command line tool you can use for checking your package installs correctly with different Python versions and interpreters, running your tests in each of the environments and configuring your test tool of choice." I definitely have no experience with tox, and since I feel like this adds an unnecessary complication I'm not going to be using it.
The docs folder is actually a sphinx project, so that the documentation can be auto-generated. sphinx is a tool that parses your project code to auto-generate documentation based on your docstrings!
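If you haven't touched sphinx before, the basic loop looks roughly like this (a sketch, assuming the autodoc extension is enabled in docs/conf.py; the generated docs folder gives you most of the scaffolding):
conda install sphinx                       # or: pip install sphinx
sphinx-apidoc -o docs/ coolprojectname/    # generate .rst stubs from the package's docstrings
cd docs
make html                                  # build HTML docs into docs/_build/html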
Use an Environment, I Beg You¶
Please, for the love of god, use a virtual environment for any substantial project involving code you might ever want to reuse. I can reassure you that although they sound super technical, they are actually dead simple to use. A virtual environment is like a self-contained, preserved install of a specific version of Python and packages. It lets you completely reproduce the working environment for your code at any time - you'll never again break ALL the things by updating!
In a more boring and realistic sense, a virtual environment is just an isolated folder holding specific versions of Python and packages. You make a virtual environment by indicating which version of Python should be copied into your new directory (or specifying the path from which to grab Python), and it will also install pip into the environment by default. After this you can enter the virtual environment and use pip (or conda) like normal to add packages, but they will be installed into your new virtual environment directory. Being inside a virtual environment is basically telling your OS that your Python executable (and the place it looks for packages) has moved to the new virtual environment directory.
The standard way of managing virtual environments in Python is with the package manager pip and a 3rd party tool, virtualenv (in Python 3 there is built-in support via the venv module, which you can use instead of virtualenv). But conda, the package management solution that ships with the Anaconda distribution of Python, has its own (better) approach to virtual environments.
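For comparison with the conda teaser below, the pip-and-venv version of the same dance looks roughly like this (Python 3 syntax; on Windows the activate script lives under Scripts instead of bin):
python -m venv coolprojectname          # create the env in a folder of that name
source coolprojectname/bin/activate     # enter the env
pip install pandas                      # installs into the env, not your system python
pip freeze > requirements.txt           # snapshot the env so it can be recreated
deactivate                              # leave the env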
Here is a teaser of how simple conda environments are to use:
conda create --name coolprojectname python=2.7 # Create a new env with only a fresh install of python 2.7 (no packages)
activate coolprojectname # Enter the env
conda install pandas # Install pandas, best module ever, into your new env
conda env export > coolprojectname.yml # A file that conda can use to recreate the env
pip freeze > requirements.txt # A file that pip can use to recreate the env
deactivate # So long for now!
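And when you (or anyone else) need to rebuild the environment from those files later, the corresponding commands are:
conda env create -f coolprojectname.yml # Recreate the env from the conda export
pip install -r requirements.txt # Or recreate the packages with pip inside an existing env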
Conda by default puts your virtual environment folders inside /Anaconda3/envs, so you should be able to go see a new folder there called coolprojectname. To make a conda environment with Python 2.7 and all the standard Anaconda packages with it, you do conda create -n myapp python=2.7 anaconda (notice the anaconda at the end).
Virtual Environment Resources¶
- Primer on virtual environments by RealPython
- Official conda docs on managing environments.
- General overview of virtual environments in python
- Short stackoverflow answer and a longer official article about how conda environments are different (better) than virtualenv.
Documenting the Code¶
In my opinion there are four essential pieces to this. Starting from the highest level and zooming in to the details, they are:
- A detailed README with the standard sections.
- Good, long docstrings at the top of each .py file.
- Good, long, RST-style docstrings at the top of each function (basic helper functions can have a single-line docstring).
- Good, short, "block" and "inline" comments throughout the code.
The README.md is the most critical file in my opinion - the one that everyone looks for when faced with a new unknown repo. It should start with a description of the project and a link to additional documentation (for example hosted at readthedocs). It should also include a "Quickstart" section on how to install and start using the project. If the project has non-python dependencies, these should also be stated in the README.
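As a sketch, a README.md following that advice might be laid out like this (the section names and link are just placeholders):
# coolprojectname
One-paragraph description of the project, plus a link to the full docs, e.g. https://coolprojectname.readthedocs.io (hypothetical).

## Quickstart
    conda env create -f coolprojectname.yml
    activate coolprojectname

## Non-Python Dependencies
For example: git, a C compiler, database drivers - whatever the project needs that pip/conda won't install.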
The definitive sources on commenting and docstrings are the PEP8 style guide and the PEP257 semantics guide, which tells you what a docstring should say and how it should say it (but is markup-syntax-agnostic). If you haven't already, go read at least the sections on commenting. Here is an example of the kind of commenting you should aim for:
def plot_averages(df, startdate=None, stopdate=None, trailing=200):
    """
    Compute and plot a moving average of the daily sum of MKT_VAL and CURR_FACE together with their daily totals.
    Do this for all transactions and also by ABS_TYPE groups. Return the matplotlib ax object of the plot.

    :param df: the pandas dataframe holding the data, indexed by timestamp
    :param startdate: first date of the range of interest (inclusive). Expects format mo/day/year
    :param stopdate: last date of the range of interest (inclusive). Expects format mo/day/year
    :param trailing: the size of the window for the moving average
    :return: axes object for the plot
    """
    # Block comment, applies to the lines of code right below it.
    pass
    pass  # Inline comment. Use sparingly.
    pass
I use the excellent PyCharm IDE, and after upgrading to the most recent version (2017.1) the default behavior is that after you begin a docstring with the """ and then press enter, it auto-generates a template for all the ":param:" and ":return:" lines based on the function signature!
A big reason to use a specific docstring convention is that some great tools exist for auto-generating documentation of your project by parsing the module and function docstrings!
Documenting and Commenting Resources:¶
The Final Workflow¶
Move to where you want your project to live, create the cookiecutter template, and provide the name etc. when prompted.
cookiecutter https://github.com/drivendata/cookiecutter-data-science
Create a virtual environment for your project which has the same name as your project, and freeze the requirements. You probably have a core set of package dependencies, so you could instead make this environment once, name it something like scipybase, and then clone it. This has the disadvantage that if you want to always be using the most current version of packages then you have to update them in the new cloned environment. Note that if you like using ipython as your development shell then you need to install it here! It also helps to just go ahead and install sphinx so that you can generate your docs without leaving the environment.
conda create --name coolprojectname --clone scipybase
activate coolprojectname # Enter the env
conda update --all
conda env export > coolprojectname.yml # A file that conda can use to recreate the env
pip freeze > requirements.txt # A file that pip can use to recreate the env
deactivate
Initialize the project directory as a local repo, and link it to a github repo. You can actually do this from your local command line (below, replace USER with your username and REPO with the name you want for the repo).
curl -u 'USER' https://api.github.com/user/repos -d '{"name":"REPO"}' # Replace USER and REPO!
git init # Make the project directory a local repo (skip if it already is one)
git add .
git commit -m "Initial commit from cookiecutter template"
git remote add origin git@github.com:USER/REPO.git
git push origin master
If you use a heavyweight IDE like I do (PyCharm) then go ahead and initialize a PyCharm project (or whatever your IDE calls them) from the top level of your cookiecutter project. It's going to add its own control files here, but if they are hidden files (starting with ".") then they'll be gitignored by default (otherwise consider adding them to .gitignore).
Now you can actually proceed with the important stuff! Data science workflows are notoriously hard to pigeonhole, but if you want to see people try to describe them, just search Google Images for "data science workflow".
Finally, in refactoring jupyter notebooks into src code scripts you might find the following command helpful:
jupyter nbconvert --to script NOTEBOOKNAME.ipynb
My plan right now is to use this workflow and template for my next freelance project or kaggle competition (whichever comes first). At that point I'll update this post with tips and challenges.