A Boilerplate for Python Data Scripts

21 August 2015

Most of the data scripts I write follow a pretty basic pattern. Since I try to write scripts as programs, I do my best to properly set up argument parsing and logging and whatnot. But of course that’s tedious, and sometimes I get lazy.

So, in the name of not repeating work, and because other people might find it helpful, I’ve put together a boilerplate for data scripts. I’ve put up commented and uncommented versions on GitHub, but for discussion purposes, here’s the commented version:

#!/usr/bin/env python
"""A boilerplate script to be customized for data projects.

This script-level docstring will double as the description when the script is
called with the --help or -h option.

"""

# Standard Library imports go here
import argparse
import contextlib
import io
import logging
import sys

# External library imports go here
#
# Standard Library from-style imports go here
from pathlib import Path

# External library from-style imports go here
#
# Ideally we all live in a unicode world, but if you have to use something
# else, you can set it here
ENCODE_IN = 'utf-8'
ENCODE_OUT = 'utf-8'

# Set up a global logger. Logging is a decent exception to the no-globals rule.
# We want to use the logger because it sends to standard error, and we might
# need to use the standard output for, well, output. We'll set the name of the
# logger to the name of the file (sans extension).
log = logging.getLogger(Path(__file__).stem)


def manipulate_data(data):
    """This function is where the real work happens (or at least starts).

    Probably you should write some real documentation for it.

    Arguments:

    * data: the data to be manipulated

    """
    log.info("Doing some fun stuff here!")
    return data


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    # If user doesn't specify an input file, read from standard input. Since
    # encodings are the worst thing, we explicitly wrap standard input in a
    # TextIOWrapper with our chosen encoding instead of trusting the default.
    parser.add_argument('-i', '--infile',
                        type=lambda x: open(x, encoding=ENCODE_IN),
                        default=io.TextIOWrapper(
                            sys.stdin.buffer, encoding=ENCODE_IN)
                        )
    # Same thing goes with the output file.
    parser.add_argument('-o', '--outfile',
                        type=lambda x: open(x, 'w', encoding=ENCODE_OUT),
                        default=io.TextIOWrapper(
                            sys.stdout.buffer, encoding=ENCODE_OUT)
                        )
    # Set the verbosity level for the logger. The `-v` option will set it to
    # the debug level, while the `-q` will set it to the warning level.
    # Otherwise use the info level.
    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument('-v', '--verbose', action='store_const',
                           const=logging.DEBUG, default=logging.INFO)
    verbosity.add_argument('-q', '--quiet', dest='verbose',
                           action='store_const', const=logging.WARNING)
    return parser.parse_args()


def read_instream(instream):
    """Convert raw input for to a manipulatable format.

    Arguments:

    * instream: a file-like object

    """
    # If you need to read a csv, create a DataFrame, or whatever it might be,
    # do it here.
    return instream.read()


def main():
    args = parse_args()
    logging.basicConfig(level=args.verbose)
    data = read_instream(args.infile)
    results = manipulate_data(data)
    args.outfile.write(results)

if __name__ == "__main__":
    main()

The comments explain what’s going on, but the basic idea is to follow Doug McIlroy’s version of the Unix Philosophy:

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

Let’s walk through that:

Do one thing well

A boilerplate can’t keep you from over-complicating your script, but the thrust here is to write one primary function, and whatever support functions you need. Keep it simple; one script per conceptual step. That one thing should be reinforced in:

  • The name of the script (which doubles as the log name)
  • The script-level docstring (which doubles as the help description)
  • The name of the manipulation function
  • The docstring of the manipulation function

Write programs to work together

In this context, I take this to mean two things. First, the script can be run stand-alone or imported as a library: the main, parse_args, and read_instream functions take care of all the file stuff so that the manipulate step does exactly that: manipulate data. The second thing it means to me is flexibility in input and output. This script can read or write from files, or it can read from standard input and write to standard output, or any combination thereof. Which takes me to the next point:
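To make the run-versus-import split concrete, here’s a minimal sketch. The function bodies are toy stand-ins (in a real project you would import the actual functions from the script, e.g. `from my_script import manipulate_data`), but the shape is the point: because argument parsing and file handling live in parse_args and main, importing the interesting functions pulls in no CLI machinery at all.

```python
import io

# Toy stand-ins for the boilerplate's functions. In the real script these
# do the actual work; here they just illustrate the calling pattern.
def read_instream(instream):
    return instream.read()

def manipulate_data(data):
    return data.strip().title()

# A library caller skips main() entirely and hands in any file-like
# object it likes -- here an in-memory buffer instead of a real file:
data = read_instream(io.StringIO("  hello from a library caller\n"))
print(manipulate_data(data))  # Hello From A Library Caller
```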

Write programs to handle text streams

You can pipe in or out of this script with no trouble. In fact, since the read_instream function is public,1 you can even use text streams with no trouble when using the script as a library. The encodings are unicode by default, but you can change them if you absolutely have to. This is also one reason to use the logging module instead of, say, print functions. Logs from logging print to standard error, not standard output; you can log safely without worrying about screwing up your output.
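The stderr/stdout split is easy to demonstrate. In this sketch (names are illustrative, and the transformation is a stand-in), the log line goes to standard error while the result goes to standard output, so a downstream program in the pipe sees only the data:

```python
import logging
import sys

# Logging handlers default to standard error, so log messages never
# contaminate the data stream on standard output.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")

def manipulate_data(data):
    log.info("processing %d characters", len(data))  # written to stderr
    return data.upper()

sys.stdout.write(manipulate_data("piped text\n"))  # written to stdout
```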

Is this overkill?

The uncommented version of the boilerplate is 79 lines. Is all of it really necessary for every small job? Well, no. Not necessary. What I’m trying to do is encourage myself (and anybody who feels like using this boilerplate) to use practices that will be helpful over the long run.

I often find myself repeating tasks in different projects and different circumstances. When I can just drop in a script I already wrote and have it work, it makes my life better. When I know that I could have had that but didn’t take the time, it makes my life worse. All this helps me stay on the better side.

If you can think of a way to make it even better, feel free to hit me up on Twitter or open an issue on GitHub!


  1. Because this is Python, and we’re all adults here.↩︎

Redesign and Missing Posts

26 July 2015

I’ve redesigned the site because I felt it was looking a little stale. For anyone interested, this site is hosted on GitHub Pages, using the Jekyll site generator. I use the Pure CSS framework, the jQuery JavaScript library, and Modernizr.

Part of the redesign means that my older posts have to be reformatted a bit. There are few enough of them that I’ve decided just to pull them all off and then go through them by hand. Some old posts I don’t care for enough to bring back, but I doubt anyone will miss them.

For Data Projects, Write Programs not Scripts

12 March 2015

One thing I see a lot in data work is scripts that look something like this:

"""
Some Data Analysis
"""

import some_library

data = some_library.read_file("input.csv")
data = data.select(["this", "that", "the other"])
# Bunch of work goes here
some_library.write_file(data, "output.csv")
some_library.make_pretty_chart(data, "chart.png")

Sometimes you’ll get a few functions in there, and sometimes even a main function, but the central driving idea is that you write a script to do the specific task at hand.

In theory that sounds eminently sensible; in practice it makes your life harder.

The whole point of computers is that they do repetition better than we do, and there’s almost no step in data work that you will only want to do once.

A better solution is to write every script as if you intended it to be a stand-alone program. For example, we could make the above pseudo-script look like this:

"""
Do some data analysis
"""
import argparse

import some_library


def analyze(data):
    """Do analysis on data and return result"""
    # Data work goes here
    return data


def write_output(data, output_file, output_img):
    """
    Write data file to output_file and
    write chart to output_img
    """
    some_library.write_file(data, output_file)
    some_library.make_pretty_chart(data, output_img)


def create_output_from_file(
        input_file, output_file, output_img):
    """
    Read data from input_file
    Write analyzed data to output_file
    Write chart to output_img
    """
    data = some_library.read_file(input_file)
    data = analyze(data)
    write_output(data, output_file, output_img)


def parse_args():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    parser.add_argument("output_img")
    return parser.parse_args()


def main():
    args = parse_args()
    create_output_from_file(args.input_file,
                            args.output_file,
                            args.output_img)


if __name__ == "__main__":
    main()

You would just run this script using python script_name.py input.csv output.csv output.png to get the same result as the original. The difference is, this version of the script is completely portable; you can take it and drop it into any project where you need the functionality.

Additionally, it’s import-safe, so you can pull your analyze or write_output function into another script.

Now of course, you still want to put your specific filepaths for the specific project you’re working on somewhere, so that you don’t have to type them over and over. You could do that using a shell script, a master python script, or even (and ideally), a makefile. But here again, the organization works to your advantage; you can separate your functionality into small, clean, reusable scripts and still have one file that concisely tells you what’s happening from beginning to end.
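A master Python script along those lines can be tiny. In this hypothetical driver the project-specific paths live in one place, while the reusable logic would be imported from the analysis script; here create_output_from_file is a stand-in that just reports its plan rather than doing real I/O:

```python
# Hypothetical driver script: one line per pipeline step, all paths in
# one place. In a real project the next line would be an import, e.g.
# `from analyze_sales import create_output_from_file`.
def create_output_from_file(input_file, output_file, output_img):
    # stand-in that reports what the real function would do
    return f"{input_file} -> {output_file} + {output_img}"

STEPS = [
    ("raw/sales.csv", "out/sales.csv", "out/sales.png"),
    ("raw/costs.csv", "out/costs.csv", "out/costs.png"),
]

for step in STEPS:
    print(create_output_from_file(*step))
```

A makefile buys you more (incremental rebuilds, dependency tracking), but a driver like this is the same idea in pure Python.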

This, of course, is nothing but bringing the Unix Philosophy to data work:

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

—Doug McIlroy

Well, OK, I didn’t get to text streams. But one thing at a time.