16 March 2020

A few days ago I started seeing a chart floating around of coronavirus cases by country for each day, starting on the day the country hit 100 cases. It struck me as odd to put the US on that chart next to, say, Italy, because the US could reach a similar number of cases with small outbreaks in a large number of cities, where Italy got there with huge outbreaks in a few. I found out that the source data included subnational data, so I started putting that version of the chart on Twitter. There’s been enough interest that I’ve decided to make a live tracker1 with that visualization and others as I think of them. You can find that tracker here.

  1. Were you aware that JavaScript charting libraries are, in fact, the worst? The more you know! 

1 March 2020

Background

Oren Cass thinks the middle class has a problem. Cass, a scholar at the Manhattan Institute and head of the new organization American Compass, argues that middle class families feel like they’re worse and worse off, and furthermore, he argues that they’re right. The solution he provides is for American conservatism to throw off the libertarian market fundamentalists who have controlled it like brain slugs since the ’80s, and instead go in for some good old-fashioned industrial planning.

There’s a lot in there. But we’ll blow past most of it for now to focus on the empirical argument that Cass has provided for his claim that the middle class is worse off than it used to be. He doesn’t think that the official statistics (which show the opposite) are very good, because of the way they handle quality adjustments. Inflation measures try to account for the fact that goods get more expensive over time both because of inflation and because quality improves. The next iPhone will be more expensive than the last one, but also better, so a lot of that price increase doesn’t count as inflation.1 But aha! says Cass: what happens if they stop selling the old iPhone? Sure, the new one’s better, but it’s still more expensive to get a smartphone, even if you don’t think the improved quality justifies the higher price.

The Cost of Thriving Index

So to his credit, Cass has put forward an alternative. He picked four categories that he argues are essential to feeling like you’ve got a solid, middle class life. Those categories are:

  • Housing, as measured by the Department of Housing and Urban Development’s estimate of a fair market rent for a 3-bedroom house in Raleigh, NC

  • Health insurance, as measured by the Kaiser Family Foundation’s estimated average premium for employer-sponsored health insurance for a family

  • Transportation, as measured by the Bureau of Transportation Statistics’s estimate of the total cost of driving a car 15,000 miles per year

  • Education, as measured by the Department of Education’s estimate of the cost of one semester of public college

This basket he compares to the final component, income, as measured by median annual wages for male workers. Why only males? Because, he reasons, it used to be perfectly normal to be able to afford a middle-class life on one income, so why shouldn’t it be so now?

The combined price of those four expenses compared to the income measure is what Cass calls the Cost of Thriving Index, which he lays out in a report called The Cost of Thriving.

And by that measure, the American middle class is, indeed, getting screwed over pretty good. Here’s the chart of the number of weeks of work needed to fund the four categories of major expenses for each year from 1985-2018:

Cass Original Chart, Weeks of Work to Cover Major Expenses

Oh, no! It takes more than 52 weeks of work to afford a middle class life on a middle class income! There’s not that many weeks in a year! That sounds pretty bad!

The chart that’s gone viral is this next one, showing the expenses broken out vs. income over time.

Cass Original Chart, Income vs Major Expenses

Pretty dramatic stuff. Wages have gone up over time, sure, but they’ve just been dwarfed by the increased price of the major expenses. Heck, after all that, there’s not even money to pay for food!

The Problem

Again, there’s a lot in there. Almost every bit of it is controversial on a theoretical level. See, for example, this, this, and this.

What I want to focus on is that health care component. You can see from the second chart that it’s the biggest single driver of rising costs. That’s not too surprising because we all know that health care costs have been going up.

But there’s a problem. As most folks with jobs are aware, an employee rarely pays the whole cost of their health insurance; their employer pays some, too. But Cass is using the total cost as the price of health insurance. So in essence, his “cost” of thriving includes costs that a wage-earner does not pay.2

This mistake matters because it goes directly to what the COTI is supposed to measure. We’re supposed to be looking at how long someone has to work to pay for these four major categories. If we’re including things that the wage-earner doesn’t pay for himself, then we’re going to get the wrong picture—at least if the difference is large enough to matter.

What the COTI Says When Done Correctly

I’m not the first person to point this out, but I was curious just how big a problem this was, so I decided to reconstruct the index myself. First, I included the employer-paid health insurance premiums to make sure my number matched Cass’s. Then I recalculated it using only the employee-paid portion, which I got from the same source. It’s worth pointing out that Cass gave clear enough indications of his sources and methods that I was able to reproduce most of his numbers exactly, and the ones that weren’t exact are close enough as makes no difference.3
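
For the flavor of it, here’s a minimal sketch of the calculation, with hypothetical column names standing in for the source series (the real version is in the notebook linked in the footnotes):

import pandas as pd


def weeks_of_work(df, premium_column):
    """Weeks of median male earnings needed to cover the four expenses."""
    expenses = (df['rent'] + df[premium_column]
                + df['transportation'] + df['education'])
    weekly_income = df['median_male_income'] / 52
    return expenses / weekly_income

# df = pd.read_csv('coti_inputs.csv', index_col='year')  # hypothetical file
# original = weeks_of_work(df, 'total_premium')      # Cass's version
# corrected = weeks_of_work(df, 'employee_premium')  # employee share only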

So here’s what the weeks-of-work chart looks like with both the original index that incorrectly includes employer-paid premiums, and the recalculated index that only includes what employees actually pay:

Weeks of Work to Pay for Major Expenses, Original and Recalculated Index

Oh. Instead of taking 53 weeks of work to pay for a middle class life, it’s actually 39. Not only is that a good bit less than a year, but it’s lower than where Cass’s original index had us in 1996. It’s unclear whether Cass thinks 1996 was a good time and the trend is still up, but we can see now that the four categories are less than all-consuming.

And what about the raw expenses chart? Here’s my recalculation of the original, so you can see that even with a slightly different health series, it’s the same story:

Income vs Major Expenses, Recalculation of Original

And here’s what it looks like when you count health insurance properly:

Income vs Major Expenses, Recalculation of Original

Well, then. Looks like the middle class still has a good bit of headroom, even after paying the price of thriving. And let’s not forget: the things that Cass includes in his major expenses are the ones where costs have gone up the most over time. The rest of that income can go to things like food, leisure, technology, and travel, which have gotten cheaper, improved in quality or convenience, and come in a far greater variety than they used to. Not only can a single wage-earner still cover Cass’s expenses like they could in 1985, they can go beyond them in many ways that were unimaginable back then.

A lot of folks have raised good criticisms of the Cost of Thriving Index. I think it’s wrong to cherry-pick the fastest-growing categories of expenditures. I think it’s wrong to ignore changes in quality over time. I think it’s wrong to ignore the unpaid work that women used to do more of as a group than they do now. But even if you ignore all of that, what the Cost of Thriving Index shows, at least when calculated properly, is that the middle class is doing just fine, thank you very much. Would it be good if the price of health insurance and college were growing more slowly? Sure. Of course it would. But are they eating up the middle class? No. No they are not.

  1. In theory, at least. I don’t have time for an Apple rant in the middle of this chart rant. 

  2. Economically, you could argue that the employee bears the cost of the employer share of insurance payments. In fact, I would argue that. But if you want to look at it that way, you’d need to add that to wages as income to make the index make sense. It wouldn’t really make a difference to my argument, so I kept it this way for simplicity. 

  3. All my work, including my source data, is available on GitHub. The repository is here; the notebook with the calculations is here.

    Cass uses a Kaiser Family Foundation estimate for health insurance costs; I looked at Kaiser and it pointed to the Agency for Healthcare Research and Quality. The AHRQ numbers do not exactly match the ones Cass reports (they’re off by a few hundred dollars at both ends) and cover more years than he says Kaiser does. The differences are small enough that you can’t see them on the graphs except for the first few years, and I’m confident that whatever series he’s using and the one I used are very similar, especially in recent years. 

5 December 2018

A while back I wrote up a Python data script boilerplate that crystallized some of the things I found myself doing over and over. And while that boilerplate has served surprisingly well, I’ve found myself regularly making a few changes, so I figure it’s probably time for an update to version two.

I’ll show you the finished product first, and then walk through each chunk, noting what I’ve changed1.

The Boilerplate

#!/usr/bin/env python3
"""
A boilerplate script to be customized for data projects.

This script-level docstring will double as the description when the script is
called with the --help or -h option.
"""

# Standard Library imports
import argparse
# import collections
# import csv
# import itertools
import logging

# External library imports
# import pandas as pd
# import numpy as np

# Standard Library from-style imports go here
from pathlib import Path

# External library from-style imports go here
# from matplotlib import pyplot as plt

__version__ = '0.1'

log = logging.getLogger(__name__ if __name__ != '__main__'
                        else Path(__file__).stem)


def manipulate_data(data):
    """This function is where the real work happens (or at least starts).

    Probably you should write some real documentation for it.

    Arguments:

    * data: the data to be manipulated

    """
    log.info("Doing some fun stuff here!")
    return data


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', nargs='?', default='-')
    parser.add_argument('-ei', '--infile_encoding', default='utf-8')
    parser.add_argument('-o', '--outfile', default='-')
    parser.add_argument('-eo', '--outfile_encoding', default='utf-8')

    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument('-v', '--verbose', action='store_const',
                           const=logging.DEBUG, default=logging.INFO)
    verbosity.add_argument('-q', '--quiet', dest='verbose',
                           action='store_const', const=logging.WARNING)

    parser.add_argument('--version', action='version',
                        version=f'%(prog)s v{__version__}')

    args = parser.parse_args()
    args.infile = argparse.FileType(encoding=args.infile_encoding)(args.infile)
    args.outfile = argparse.FileType(
        mode='w',
        encoding=args.outfile_encoding,
        # newline='', # for csvs
    )(args.outfile)
    return args


def read_instream(instream):
    """Convert raw input for to a manipulable format.

    Arguments:

    * instream: a file-like object

    Returns: probably a DataFrame

    """
    log.info('Reading Input')
    return instream.read()


def main():
    args = parse_args()
    logging.basicConfig(level=args.verbose)
    data = read_instream(args.infile)
    results = manipulate_data(data)
    print(results, file=args.outfile)


if __name__ == "__main__":
    main()

Walkthrough

The first chunk is pretty self-explanatory. It sets the shebang (now explicitly Python 3), gives a docstring that doubles as program help info later, and organizes the imports. This time I’ve included commented imports that I often use, like Pandas and NumPy. I’ve also dropped some imports that became unnecessary since argparse is a bit more sophisticated than it used to be.

#!/usr/bin/env python3
"""
A boilerplate script to be customized for data projects.

This script-level docstring will double as the description when the script is
called with the --help or -h option.
"""

# Standard Library imports
import argparse
# import collections
# import csv
# import itertools
import logging

# External library imports
# import pandas as pd
# import numpy as np

# Standard Library from-style imports go here
from pathlib import Path

# External library from-style imports go here
# from matplotlib import pyplot as plt

The next chunk sets up our module-level info. I’ve added in a version this time around, because you should version things. I’ve also made the log name dependent on whether the script is loaded as a module or not, because the full name may be more helpful if the script ends up as a submodule somewhere, which occasionally happens.

__version__ = '0.1'

log = logging.getLogger(__name__ if __name__ != '__main__'
                        else Path(__file__).stem)

The data manipulation function is more-or-less unchanged, since this is where the actual work occurs. In general, you’ll want to rename this function to what it actually does.

def manipulate_data(data):
    """This function is where the real work happens (or at least starts).

    Probably you should write some real documentation for it.

    Arguments:

    * data: the data to be manipulated

    """
    log.info("Doing some fun stuff here!")
    return data

The parse_args function is in many ways the star of the show here, and I’m going to break it into different chunks. In the first chunk, we create the parser and add an infile and outfile argument. We create optional encoding arguments for each of those as well. I’ve changed infile to be a positional argument because that makes it easier to use with make-style workflow tools. We’re taking the infile and outfile arguments as strings, with default values of ‘-’; as we’ll see below, this is the least ugly2 way to make use of argparse’s neat FileType object while still letting the user set the encoding at runtime.

That encoding point is another difference between the old and new version. Previously, encodings were set as script-level constants, which really works against the reusability idea.



def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', nargs='?', default='-')
    parser.add_argument('-ei', '--infile_encoding', default='utf-8')
    parser.add_argument('-o', '--outfile', default='-')
    parser.add_argument('-eo', '--outfile_encoding', default='utf-8')

In the next chunk we just throw in some handy helpers. First we add mutually exclusive verbose and quiet flags to set the logging level. Then we add in a version flag, because gosh darn are we professional.


    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument('-v', '--verbose', action='store_const',
                           const=logging.DEBUG, default=logging.INFO)
    verbosity.add_argument('-q', '--quiet', dest='verbose',
                           action='store_const', const=logging.WARNING)

    parser.add_argument('--version', action='version',
                        version=f'%(prog)s v{__version__}')

Now we parse our arguments and convert our input and output files to FileType objects. The great thing about FileType is that you can set properties like mode and encoding, and it’s smart enough to wrap standard input and output if the provided filename is ‘-’. No more messing around with sys.stdin and io objects! It looks a bit odd because FileType actually creates a factory object, which is then called with the path to the file to open it.

I’ll admit that while I included standard input and output in my original boilerplate three years ago, it’s only in the last year or so that I’ve found myself using it a lot. It plays very well with cloud infrastructure, and makes modularity all that much easier. Working with text streams also allows you to use command-line tools like grep and sed, which are often undervalued, especially when working with large files.

    args = parser.parse_args()
    args.infile = argparse.FileType(encoding=args.infile_encoding)(args.infile)
    args.outfile = argparse.FileType(
        mode='w',
        encoding=args.outfile_encoding,
        # newline='', # for csvs
    )(args.outfile)
    return args

The read_instream function isn’t always one that lives through to the production script. In some cases, the read_instream function is entirely replaced by a pd.read_csv or something like that. If it’s simple enough, I keep it in the main function. But when you do have a complicated few steps to get the data in the right shape, it’s best to segregate it to its own function. The temptation is to put the code getting your data ready for manipulation or analysis in the manipulation or analysis function, but that’s bad design because it means you spend a lot of time in a function not doing the thing that is the point of that function. If only for mental clarity, keep it separate. Tidy your data here.

def read_instream(instream):
    """Convert raw input for to a manipulable format.

    Arguments:

    * instream: a file-like object

    Returns: probably a DataFrame

    """
    log.info('Reading Input')
    return instream.read()
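
If your input is, say, a CSV headed for a DataFrame, a pandas-flavored version might look something like this (just a sketch; it assumes you’ve uncommented the pandas import at the top of the script):

def read_instream(instream):
    """Read a CSV from a file-like object into a DataFrame."""
    log.info('Reading Input')
    return pd.read_csv(instream)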

Finally we have the standard main function: input, manipulate, output. On a suggestion from Arya McCarthy, I’ve switched to using the print function to print the final results, since print will implicitly handle conversion to a text format, while you have to do that yourself when using outfile.write. Of course, that line will often be replaced with to_csv or something like that.

def main():
    args = parse_args()
    logging.basicConfig(level=args.verbose)
    data = read_instream(args.infile)
    results = manipulate_data(data)
    print(results, file=args.outfile)


if __name__ == "__main__":
    main()

Why Scripts instead of Notebooks?

I’m not going to re-hash the Unix Philosophy or the overkill question, since I covered those last time. But the question that’s even more pressing now than it was three years ago is: why the heck are we writing scripts instead of doing everything as a Jupyter Notebook?

I guess I’m a bit of a notebook skeptic, even though I use notebooks every day. I recognize that people use them all the time to do large-scale, impressive things in production. Look at Netflix. They’re great for experimentation, they’re great for graphics.

But I just don’t trust them. I don’t like that cells can be run out of order, or multiple times, or that you can have a variable you defined and then deleted or redefined so that it no longer matches anything on the screen. I don’t like that notebooks don’t work cleanly and directly with version control, and I don’t like that they don’t work cleanly and directly with text streams on the command line. You can’t import them, and 99 percent of them are named “Untitled”.

Maybe that means I’m just not disciplined enough, and maybe it means I’m a grumpy old man. I can live with that. But scripts have never let me down.

Tell me what you think!

So that’s the new boilerplate. If you use it, or have questions or edits, I’d love to hear from you on Twitter or just email me.

  1. It’s also on GitHub here, of course 

  2. It’s still a little ugly. 

17 April 2018

This piece in the Atlantic has been making the rounds, suggesting that Jupyter notebooks are “what comes next” as the scientific paper goes the way of legal vellum and religious papyrus. The argument runs something like this: most of the actual work of science is computational. Notebook environments give an excellent environment for doing replicable computation while also interspersing explanatory text, and Jupyter notebooks are the best ones around. This clears away the need for much of the gobbledygook for which scientific papers are so widely criticized, and puts the work front and center.

And some of that is right. Jupyter is amazing; my team and I are actually working right now to build a Jupyter-based research platform for our organization. And well-documented Jupyter notebooks are certainly a step up from the incomprehensible do-files that accompany many papers, especially when they’re augmented with simple controls to modify parameters, for example.

But the problem with this line of thinking is that it gets the basic point of a scientific paper wrong. The computation is not the most important part of a paper. The most important part is the writing.

A paper should do three things:

  1. Orient the reader to the current state of knowledge
  2. Offer relevant evidence
  3. Integrate the evidence to produce a new current state of knowledge

Computations happen, if at all, in number 2. Jupyter’s great for presenting that. Jupyter doesn’t help at all with numbers 1 and 3, and they are at least as important, if not more so. What helps them—and really the only thing that helps them—is clear thinking and writing.

Now, obviously you can have text in a Jupyter notebook. You have Markdown and HTML, and that’s a lot! But Jupyter is not a writing environment. There’s no formatting help if you forget the Markdown syntax. You can’t track changes or get comments easily from your peers or coauthors. You can’t manage your bibliography. And it’s just not a pleasant interface for composing text.

For that matter, it’s also not great for reading text. The lines are too long, for one thing. You can’t easily jump to the footnotes and back, like you can in a well-formatted PDF. There aren’t columns, and there can’t be.

That’s not really a criticism; that’s not what Jupyter is for. It has text support for writing short bits of text that are secondary to the computational work. But let’s not pretend that Jupyter will ever be an excellent replacement for Word or Vim on the editing side, or a PDF on the reading side.

So, if not notebooks, is there something that can replace the crappy scientific paper? At the risk of being dull, how about: the good scientific paper? Papers where introductions and conclusions aren’t unnecessarily ponderous, where the evidence is delivered with minimal gobbledygook and presented in a replicable way1, and where implications are explored with straightforwardness and intellectual humility?

Of course, that’s what we’re supposed to be doing now. John H. Cochrane has a wonderful write-up on how to do it in economics2. And there are good examples out there. Watson and Crick’s paper introducing the double-helix structure of DNA? It’s one page long. Including a picture.

Now, if we don’t do that in the future, it’ll be for the same reason we don’t do it now. Journals and peers don’t make us. Publish-or-perish puts more emphasis on producing papers than on producing good papers. Papers are written to be published, not read.

Fixing that technology—the institutional, meta-technology of publishing—would fix the scientific paper. But I don’t think Jupyter Notebooks, glorious powerful magic though they are, can do that.

  1. Jupyter can help with this! 

  2. NOT SCIENCE LOL! I know, I know. 

29 January 2018

This contentious interview of Jordan Peterson, a University of Toronto Psychology Professor, by Cathy Newman of the UK’s Channel 4, has garnered a huge amount of attention. While the interview was nominally to promote Peterson’s upcoming book, Newman clearly believed that she was going to be able to nail him as an ignorant bigot. Unfortunately for her, the general consensus is that Peterson was able to avoid that outcome, and make her look pretty silly in the process.

Much of the conversation (see here, for example) has focused on Newman’s interrogatory tactics and how Peterson chose to respond to them, but I think there are lessons to be learned here about communicating with statistics. The first time I watched the video, my initial reaction was that Peterson clearly understood the statistics he wanted to use to support his points, and the interviewer did not. Those statistics are not all that controversial, even among those who tend to disagree with Peterson’s conclusions, but throughout the interview Newman consistently jumps from his rather modest claims to extreme (and sometimes bizarre) conclusions that she assigns to him.

Even if, as some suggest, Newman’s ignorance here was deliberate, her responses reflect the kind of intuitive interpretation of statistics that I’ve seen many times. Statistics are not intuitive. They are tricky. If you need to use them to communicate with a non-statistician—and you will—it’s important to help people understand what the statistics you’re using do and do not imply.

Let’s look at two sections where, with the help of hindsight, we might be able to improve on Peterson’s presentation. First, let’s examine the initial conversation about the pay gap.

Peterson makes two mistakes here. First, in an uncharacteristically imprecise use of language, he says that the pay gap “does not exist,” when that’s not what he means. Over a minute later, he clarifies that he actually means “does not exist solely due to gender”, but by that point a minute of airtime has gone to waste.

The more common mistake Peterson makes in the pay gap discussion, though, is focusing on the method. He starts talking about multi-variate analysis, and the interviewer—and most home viewers—have no idea what it means.1 When challenged by Newman on why he keeps talking about it, he enters into a mostly fine description of why controls are important in regression (although he does make it sound like he’s doing a series of one-to-one comparisons rather than a single composite analysis). He’s not wrong, but he’s also not making his point; the only thing that this part of the conversation does for him is make it sound like he knows what he’s talking about, but the lay audience won’t get anything out of it.
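
For the statistically inclined, here is roughly the kind of analysis he’s gesturing at, sketched with statsmodels and entirely made-up data; the point is just that the gender coefficient is estimated while holding the other factors constant:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up individual-level data, just to show the shape of the analysis.
rng = np.random.RandomState(0)
n = 1000
wages = pd.DataFrame({
    'female': rng.randint(0, 2, n),
    'age': rng.randint(20, 65, n),
    'agreeableness': rng.normal(0, 1, n),
    'hours_worked': rng.normal(40, 8, n),
})
wages['log_wage'] = (3 + 0.02 * wages['age'] + 0.01 * wages['hours_worked']
                     - 0.05 * wages['agreeableness'] + rng.normal(0, 0.3, n))

# The regression controls for the other factors, so the coefficient on
# 'female' is the pay gap that remains after accounting for them.
model = smf.ols('log_wage ~ female + age + hours_worked + agreeableness',
                data=wages).fit()
print(model.params['female'])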

Everyone who communicates about regression-type analysis needs to have a stock phrase to describe what’s important about it and move on, and I was a bit surprised Peterson didn’t have one ready. Here’s how I might have phrased the point he was making in a way that could keep the conversation focused on the point Peterson was driving at:

“It does seem that way, but what repeated studies have reliably found is that when you account for a person’s age and their personality and their aptitude and their interests, then the difference their gender makes to their salary is very small. So a man and a woman who are similar in other ways should expect to make about the same amount of money. So we know that the pay gap is not mostly due to gender bias.”

I timed myself and that took 22 seconds to say, without getting the methodology behind the point in the way of the point itself. Peterson and Newman spent six times that on an unfruitful conversation about how statistics work.

The second difficulty that stood out to me about the interview was the way that Peterson and Newman talked past each other on the subject of population characteristics and individual characteristics.

The best example picks up right where the last stopped:

Again, Peterson makes an unforced error when he says “Women are less agreeable than men,” and again, the problem isn’t that he’s wrong exactly but rather that what he’s saying will be taken differently by the viewers than he means it. The natural implication of “women are less agreeable than men” is that all women are less agreeable than all men.

This confusion is nicely demonstrated by the exchange that follows. Newman accuses Peterson of “a vast generalization,” by which she means that he’s making a statement about all individual women. He says that “it’s not a generalization,” and what he means is that it’s a statement about the distribution of that trait among the population of all women. The disconnect is that the same words mean something slightly different to the two because one is thinking statistically and the other isn’t. And the onus has to be on Peterson to make his point clear.

At first I thought the best phrase to do that would be “agreeableness is more prevalent among women than men,” but I don’t think that’s quite right, because agreeableness is a continuous variable. You could opt for something less precise like “more women are highly agreeable than men,” but that doesn’t quite fit right either. I think the best solution here is a small modification: “Women tend to be more agreeable than men.” People understand the non-universality of tend, and that avoids the confusion.

This one isn’t so much a question of wasting time as of avoiding confusion. To their credit, Newman and Peterson reach consensus on what they mean fairly quickly with the final exchange in that clip. They just both get a bit annoyed doing it.

Peterson warmed up as the interview went along, and I think he handled a second go at much the same argument much better:

In that exchange, Newman fires off a number of conclusions that she claims are implied by Peterson’s arguments. All of them are predicated on the idea that his population statistics determine what will happen with every woman. Instead of talking about how statistics work, he goes to the concrete example of Newman herself. That allows him to make his point without any confusion: she’s been successful precisely because she’s pursued her career in the way that he says matters more than gender. There’s no way to confuse “you, as a woman, are successful because you have battled for it” with “the need to battle for success means women will never succeed.” Sometimes when you’re talking about statistical truths, the best way to do it is to avoid discussing them statistically at all.

Now, the point of this isn’t that Peterson’s dumb and I’m smart; I’ve had time to consider and edit. The point is that communicating statistics is incredibly difficult, even if you understand them well yourself. It’s a separate skill, and it takes practice. When you screw it up, it’s tempting to blame the ignorance of your listeners, but that’s too easy; it’s far better in the long run to focus on how you can be better at communicating statistical facts. Then people might be more interested in what you have to say.

Some stray other thoughts about the interview:

  • In general, the interview has been scored as a hands-down win for Peterson. If it had ended after the first ten minutes or so, I’m not sure that would have been the case. I think Newman and Channel 4 deserve a bit of credit for resisting the urge to edit it down to that.

  • That said, the utter blue screen that happens to Newman is one of the most stunning things I’ve ever seen on television.

  • At one point Newman argues against the phrase “the typical woman” because “all women are different,” to which Peterson replies that “they’re different in some ways and the same in others.” I found her comment utterly bizarre; if all women are totally different, what makes them women?

  1. As if to prove this point, both Newman and the Channel 4 caption-writer who worked on this clip thought he was saying “multi-varied analysis.” 

21 August 2017

Introduction

I’ve been using Vim as my editor for over ten years1. That’s a long time to build up settings and plugins, and generally get a lot of cruft into my vimrc. These days, when I go in there, I don’t always remember what a particular setting or plugin does or why I put it there, and I rarely look to see if there are updated versions of anything.

So I thought it would be both advantageous and fun to clear out my settings and start again: go Vim Zero, and build up from there. And it was fun! Here’s what I came up with.

Requirements

The first question was: what do I really want an editor to do? I use Vim for writing and for coding. The former is nearly always in markdown, generally Pandoc-flavored but sometimes a different flavor. I write code most frequently in Python, and a decent amount in HTML, JavaScript, CSS and SCSS, and Make. That means I need good support for multiple languages—including syntax checkers and completion—and I need a writing environment that feels comfortable.

I also write both code and text on multiple computers. I use Linux when I can, but use Windows at work and occasionally find myself on a Mac. That means I need to be able to sync everything using git, and all my plugins and settings have to work in the same way. I want my experience in GVim on Windows to be as close as possible to my experience in the terminal on my Arch box.

Finally, I want to keep things as simple and elegant as possible. I want this to still be Vim when I’m done, which means I don’t want a bunch of functionality I’m not using or a bunch of nonsense I don’t need on my screen2. In general, I want to prefer built-in functionality to plugins, and simple, tightly focused plugins to wide-ranging and powerful ones. With that in mind, I started with the settings that made vanilla vim as pleasant as possible.

Built-in Functionality

Well, that’s almost true. I knew I was going to be using Tim Pope’s vim-sensible plugin simply because it sets a goodly number of the things you see in almost every vimrc, like being able to backspace over anything and setting incremental search. We’ll get back to that later.

General Functionality

First I set the mapleader to comma so that it applies for all my mappings. I set hidden to allow for unsaved background buffers, and spell so that I don’t reveal my horrible spelling to the world. Turns out you can set new splits to open on the right, so I turn that on (we live in a world of widescreen monitors, how is this not the default?). I also turn on persistent undo files, but those settings show up in their own section below. The rest looks like this:

" Built-In Functionality
"" General
let mapleader = ','

set hidden " Allow background buffers without saving
set spell spelllang=en_us
set splitright " Split to right by default

Text-Wrapping

In general, I want things wrapped at 79 characters—enough, in fact, that it’s easier for me to turn it off when I don’t want it than turn it on when I do. I also like having a highlighted column at 80 characters as a visual guide. I always want hard wraps, so I turn off soft-wrapping.

"" Text Wrapping
set textwidth=79
set colorcolumn=80
set nowrap

Search and Substitutions

I find I want the g flag in my s/ commands far more often than I don’t, so I set it to be on by default. I use highlight searches because that’s half the point, and use the handy combination of ignorecase and smartcase to ignore case when I type in lowercase, but not when I type in capital letters. I also have my first mapping here: comma-space for clearing the highlighted searches. It makes a nice slapping noise, which I quite enjoy, as if to say “get that out of here.”

"" Search and Substitute
set gdefault " use global flag by default in s: commands
set hlsearch " highlight searches
set ignorecase 
set smartcase " don't ignore capitals in searches
nnoremap <leader><space> :nohls <enter>

Tabs

Because I am not a horrible human being who hates joy and love and light, I use four spaces instead of tabs whenever I can3. The following combo will do that, and should be required by law.

"" Tabs
set tabstop=4
set softtabstop=4
set shiftwidth=4
set expandtab

Backup, Swap, and Undo

The next section might be a little controversial. Backup files, swap files, and undo files are great features of vim, but I hate having them clutter up my actual work directories. This isn’t so bad on Linux, where hidden files are simple things, but on Windows, which will incomprehensibly ignore leading dots when doing file completion, it’s awful. So, after I turn on undo files for persistent undo across sessions, I set folders inside my vim folder to hold all of these (see the implementation notes at the end for how I make empty folders with Git).

"" Backup, Swap and Undo
set undofile " Persistent Undo
if has("win32")
    set directory=$HOME\vimfiles\swap,$TEMP
    set backupdir=$HOME\vimfiles\backup,$TEMP
    set undodir=$HOME\vimfiles\undo,$TEMP
else
    set directory=~/.vim/swap,/tmp
    set backupdir=~/.vim/backup,/tmp
    set undodir=~/.vim/undo,/tmp
endif

NetRW

Some folks won’t like this section either, because it’s about NetRW, vim’s file explorer. It gets more hate than it deserves, but I find it useful4. I set it to have the detail view with human-readable file sizes. The hiding behavior is a little odd, so I just tell the explorer to hide dotfiles, and to set them as hidden by default (this can be toggled with a). Then I turn off the banner. I also add a mapping to start the explorer; the exclamation point means that if the current buffer has unsaved changes, the Explorer will split vertically instead of horizontally.

""" NetRW
let g:netrw_liststyle = 1 " Detail View
let g:netrw_sizestyle = "H" " Human-readable file sizes
let g:netrw_list_hide = '\(^\|\s\s\)\zs\.\S\+' " hide dotfiles
let g:netrw_hide = 1 " hide dotfiles by default
let g:netrw_banner = 0 " Turn off banner
""" Explore in vertical split
nnoremap <Leader>e :Explore! <enter>

General Mappings

To wrap up the built-in functionality, I have my general mappings. Mapping a semicolon to the colon in normal mode is surprisingly useful. I use control-H and -L to cycle through my buffers, because that feels like moving left and right to me. I use comma-q to quit a buffer and comma-w to save. Finally, I use comma-x to access the system clipboard register, which allows me to copy and paste between vim and other programs. That last is the only mapping which I have set to work in all modes, and I use it all the time.

"" Mappings
nnoremap ; :
nnoremap <C-H> :bp <enter>
nnoremap <C-L> :bn <enter>
nnoremap <Leader>w :w <enter>
nnoremap <Leader>q :bd <enter>

noremap <Leader>x "+

Python Version

I use Python 3 more or less exclusively. Many of the libraries I use most are (finally) moving to require it, and I like it better anyway. So I have this little autocommand group to set my omnicompletion to Python 3:

"" Python Version
augroup python3
    au! BufEnter *.py setlocal omnifunc=python3complete#Complete
augroup END

Plugins

Plugins are a wonderful part of the Vim infrastructure, and they’re what let you really make the editor your own. That said, folks tend to go overboard; I see vimrc files floating around with dozens of plugins, and it’s just not necessary. When I started this project, I decided to only add plugins I didn’t want to live without, and I think I’ve kept it to a reasonable number. A fantastic resource has been Vim-Awesome, which makes it easy to find plugins by functionality, and also see which are popular, which are maintained, and so on. I knew about some of these, but others I didn’t, and so the site was a huge help.

Plugin Manager

Once upon a time, I installed plugins manually. Then I used my package manager and a script called vim-plugin-manager. Then Tim Pope wrote Pathogen, and like the rest of the world, I switched to it immediately. Then Vundle came along with its Git-driven management, and I happily used that until I started this project.

When I went to see what was out there, I found that Vundle was still a good option, but I was charmed by the simplicity of VimPlug, which didn’t need any rtp manipulation in my .vimrc and could do parallel installations and updates. I decided it was worth making the switch.

This breaks my rule a little bit about preferring built-in functionality; Vim 8 does have a built-in plugin manager of sorts. Unfortunately it would mean taking a step back: there’s no way to keep your plugins updated and I just don’t want to go back to doing it manually and fiddling with submodules in my vimfiles repository. So VimPlug it is! The entirety of my plugin installation section looks like this:

" Plugins 

"" Installation with VimPlug
if has("win32")
    call plug#begin('~/vimfiles/plugged')
else
    call plug#begin('~/.vim/plugged')
endif

""" Basics
Plug 'tpope/vim-sensible'
Plug 'sheerun/vim-polyglot'
Plug 'flazz/vim-colorschemes'

""" General Functionality
Plug 'lifepillar/vim-mucomplete'
Plug 'scrooloose/syntastic'
Plug 'sirver/ultisnips'
Plug 'honza/vim-snippets'
Plug 'tpope/vim-commentary'
Plug 'chiel92/vim-autoformat'

""" Particular Functionality
Plug 'junegunn/goyo.vim'
Plug 'junegunn/limelight.vim'
Plug 'vim-pandoc/vim-pandoc-syntax'
Plug 'godlygeek/tabular'

call plug#end()

I’ll walk through each of those in more detail and give the configuration I have for each. As you can see above, I group my plugins into three groups: basics, which include simple settings, filetype and syntax support, and color schemes; general functionality plugins, which add features that are generally useful when editing code or writing text; and particular functionality plugins, which are only useful in particular situations.

Basics

I’ve already mentioned Tim Pope’s excellent Vim-Sensible plugin, which there’s really no downside to installing. It just gives you a lot of sane defaults, and the code is perfectly readable if you want the details.

Vim-Polyglot and Vim-Colorschemes are both omnibus packages. Essentially, they’re curated lists. At first this seemed like overkill to me—why not just install the ones I want? But then I remembered just how many times I’ve switched to a new language and found that either vim didn’t have a filetype for it, or that the user community had a few fixes for the built-in version of indentation or something. Vim-Polyglot collects all of the best of those, and that just saves me having to do it later. Similarly, Vim-Colorschemes has at least one color scheme you will like, even if you’re as picky as I am5. I turn on gui-style colors for the terminal and use the Darth style:

"" Colors
set termguicolors
colorscheme darth

General Functionality Plugins

Autocompletion

Vim isn’t an IDE, and shouldn’t be, but autocompletion is really, really nice. That said, I lived without it for a long time because I didn’t like my options. YouCompleteMe is a pain on Windows. So is NeoComplete, and while its predecessor NeoComplCache is more easily cross-platform, it can be slow and frustrating and isn’t updated any more. VimCompletesMe isn’t bad, but has a few quirks I don’t like and is entirely tab-driven, when I would rather just have my options pop up for me.

MuComplete gives me what I want. It does omnicompletion, file completion, snippet completion (see below), pops up as I type and doesn’t get in my way. And it’s fast. Here’s the configuration to make it work:

"" Autocompletion
set completeopt=menuone,noinsert,noselect
set shortmess+=c " Turn off completion messages

inoremap <expr> <c-e> mucomplete#popup_exit("\<c-e>")
inoremap <expr> <c-y> mucomplete#popup_exit("\<c-y>")
inoremap <expr>  <cr> mucomplete#popup_exit("\<cr>")

let g:mucomplete#enable_auto_at_startup = 1 

Snippets

I’ve gone back and forth on snippets for years, but for the moment I’m pro. They save a lot of time writing HTML and encourage me to write docstrings6. Here, Ultisnips has been around for a long time, and while SnipMate also exists as a venerable option, Ultisnips still feels like the gold standard. A solid compilation of snippets is available with the Vim-Snippets plugin. You don’t need any configuration for either as far as I’m concerned, but you can configure MuComplete to take advantage of Ultisnips with this line:

call add(g:mucomplete#chains['default'], 'ulti')

Commenting

I’ve used NerdCommenter for a long time, but Commentary, another from Tim Pope, gives a minimalist yet powerful implementation. You use gc to toggle comments, and that’s about it. That’s all I want here.

Syntax Checking

Everything I write is perfect the first time, obviously, but sometimes I read other people’s code. Syntastic is a truly clever plugin for running syntax checking. Rather than write its own rules it uses external checkers, like flake8 and tidy, which is a very Unix way of approaching the problem. Hard to beat.

Autoformatting

Of course, if I can get a program to fix my—er, other people’s mistakes automatically, so much the better. The aptly named Vim-Autoformat solves this problem nicely. Like Syntastic, Autoformat uses external programs to format your code. It doesn’t have support for as many programs built-in as does Syntastic, but it’s very easy to define your own.

I set autoformat to run when I save a file, but not to do the default vim sequence of autoindenting, retabbing, and removing trailing spaces. Effectively, this means it only does anything if I have a formatter installed.

"" Autoformat
au! BufWrite * :Autoformat
let g:autoformat_autoindent = 0
let g:autoformat_retab = 0
let g:autoformat_remove_trailing_spaces = 0

One odd corner case I’ve had to be a bit clever to deal with is markdown. Generally when I write, I’m writing in Pandoc’s syntax, but in a few situations (primarily this blog), I’m using a slightly different form of the language. Now, I can use Pandoc to auto-format in either case, but I’ll need to vary the external call. The way I’ve solved this is to use an autocommand to set a default markdown flavor in a buffer-scoped variable, and then set a different flavor for specific matches—in this case when I open a file with a full .markdown extension that has a directory named “blog” somewhere in its path. Then I use the value of the flavor in the call to pandoc. That all looks like this:

augroup markdown_flavor
    au! BufNewFile,BufFilePre,BufRead *.md 
                \ let b:markdown_flavor="markdown"
    au! BufNewFile,BufFilePre,BufRead *.markdown
                \ let b:markdown_flavor="markdown"
    au! BufNewFile,BufFilePre,BufRead */blog/*.markdown
                \ let b:markdown_flavor="markdown_github".
                \"+footnotes".
                \"+yaml_metadata_block".
                \"-hard_line_blocks"
augroup END

let g:formatdef_pandoc =
            \'"pandoc  --standalone --atx-headers --columns=79'.
            \' -f markdown -t ".b:markdown_flavor'
let g:formatters_markdown_pandoc = ['pandoc']

Particular Plugin Functionality

Distraction-Free Writing

Distraction-free writing is an interesting concept that has been around for a while. The first implementation I remember hearing about was WriteRoom, and since then the concept has even made its way a bit into recent versions of Word. I don’t always use it when I’m writing, but sometimes it’s helpful. Goyo is about as good an implementation as you could ask for, especially when combined with the Limelight plugin to focus on individual paragraphs. You only need two lines to make them work together nicely:

"" Goyo & Limelight
autocmd! User GoyoEnter Limelight
autocmd! User GoyoLeave Limelight!

Pandoc Syntax

Vim doesn’t have a built-in Pandoc filetype or syntax file, and Pandoc really goes a long way beyond simple markdown. There’s a Vim-Pandoc plugin, but I found myself turning off an awful lot of the functionality because it was either in my way or a re-implementation of something I already had. Finally I decided just to use the syntax file, which is helpfully separated into its own plugin named Vim-Pandoc-Syntax.

To get files to use the correct syntax, you have to use an autocommand. The syntax plugin is also surprisingly powerful; I turn off the “conceal” functionality because I don’t like the way it looks, but I’m very impressed by the ability to use the syntax of the embedded language in fenced code blocks. Here’s my configuration:

"" Pandoc
augroup pandoc_syntax
    au! BufNewFile,BufFilePre,BufRead *.md set filetype=markdown.pandoc
    au! BufNewFile,BufFilePre,BufRead *.markdown set filetype=markdown.pandoc
augroup END

let g:pandoc#syntax#conceal#use = 0
let g:pandoc#syntax#codeblocks#embeds#langs = ['python', 'vim', 'make',
            \  'bash=sh', 'html', 'css', 'scss', 'javascript']

Table Formatting

Tabular is one of those wonderful little pieces of code that does one thing extremely well. Tabular makes tables. That’s it. When you don’t need a table, you don’t have to think about it. When you do, it saves you ten minutes of fiddling around. It plays very well with Pandoc.

Plugins I Didn’t Use

Obviously there are lots of plugins I didn’t install. Here are a few that I know are popular, and why I didn’t use them:

  • Fugitive: See, I don’t love everything Tim Pope does. I had this installed for a while, but never found myself using it. I’m happy on the command line.
  • Airline: These kinds of plugins just strike me as a way to throw lots of distracting information onto the screen. I don’t see the attraction.
  • NerdTree: I’m happy with NetRW the way I have it.
  • Tagbar: I don’t want to have to install ctags, and I can just search.
  • CtrlP: Apparently people have more trouble than I do finding things?
  • Multiple-Cursors: I almost went for this one, but Christoph Hermann’s article on it convinced me that it didn’t do anything you can’t just do with built-in functionality.

Gvim

My gvimrc is simple: I turn everything off, and set my fonts:

set guioptions-=m " Turn off menubar
set guioptions-=T " Turn off toolbar
set guioptions-=r " Turn off right-hand scrollbar
set guioptions-=R " Turn off right-hand scrollbar when split
set guioptions-=L " Turn off left-hand scrollbar
set guioptions-=l " Turn off left-hand scrollbar when split
set guicursor+=a:blinkon0 " Turn off blinking cursor

if has("win32")
    set guifont=Consolas:h11
else
    set guifont=Inconsolata\ 12
endif

Implementation Notes

Vim keeps its files in a .vim folder on Linux, and a vimfiles folder on Windows. Happily, in new versions of vim, the vimrc and gvimrc can live inside this folder, which makes keeping everything in git easier.

To ensure that my directories for undo, backup, and swap exist but aren’t versioned, I put a .gitignore in each with this content:

*
!.gitignore

I also have a .gitignore in the root directory, to ignore my netrw history, my spelling files, and my plugins.

.netrwhist
spell/*
plugged/*

I leave the autoload directory versioned, which means VimPlug itself gets versioned. This saves me a step when I move to a different machine, and it’s not too terrible to update the repository when I occasionally update VimPlug itself.

With all that done, all I need to do to get set up on a new box (with git installed) is clone my vimfiles repository, fire up vim or Gvim, run :PlugInstall, and restart, which takes almost no time at all.

Final Thoughts

I should disclose that this post has taken me an absurd amount of time to write. I’ve spent hours now comparing plugins, trying things out, reading through documentation, and changing my mind. I know Vim far better than I did when I started, no doubt about it. If your vimrc has gotten a bit stale, it might be a good time for you to do a Vim Zero experiment yourself. Or, feel free to build off my setup, which you can find on GitHub here. If you decide to give it a go, be sure to let me know on Twitter.

  1. Yes, I’ve heard of Sublime and PyCharm and Atom and all of them. No thank you. Too much noise. You have fun, though. 

  2. I mean, just look at this website. I like things clean and simple. 

  3. Makefiles are the exception, which is infuriating

  4. The only real problem I have with it is that if you try to edit a directory path, you get a NetRW buffer that doesn’t go away. So don’t do that. 

  5. My requirements here include: black background and nothing else, not too much yellow, not too much orange, no neon-type colors (the hot pink completion menu just ruins the Ocean family of schemes, and my old vimrc had code to manually replace it), &c, &c. 

  6. Oh, don’t judge, you don’t write them either. 

3 January 2017

I’m a Linux guy at heart, but like a lot of folks, I’m stuck on Windows at work. But even on Windows, I spend a lot of time in the command line. Now, the whole point of Windows was to get away from the command line, so the default command prompt, cmd.exe, has never been much more than a glorified DOS shell. The alternative for power users, PowerShell, is something I’ve always found to be confusing and terrible.

I want my Unix tools, dang it! And, open source being open source, there have been a few attempts to make that happen, most notably Cygwin and MinGW. But while those are certainly impressive projects, they’re not something I want to try to keep updated for my team, or document for the researchers I want using my scripts; the install and update tools are just too complicated for that.

Happily, Anaconda is there to save the day. Again.1 The Anaconda default channels include a suite of tools built with M2, a project descended from MinGW. That means that not only do you get to manage the tools you want with the excellent conda package manager, but you can also easily reproduce your environment elsewhere using conda requirement files. And no administrator privileges needed!

There’s only one drawback, which really isn’t a drawback: you can’t run m2-based programs in the default anaconda environment. Why isn’t that really a drawback? Because it keeps your path clean if you ever need to be using the Windows built-in tools instead of Unix ones that happen to have the same name.

To install m2 (assuming you have Anaconda already installed), just create a new conda environment or switch to one that already exists. Then install the m2-base package. I have one environment called “main” which I use when I’m not isolating requirements on a project for release. You can create one by running this command (replacing main with the desired name):

conda create -n main m2-base

Now, whenever you want to use your Unix tools, just run activate main (or whatever you called the environment).

Or, if you have an environment you want to use already, just activate it and conda install m2-base.

That will give you all the coreutils, as well as bash if you want to use that. And, while this gives you a perfectly good base system, there are plenty more tools available, like make. Just run conda search m2-.* to see them all.

Is it as good as using Linux? No. No it is not. But as far as I can tell, it’s the next best thing.

  1. If you aren’t using Anaconda to manage your Python environment on Windows, you really, really should be. Start here

8 November 2016

Back in January, I introduced the 430 Model for using polls to forecast elections. The name comes from the idea that you get 80 percent of your results from 20 percent of your work, and I only wanted to do about 20 percent of the work Nate Silver does with FiveThirtyEight.

Since then I’ve wanted to do a bigger and better version for the general election. But since I’m the kind of guy who only wants to do 20 percent of the work, most of that didn’t happen. But I do have a slightly bigger and better version! So that’s something.

As before, I’m using the poll data from the Pollster API, combining the results and weighting by age to create a super-poll. Then I’m running 10,000 simulations using the results of the super-poll to figure out how likely each candidate is to win.

The code for this version is in a Jupyter notebook on GitHub, so you can play with it if you want to pretend Jill Stein matters or something.
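
In stripped-down form, the simulation step looks something like this (a sketch using the final super-poll numbers from the table below, not the actual notebook code):

import numpy as np

# Draw 10,000 hypothetical outcomes from the super-poll estimates.
n_sims = 10000
clinton = np.random.normal(48.4, 0.2, n_sims)
trump = np.random.normal(42.7, 0.2, n_sims)
print((clinton > trump).mean())  # share of simulations Clinton wins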

Now to the results! Using the super-poll results since July, the race has looked like this:

Clinton is winning by a lot in the super-poll

Those lines are the 95% confidence bands. The story here is that, while things were pretty uncertain early in the race, Clinton began to separate after the first debate in September, and Trump has never been able to catch up. The recent revelations about the emails, for all the sturm und drang they caused, don’t really show up here. She’s just ahead. How ahead? This ahead:

Candidate          Mean    Standard Deviation
Hillary Clinton    48.4%   (0.2)
Donald Trump       42.7%   (0.2)
Other               8.9%   (0.1)

Well, she’s not over fifty. But five and a half points is a pretty solid lead with as many polls as we’ve had. In 10,000 trial simulations, she won every single one. That’s as comfortable as you get.

So, any caution? If, as many suspect, there’s a weird Trumpian version of the Bradley Effect, then he has a better chance than the model projects, which is none. But with a standard error of 0.2 percentage points, he needs a swing of better than two and a half points if he’s drawing them all from Clinton, and more if he’s getting them from the “Other” category. In the best case, that’s one secret supporter to every public one. That’s a tough row to hoe.

Still, this is a strictly poll-based model. If the polls are systematically wrong, then so is the model. We didn’t really see systematically wrong polling in the primaries, though we have seen it in votes abroad, such as Brexit. But that has to be the hope, if you’re a Trump supporter: that the polls are almost all wrong and in the same way.

My gut says no. Clinton’s got this one.

7 April 2016

Data work can take a long, long time. Once you’ve moved beyond small-scale projects, you just have to get used to doing something else while your machine chugs away at cleaning, preparing, or analyzing your data. And since most of the time you have to try a few things to see what comes out, you’ll have to sit and wait through multiple rounds of each step.

But a lot of people make it worse by doing one of two things: either they have one big script that they re-run from start to finish every time they make a change, or they run each step on demand (say, in a Jupyter notebook) and mentally keep track of what they need to do in what order.

The first is bad because every single run will take the whole run time, and your life will be swallowed up in waiting. The second is bad because you have to depend on yourself to keep what can be a pretty intricate map of your analysis flow in your head.

The Unix utility Make provides a better solution. Make is the old standard build system, designed to help with the compilation of large programs. Just like with data projects, compiling large programs takes forever, has a variety of smaller steps where you might need to go back and edit something, and has a web of dependent steps you might need to re-run depending on what you’ve changed. In other words: it’s a perfect fit.

What’s more, Make helps you keep your steps separate computationally, which helps you keep them clear conceptually. You don’t even have to use the same language for each step: I’ve had occasion to mix Python, Scala, and Stata in the same project,1 and Make made that much less painful.

And here’s the good news: if you’re on a Mac or Linux, it’s probably already installed. On Windows, just install Gow.

So how does it work? The basic idea is, you write a makefile filled with recipes that tell you how to make a target. Each target can have dependencies, and if a dependency has been modified more recently than the target, the recipe for the target gets re-run.

So let’s say I want to classify some text in a folder, using some trainers in another folder. Both have to be prepped for analysis, using a script in my scripts subdirectory. I might put the following recipe in my makefile:

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	python scripts/clean_text.py data/raw/trainers -o data/clean/trainers.csv

What’s going on here? The file before the colon is the target, the thing we’re trying to make. The files after it are the dependencies, things that need to be made before the target. The line beneath, which is indented by a tab—yes, a tab—is the command that makes the file. Now, if we were to run make data/clean/trainers.csv, make would check to see if either the clean_text.py script or the trainers directory2 had been modified more recently than the output file, and if so, it would run the script to create the file.

We can write this recipe a simpler way:

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	python $^ -o $@

In a makefile, $@ stands for the target and $^ stands for the list of dependencies, in order. That means that when the dependencies are a script followed by that script’s inputs, we can write the recipe with the automatic variables instead of spelling out the file names again. With the first recipe, for instance, $^ expands to scripts/clean_text.py data/raw/trainers and $@ to data/clean/trainers.csv, so Make runs exactly the command we wrote out by hand above.

Now let’s say we use the same script to clean the unlabeled input. We just need to add it as a new target:

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	python $^ -o $@

data/clean/unlabeled.csv: scripts/clean_text.py data/raw/unlabeled
	python $^ -o $@

Easy! Now if we update clean_text.py, Make knows we need to remake both those targets. But I hate repeating myself. Luckily, Make gives us canned recipes:

define pyscript
python $^ -o $@
endef

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	$(pyscript)

data/clean/unlabeled.csv: scripts/clean_text.py data/raw/unlabeled
	$(pyscript)

In fact, since I write all my scripts based off of the same boilerplate, let’s fill out what the whole project might look like:

define pyscript
python $^ -o $@
endef

data/results/output.csv: scripts/classify.py data/classifier.pickle data/clean/unlabeled.csv
	$(pyscript)

data/classifier.pickle: scripts/train_classifier.py data/clean/trainers.csv data/trainer_labels.csv
	$(pyscript)

data/results/analysis.csv: scripts/tune_classifier.py data/clean/trainers.csv data/clean/unlabeled.csv
	$(pyscript)

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	$(pyscript)

data/clean/unlabeled.csv: scripts/clean_text.py data/raw/unlabeled
	$(pyscript)

So that’s five recipes built on four scripts, each reasonably separated and able to be dropped into other projects with minimal modification. If we change any of them, either because of a bug or because we wanted to try something different, we can use make to rebuild only the parts of the chain that depend on the change. And if we just type the command make, it automatically makes the first recipe, so we can be sure that our output.csv reflects all the latest and greatest work we’ve put in.
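Since every recipe boils down to python $^ -o $@, the shared boilerplate just has to make each script take its inputs as positional arguments and its output file via -o. Here is a minimal, hypothetical clean_text.py along those lines (the “cleaning” itself is only a placeholder):

# scripts/clean_text.py -- hypothetical boilerplate matching `python $^ -o $@`
import argparse
import csv
from pathlib import Path

def main():
    parser = argparse.ArgumentParser(description="Clean raw text for analysis.")
    parser.add_argument("inputs", nargs="+", help="raw files or directories")
    parser.add_argument("-o", "--output", required=True, help="cleaned CSV to write")
    args = parser.parse_args()

    rows = []
    for path in map(Path, args.inputs):
        if path.is_dir():
            files = sorted(p for p in path.rglob("*") if p.is_file())
        else:
            files = [path]
        for f in files:
            text = f.read_text(errors="ignore")
            rows.append((f.name, " ".join(text.split()).lower()))  # placeholder cleaning

    with open(args.output, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["document", "text"])
        writer.writerows(rows)

if __name__ == "__main__":
    main()

With that convention in place, make data/clean/trainers.csv ends up running the script with data/raw/trainers as its input and the target file as its output.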

There’s a lot more to Make, and I’ll focus in on a few features, tips, and tricks in an occasional series here. It’s been a big help to me, and if you find it helpful too, I’d love to hear from you!

  1. I said I did it, not that I’m proud of it. 

  2. The directory itself, mind you, not the files inside it! 

9 February 2016

Just before the Iowa Caucuses, I showed how you can make an 80-percent-of-the-way-there poll-based election forecasting system with surprisingly little work. I called it the 430 Model, because 430 is about 80 percent of 538, and I was ripping off Nate Silver. My model did OK on the Democratic side, though (like the polls) it was off on the Republican side; Trump’s support was overstated, or rather the opposition to Trump was understated.

But, since people have been interested, I’ve updated the model a bit and made forecasts for New Hampshire! I’ve simplified the weighting decay function to halve the weighted sample size for a poll each day1. I’ve also limited my results to polls with likely voter screens, and set the window of polls included in the super-poll by number instead of by date. You can download and play around with the new model as a Jupyter notebook here.
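The decay itself is one line. As a sketch (the half-life parameter is just my generalization of “halve it each day”):

def effective_sample_size(sample_size, age_days, half_life_days=1.0):
    # Halve a poll's weighted sample size for each day since it was taken.
    return sample_size * 0.5 ** (age_days / half_life_days)

# A 900-respondent poll from three days ago counts like a 112-person poll today:
effective_sample_size(900, 3)   # 112.5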

So, let’s get to the fun!

This is what the New Hampshire race has looked like over the last three weeks on the GOP side, when polling has happened pretty steadily:

New Hampshire GOP Chart

The picture’s been pretty stable the whole time. There are three tiers: the down-and-outs (Christie, Fiorina, Carson), the hope-for-seconds (Rubio, Kasich, Cruz, Bush), and Trump. The super-poll results bear that out:

| Candidate | Estimate | Standard Error |
| --- | --- | --- |
| Donald Trump | 31.2% | 1.7% |
| Marco Rubio | 14.5% | 1.3% |
| John Kasich | 13.8% | 1.2% |
| Ted Cruz | 11.7% | 1.2% |
| Jeb Bush | 11.0% | 1.1% |
| Chris Christie | 5.8% | 0.8% |
| Carly Fiorina | 4.6% | 0.7% |
| Ben Carson | 2.6% | 0.6% |

Rubio’s bump has gotten him to the top of the second tier, but realistically those guys are all tied, and each of them (except Cruz) really, really wants a second place finish. First place seems utterly out of reach for all of them; in my 10,000-trial simulation, Trump won every single time.

On the Democratic side, things seem equally fatalistic:

New Hampshire Dem Chart

Not even a story there. Bernie Sanders is winning. So also says the super-poll:

| Candidate | Estimate | Standard Error |
| --- | --- | --- |
| Bernie Sanders | 55.1% | 1.9% |
| Hillary Clinton | 40.9% | 1.9% |

As Trump did on the GOP side, Sanders wins every time in a 10,000-trial simulation.

But.

Polls have historically been bad in New Hampshire. These projections are a great summary of what the polls are telling us, but if the polls are consistently off by a significant amount, we could still get some real surprises. If I were Sanders or (especially) Trump, I would actually wish my numbers were a little worse; in both cases, not dominating is going to look like a loss. It’s entirely possible that one of those second-tier GOP candidates is really going to break out, though I have no idea who it would be or why.

Or maybe they’ve finally learned how to poll New Hampshire and this is all effectively a done deal. We’ll find out tonight!

  1. This sounds more extreme than it is. A poll with 300 respondents has a margin of error of about 2.9 percent, one with 150 respondents about 4.1 percent, and one with 75 respondents about 5.8 percent. When you think about how many people can change their minds every day during an election, that looks reasonable to me.
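For reference, those figures match one standard error for a proportion at a 50/50 split, sqrt(0.5 * 0.5 / n); a quick check:

import math

def sampling_error(n, p=0.5):
    # One standard error for a proportion p estimated from n respondents.
    return math.sqrt(p * (1 - p) / n)

for n in (300, 150, 75):
    print(n, round(100 * sampling_error(n), 1))   # 2.9, 4.1, 5.8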