16 March 2020

A few days ago I started seeing a chart floating around of coronavirus cases by country for each day, starting on the day the country hit 100 cases. It struck me as odd to put the US on that chart next to, say, Italy, because the US could reach a similar number of cases with small outbreaks in a large number of cities, where Italy got there with huge outbreaks in a few. I found out that the source data included subnational data, so I started putting that version of the chart on Twitter. There’s been enough interest that I’ve decided to make a live tracker1 with that visualization and others as I think of them. You can find that tracker here.

  1. Were you aware that JavaScript charting libraries are, in fact, the worst? The more you know! 

1 March 2020

Background

Oren Cass thinks the middle class has a problem. Cass, a scholar at the Manhattan Institute and head of the new organization American Compass, argues that middle class families feel like they’re worse and worse off, and furthermore, he argues that they’re right. The solution he provides is for American conservatism to throw off the libertarian market fundamentalists who have controlled it like brain slugs since the ’80s, and instead go in for some good old-fashioned industrial planning.

There’s a lot in there. But we’ll blow past most of it for now to focus on the empirical argument that Cass has provided for his claim that the middle class is worse off than it used to be. He doesn’t think that the official statistics (which show the opposite) are very good, because of the way they handle quality adjustments. Inflation measures try to account for the fact that goods get more expensive over time both because of inflation and because quality improves. The next iPhone will be more expensive than the last one, but also better, so a lot of that price increase doesn’t count as inflation.1 But aha! says Cass: what happens if they stop selling the old iPhone? Sure, the new one’s better, but it’s still more expensive to get a smartphone, even if you don’t think the improved quality justifies the higher price.

The Cost of Thriving Index

So to his credit, Cass has put forward an alternative. He picked four categories that he argues are essential to feeling like you’ve got a solid, middle class life. Those categories are:

  • Housing, as measured by the Department of Housing and Urban Development’s estimate of a fair market rent for a 3-bedroom house in Raleigh, NC

  • Health insurance, as measured by the Kaiser Family Foundation’s estimated average premium for employer-sponsored health insurance for a family

  • Transportation, as measured by the Bureau of Transportation Statistics’s estimate of the total cost of driving a car 15,000 miles per year

  • Education, as measured by the Department of Education’s estimate of the cost of one semester of public college

This basket he compares to the final component, income, as measured by median annual wages for male workers. Why only males? Because, he reasons, it used to be perfectly normal to be able to afford a middle-class life on one income, so why shouldn’t it be so now?

The combined price of those four expenses compared to the income measure is what Cass calls the Cost of Thriving Index, which he lays out in a report called The Cost of Thriving.

And by that measure, the American middle class is, indeed, getting screwed over pretty good. Here’s the chart of the number of weeks of work needed to fund the four categories of major expenses for each year from 1985-2018:

Cass Original Chart, Weeks of Work to Cover Major Expenses

Oh, no! It takes more than 52 weeks of work to afford a middle class life on a middle class income! There’s not that many weeks in a year! That sounds pretty bad!

The chart that’s gone viral is this next one, showing the expenses broken out vs. income over time.

Cass Original Chart, Income vs Major Expenses

Pretty dramatic stuff. Wages have gone up over time, sure, but they’ve just been dwarfed by the increased price of the major expenses. Heck, after all that, there’s not even money to pay for food!

The Problem

Again, there’s a lot in there. Almost every bit of it is controversial on a theoretical level. See, for example, this, this, and this.

What I want to focus on is that health care component. You can see from the second chart that it’s the biggest single driver of rising costs. That’s not too surprising because we all know that health care costs have been going up.

But there’s a problem. As most folks with jobs are aware, an employee rarely pays the whole cost of their health insurance; their employer pays some, too. But Cass is using the total cost as the price of health insurance. So in essence, his “cost” of thriving includes costs that a wage-earner does not pay.2

This mistake matters because it goes directly to what the COTI is supposed to measure. We’re supposed to be looking at how long someone has to work to pay for these four major categories. If we’re including things that the wage-earner doesn’t pay for himself, then we’re going to get the wrong picture—at least if the difference is large enough to matter.

What the COTI Says When Done Correctly

I’m not the first person to point this out, but I was curious just how big a problem this was, so I decided to reconstruct the index myself. First, I included the employer-paid health insurance premiums to make sure my number matched Cass’s. Then I recalculated it using only the employee-paid portion, which I got from the same source. It’s worth pointing out that Cass gave clear enough indications of his sources and methods that I was able to reproduce most of his numbers exactly, and the ones that weren’t exact are close enough as makes no difference.3
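
For the flavor of it, here’s a minimal sketch of the calculation, with hypothetical column names standing in for the source series (the real version is in the notebook linked in the footnotes):

import pandas as pd


def weeks_of_work(df, premium_column):
    """Weeks of median male earnings needed to cover the four expenses."""
    expenses = (df['rent'] + df[premium_column]
                + df['transportation'] + df['education'])
    weekly_income = df['median_male_income'] / 52
    return expenses / weekly_income

# df = pd.read_csv('coti_inputs.csv', index_col='year')  # hypothetical file
# original = weeks_of_work(df, 'total_premium')      # Cass's version
# corrected = weeks_of_work(df, 'employee_premium')  # employee share only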

So here’s what the weeks-of-work chart looks like with both the original index that incorrectly includes employer-paid premiums, and the recalculated index that only includes what employees actually pay:

Weeks of Work to Pay for Major Expenses, Original and Recalculated Index

Oh. Instead of taking 53 weeks of work to pay for a middle class life, it’s actually 39. Not only is that a good bit less than a year, but it’s lower than where Cass’s original index had us in 1996. It’s unclear whether Cass thinks 1996 was a good time and the trend is still up, but we can see now that the four categories are less than all-consuming.

And what about the raw expenses chart? Here’s my recalculation of the original, so you can see that even with a slightly different health series, it’s the same story:

Income vs Major Expenses, Recalculation of Original

And here’s what it looks like when you count health insurance properly:

Income vs Major Expenses, Recalculation of Original

Well, then. Looks like the middle class still has a good bit of headroom, even after paying the price of thriving. And let’s not forget: the things that Cass includes in his major expenses are the ones where costs have gone up the most over time. The rest of that income can go to things like food, leisure, technology, and travel, which have gotten cheaper, improved in quality or convenience, and come in a far greater variety than they used to. Not only can a single wage-earner still cover Cass’s expenses like they could in 1985, they can go beyond them in many ways that were unimaginable back then.

A lot of folks have raised good criticisms of the Cost of Thriving Index. I think it’s wrong to cherry-pick the fastest-growing categories of expenditures. I think it’s wrong to ignore changes in quality over time. I think it’s wrong to ignore the unpaid work that women used to do more of as a group than they do now. But even if you ignore all of that, what the Cost of Thriving Index shows, at least when calculated properly, is that the middle class is doing just fine, thank you very much. Would it be good if the price of health insurance and college were growing more slowly? Sure. Of course it would. But are they eating up the middle class? No. No they are not.

  1. In theory, at least. I don’t have time for an Apple rant in the middle of this chart rant. 

  2. Economically, you could argue that the employee bears the cost of the employer share of insurance payments. In fact, I would argue that. But if you want to look at it that way, you’d need to add that to wages as income to make the index make sense. It wouldn’t really make a difference to my argument, so I kept it this way for simplicity. 

  3. All my work, including my source data, is available on GitHub. The repository is here; the notebook with the calculations is here.

    Cass uses a Kaiser Family Foundation estimate for health insurance costs; I looked at Kaiser and it pointed to the Agency for Healthcare Research and Quality. The AHRQ numbers do not exactly match the ones Cass reports (they’re off by a few hundred dollars at both ends) and cover more years than he says Kaiser does. The differences are small enough that you can’t see them on the graphs except for the first few years, and I’m confident that whatever series he’s using and the one I used are very similar, especially in recent years. 

5 December 2018

A while back I wrote up a Python data script boilerplate that crystallized some of the things I found myself doing over and over. And while that boilerplate has served surprisingly well, I’ve found myself regularly making a few changes, so I figure it’s probably time for an update to version two.

I’ll show you the finished product first, and then walk through each chunk, noting what I’ve changed1.

The Boilerplate

#!/usr/bin/env python3
"""
A boilerplate script to be customized for data projects.

This script-level docstring will double as the description when the script is
called with the --help or -h option.
"""

# Standard Library imports
import argparse
# import collections
# import csv
# import itertools
import logging

# External library imports
# import pandas as pd
# import numpy as np

# Standard Library from-style imports go here
from pathlib import Path

# External library from-style imports go here
# from matplotlib import pyplot as plt

__version__ = '0.1'

log = logging.getLogger(__name__ if __name__ != '__main__'
                        else Path(__file__).stem)


def manipulate_data(data):
    """This function is where the real work happens (or at least starts).

    Probably you should write some real documentation for it.

    Arguments:

    * data: the data to be manipulated

    """
    log.info("Doing some fun stuff here!")
    return data


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', nargs='?', default='-')
    parser.add_argument('-ei', '--infile_encoding', default='utf-8')
    parser.add_argument('-o', '--outfile', default='-')
    parser.add_argument('-eo', '--outfile_encoding', default='utf-8')

    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument('-v', '--verbose', action='store_const',
                           const=logging.DEBUG, default=logging.INFO)
    verbosity.add_argument('-q', '--quiet', dest='verbose',
                           action='store_const', const=logging.WARNING)

    parser.add_argument('--version', action='version',
                        version=f'%(prog)s v{__version__}')

    args = parser.parse_args()
    args.infile = argparse.FileType(encoding=args.infile_encoding)(args.infile)
    args.outfile = argparse.FileType(
        mode='w',
        encoding=args.outfile_encoding,
        # newline='', # for csvs
    )(args.outfile)
    return args


def read_instream(instream):
    """Convert raw input for to a manipulable format.

    Arguments:

    * instream: a file-like object

    Returns: probably a DataFrame

    """
    log.info('Reading Input')
    return instream.read()


def main():
    args = parse_args()
    logging.basicConfig(level=args.verbose)
    data = read_instream(args.infile)
    results = manipulate_data(data)
    print(results, file=args.outfile)


if __name__ == "__main__":
    main()

Walkthrough

The first chunk is pretty self-explanatory. It sets the shebang (now explicitly Python 3), gives a docstring that doubles as program help info later, and organizes the imports. This time I’ve included commented imports that I often use, like Pandas and NumPy. I’ve also dropped some imports that became unnecessary since argparse is a bit more sophisticated than it used to be.

#!/usr/bin/env python3
"""
A boilerplate script to be customized for data projects.

This script-level docstring will double as the description when the script is
called with the --help or -h option.
"""

# Standard Library imports
import argparse
# import collections
# import csv
# import itertools
import logging

# External library imports
# import pandas as pd
# import numpy as np

# Standard Library from-style imports go here
from pathlib import Path

# External library from-style imports go here
# from matplotlib import pyplot as plt

The next chunk sets up our module-level info. I’ve added in a version this time around, because you should version things. I’ve also made the log name dependent on whether the script is loaded as a module or not, because the full name may be more helpful if the script ends up as a submodule somewhere, which occasionally happens.

__version__ = '0.1'

log = logging.getLogger(__name__ if __name__ != '__main__'
                        else Path(__file__).stem)

The data manipulation function is more-or-less unchanged, since this is where the actual work occurs. In general, you’ll want to rename this function to what it actually does.

def manipulate_data(data):
    """This function is where the real work happens (or at least starts).

    Probably you should write some real documentation for it.

    Arguments:

    * data: the data to be manipulated

    """
    log.info("Doing some fun stuff here!")
    return data

The parse_args function is in many ways the star of the show here, and I’m going to break it into different chunks. In the first chunk, we create the parser and add an infile and outfile argument. We create optional encoding arguments for each of those as well. I’ve changed infile to be a positional argument because that makes it easier to use with make-style workflow tools. We’re taking the infile and outfile arguments as strings, with default values of ‘-’; as we’ll see below, this is the least ugly2 way to make use of argparse’s neat FileType object while still letting the user set the encoding at runtime.

That encoding point is another difference between the old and new version. Previously, encodings were set as script-level constants, which really works against the reusability idea.



def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', nargs='?', default='-')
    parser.add_argument('-ei', '--infile_encoding', default='utf-8')
    parser.add_argument('-o', '--outfile', default='-')
    parser.add_argument('-eo', '--outfile_encoding', default='utf-8')

In the next chunk we just throw in some handy helpers. First we add mutually exclusive verbose and quiet flags to set the logging level. Then we add in a version flag, because gosh darn are we professional.


    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument('-v', '--verbose', action='store_const',
                           const=logging.DEBUG, default=logging.INFO)
    verbosity.add_argument('-q', '--quiet', dest='verbose',
                           action='store_const', const=logging.WARNING)

    parser.add_argument('--version', action='version',
                        version=f'%(prog)s v{__version__}')

Now we parse our arguments and convert our input and output files to FileType objects. The great thing about FileType is that you can set properties like mode and encoding, and it’s smart enough to wrap standard input and output if the provided filename is ‘-’. No more messing around with sys.stdin and io objects! It looks a bit odd because FileType actually creates a factory object, which is then called with the path to the file to open it.

I’ll admit that while I included standard input and output in my original boilerplate three years ago, it’s only in the last year or so that I’ve found myself using it a lot. It plays very well with cloud infrastructure, and makes modularity all that much easier. Working with text streams also allows you to use command-line tools like grep and sed, which are often undervalued, especially when working with large files.

    args = parser.parse_args()
    args.infile = argparse.FileType(encoding=args.infile_encoding)(args.infile)
    args.outfile = argparse.FileType(
        mode='w',
        encoding=args.outfile_encoding,
        # newline='', # for csvs
    )(args.outfile)
    return args

The read_instream function isn’t always one that lives through to the production script. In some cases, the read_instream function is entirely replaced by a pd.read_csv or something like that. If it’s simple enough, I keep it in the main function. But when you do have a complicated few steps to get the data in the right shape, it’s best to segregate it to its own function. The temptation is to put the code getting your data ready for manipulation or analysis in the manipulation or analysis function, but that’s bad design because it means you spend a lot of time in a function not doing the thing that is the point of that function. If only for mental clarity, keep it separate. Tidy your data here.

def read_instream(instream):
    """Convert raw input for to a manipulable format.

    Arguments:

    * instream: a file-like object

    Returns: probably a DataFrame

    """
    log.info('Reading Input')
    return instream.read()
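
If your input is, say, a CSV headed for a DataFrame, a pandas-flavored version might look something like this (just a sketch; it assumes you’ve uncommented the pandas import at the top of the script):

def read_instream(instream):
    """Read a CSV from a file-like object into a DataFrame."""
    log.info('Reading Input')
    return pd.read_csv(instream)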

Finally we have the standard main function: input, manipulate, output. On a suggestion from Arya McCarthy, I’ve switched to using the print function to print the final results, since print will implicitly handle conversion to a text format, while you have to do that yourself when using outfile.write. Of course, that line will often be replaced with to_csv or something like that.

def main():
    args = parse_args()
    logging.basicConfig(level=args.verbose)
    data = read_instream(args.infile)
    results = manipulate_data(data)
    print(results, file=args.outfile)


if __name__ == "__main__":
    main()

Why Scripts instead of Notebooks?

I’m not going to re-hash the Unix Philosophy or the overkill question, since I covered those last time. But the question that’s even more pressing now than it was three years ago is: why the heck are we writing scripts instead of doing everything as a Jupyter Notebook?

I guess I’m a bit of a notebook skeptic, even though I use notebooks every day. I recognize that people use them all the time to do large-scale, impressive things in production. Look at Netflix. They’re great for experimentation, they’re great for graphics.

But I just don’t trust them. I don’t like that cells can be run out of order, or multiple times, or that you can have a variable you defined and then deleted or redefined so that it no longer matches anything on the screen. I don’t like that notebooks don’t work cleanly and directly with version control, and I don’t like that they don’t work cleanly and directly with text streams on the command line. You can’t import them, and 99 percent of them are named “Untitled”.

Maybe that means I’m just not disciplined enough, and maybe it means I’m a grumpy old man. I can live with that. But scripts have never let me down.

Tell me what you think!

So that’s the new boilerplate. If you use it, or have questions or edits, I’d love to hear from you on Twitter or just email me.

  1. It’s also on GitHub here, of course 

  2. It’s still a little ugly. 

17 April 2018

This piece in the Atlantic has been making the rounds, suggesting that Jupyter notebooks are “what comes next” as the scientific paper goes the way of legal vellum and religious papyrus. The argument runs something like this: most of the actual work of science is computational. Notebook environments give an excellent environment for doing replicable computation while also interspersing explanatory text, and Jupyter notebooks are the best ones around. This clears away the need for much of the gobbledygook for which scientific papers are so widely criticized, and puts the work front and center.

And some of that is right. Jupyter is amazing; my team and I are actually working right now to build a Jupyter-based research platform for our organization. And well-documented Jupyter notebooks are certainly a step up from the incomprehensible do-files that accompany many papers, especially when they’re augmented with simple controls to modify parameters, for example.

But the problem with this line of thinking is that it gets the basic point of a scientific paper wrong. The computation is not the most important part of a paper. The most important part is the writing.

A paper should do three things:

  1. Orient the reader to the current state of knowledge
  2. Offer relevant evidence
  3. Integrate the evidence to produce a new current state of knowledge

Computations happen, if at all, in number 2. Jupyter’s great for presenting that. Jupyter doesn’t help at all with numbers 1 and 3, and they are at least as important, if not more so. What helps them—and really the only thing that helps them—is clear thinking and writing.

Now, obviously you can have text in a Jupyter notebook. You have Markdown and HTML, and that’s a lot! But Jupyter is not a writing environment. There’s no formatting help if you forget the Markdown syntax. You can’t track changes or get comments easily from your peers or coauthors. You can’t manage your bibliography. And it’s just not a pleasant interface for composing text.

For that matter, it’s also not great for reading text. The lines are too long, for one thing. You can’t easily jump to the footnotes and back, like you can in a well-formatted PDF. There aren’t columns, and there can’t be.

That’s not really a criticism; that’s not what Jupyter is for. It has text support for writing short bits of text that are secondary to the computational work. But let’s not pretend that Jupyter will ever be an excellent replacement for Word or Vim on the editing side, or a PDF on the reading side.

So, if not notebooks, is there something that can replace the crappy scientific paper? At the risk of being dull, how about: the good scientific paper? Papers where introductions and conclusions aren’t unnecessarily ponderous, where the evidence is delivered with minimal gobbledygook and presented in a replicable way1, and where implications are explored with straightforwardness and intellectual humility?

Of course, that’s what we’re supposed to be doing now. John H. Cochrane has a wonderful write-up on how to do it in economics2. And there are good examples out there. Watson and Crick’s paper introducing the double-helix structure of DNA? It’s one page long. Including a picture.

Now, if we don’t do that in the future, it’ll be for the same reason we don’t do it now. Journals and peers don’t make us. Publish-or-perish puts more emphasis on producing papers than on producing good papers. Papers are written to be published, not read.

Fixing that technology—the institutional, meta-technology of publishing—would fix the scientific paper. But I don’t think Jupyter Notebooks, glorious powerful magic though they are, can do that.

  1. Jupyter can help with this! 

  2. NOT SCIENCE LOL! I know, I know. 

29 January 2018

This contentious interview of Jordan Peterson, a University of Toronto Psychology Professor, by Cathy Newman of the UK’s Channel 4, has garnered a huge amount of attention. While the interview was nominally to promote Peterson’s upcoming book, Newman clearly believed that she was going to be able to nail him as an ignorant bigot. Unfortunately for her, the general consensus is that Peterson was able to avoid that outcome, and make her look pretty silly in the process.

Much of the conversation (see here, for example) has focused on Newman’s interrogatory tactics and how Peterson chose to respond to them, but I think there are lessons to be learned here about communicating with statistics. The first time I watched the video, my initial reaction was that Peterson clearly understood the statistics he wanted to use to support his points, and the interviewer did not. Those statistics are not all that controversial, even among those who tend to disagree with Peterson’s conclusions, but throughout the interview Newman consistently jumps from his rather modest claims to extreme (and sometimes bizarre) conclusions that she assigns to him.

Even if, as some suggest, Newman’s ignorance here was deliberate, her responses reflect the kind of intuitive interpretation of statistics that I’ve seen many times. Statistics are not intuitive. They are tricky. If you need to use them to communicate with a non-statistician—and you will—it’s important to help people understand what the statistics you’re using do and do not imply.

Let’s look at two sections where, with the help of hindsight, we might be able to improve on Peterson’s presentation. First, let’s examine the initial conversation about the pay gap.

Peterson makes two mistakes here. First, in an uncharacteristically imprecise use of language, he says that the pay gap “does not exist,” when that’s not what he means. Over a minute later, he clarifies that he actually means “does not exist solely due to gender”, but by that point a minute of airtime has gone to waste.

The more common mistake Peterson makes in the pay gap discussion, though, is focusing on the method. He starts talking about multi-variate analysis, and the interviewer—and most home viewers—have no idea what it means.1 When challenged by Newman on why he keeps talking about it, he enters into a mostly fine description of why controls are important in regression (although he does make it sound like he’s doing a series of one-to-one comparisons rather than a single composite analysis). He’s not wrong, but he’s also not making his point; the only thing that this part of the conversation does for him is make it sound like he knows what he’s talking about, but the lay audience won’t get anything out of it.
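
For the statistically inclined, here is roughly the kind of analysis he’s gesturing at, sketched with statsmodels and entirely made-up data; the point is just that the gender coefficient is estimated while holding the other factors constant:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up individual-level data, just to show the shape of the analysis.
rng = np.random.RandomState(0)
n = 1000
wages = pd.DataFrame({
    'female': rng.randint(0, 2, n),
    'age': rng.randint(20, 65, n),
    'agreeableness': rng.normal(0, 1, n),
    'hours_worked': rng.normal(40, 8, n),
})
wages['log_wage'] = (3 + 0.02 * wages['age'] + 0.01 * wages['hours_worked']
                     - 0.05 * wages['agreeableness'] + rng.normal(0, 0.3, n))

# The regression controls for the other factors, so the coefficient on
# 'female' is the pay gap that remains after accounting for them.
model = smf.ols('log_wage ~ female + age + hours_worked + agreeableness',
                data=wages).fit()
print(model.params['female'])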

Everyone who communicates about regression-type analysis needs to have a stock phrase to describe what’s important about it and move on, and I was a bit surprised Peterson didn’t have one ready. Here’s how I might have phrased the point he was making in a way that could keep the conversation focused on the point Peterson was driving at:

“It does seem that way, but what repeated studies have reliably found is that when you account for a person’s age and their personality and their aptitude and their interests, then the difference their gender makes to their salary is very small. So a man and a woman who are similar in other ways should expect to make about the same amount of money. So we know that the pay gap is not mostly due to gender bias.”

I timed myself and that took 22 seconds to say, without getting the methodology behind the point in the way of the point itself. Peterson and Newman spent six times that on an unfruitful conversation about how statistics work.

The second difficulty that stood out to me about the interview was the way that Peterson and Newman talked past each other on the subject of population characteristics and individual characteristics.

The best example picks up right where the last stopped:

Again, Peterson makes an unforced error when he says “Women are less agreeable than men,” and again, the problem isn’t that he’s wrong exactly but rather that what he’s saying will be taken differently by the viewers than he means it. The natural implication of “women are less agreeable than men” is that all women are less agreeable than all men.

This confusion is nicely demonstrated by the exchange that follows. Newman accuses Peterson of “a vast generalization,” by which she means that he’s making a statement about all individual women. He says that “it’s not a generalization,” and what he means is that it’s a statement about the distribution of that trait among the population of all women. The disconnect is that the same words mean something slightly different to the two because one is thinking statistically and the other isn’t. And the onus has to be on Peterson to make his point clear.

At first I thought the best phrase to do that would be “agreeableness is more prevalent among women than men,” but I don’t think that’s quite right, because agreeableness is a continuous variable. You could opt for something less precise like “more women are highly agreeable than men,” but that doesn’t quite fit right either. I think the best solution here is a small modification: “Women tend to be more agreeable than men.” People understand the non-universality of tend, and that avoids the confusion.

This one isn’t so much a question of wasting time as of avoiding confusion. To their credit, Newman and Peterson reach consensus on what they mean fairly quickly with the final exchange in that clip. They just both get a bit annoyed doing it.

Peterson warmed up as the interview went along, and I think he handled a second go at much the same argument much better:

In that exchange, Newman fires off a number of conclusions that she claims are implied by Peterson’s arguments. All of them are predicated on the idea that his population statistics determine what will happen with every woman. Instead of talking about how statistics work, he goes to the concrete example of Newman herself. That allows him to make his point without any confusion: she’s been successful precisely because she’s pursued her career in the way that he says matters more than gender. There’s no way to confuse “you, as a woman, are successful because you have battled for it” with “the need to battle for success means women will never succeed.” Sometimes when you’re talking about statistical truths, the best way to do it is to avoid discussing them statistically at all.

Now, the point of this isn’t that Peterson’s dumb and I’m smart; I’ve had time to consider and edit. The point is that communicating statistics is incredibly difficult, even if you understand them well yourself. It’s a separate skill, and it takes practice. When you screw it up, it’s tempting to blame the ignorance of your listeners, but that’s too easy; it’s far better in the long run to focus on how you can be better at communicating statistical facts. Then people might be more interested in what you have to say.

Some stray other thoughts about the interview:

  • In general, the interview has been scored as a hands-down win for Peterson. If it had ended after the first ten minutes or so, I’m not sure that would have been the case. I think Newman and Channel 4 deserve a bit of credit for resisting the urge to edit it down to that.

  • That said, the utter blue screen that happens to Newman is one of the most stunning things I’ve ever seen on television.

  • At one point Newman argues against the phrase “the typical woman” because “all women are different,” to which Peterson replies that “they’re different in some ways and the same in others.” I found her comment utterly bizarre; if all women are totally different, what makes them women?

  1. As if to prove this point, both Newman and the Channel 4 caption-writer who worked on this clip thought he was saying “multi-varied analysis.” 

21 August 2017

Introduction

I’ve been using Vim as my editor for over ten years1. That’s a long time to build up settings and plugins, and generally get a lot of cruft into my vimrc. These days, when I go in there, I don’t always remember what a particular setting or plugin does or why I put it there, and I rarely look to see if there are updated versions of anything.

So I thought it would be both advantageous and fun to clear out my settings and start again: go Vim Zero, and build up from there. And it was fun! Here’s what I came up with.

Requirements

The first question was: what do I really want an editor to do? I use Vim for writing and for coding. The former is nearly always in markdown, generally Pandoc-flavored but sometimes a different flavor. I write code most frequently in Python, and a decent amount in HTML, JavaScript, CSS and SCSS, and Make. That means I need good support for multiple languages—including syntax checkers and completion—and I need a writing environment that feels comfortable.

I also write both code and text on multiple computers. I use Linux when I can, but use Windows at work and occasionally find myself on a Mac. That means I need to be able to sync everything using git, and all my plugins and settings have to work in the same way. I want my experience in GVim on Windows to be as close as possible to my experience in the terminal on my Arch box.

Finally, I want to keep things as simple and elegant as possible. I want this to still be Vim when I’m done, which means I don’t want a bunch of functionality I’m not using or a bunch of nonsense I don’t need on my screen2. In general, I want to prefer built-in functionality to plugins, and simple, tightly focused plugins to wide-ranging and powerful ones. With that in mind, I started with the settings that made vanilla vim as pleasant as possible.

Built-in Functionality

Well, that’s almost true. I knew I was going to be using Tim Pope’s vim-sensible plugin simply because it sets a goodly number of the things you see in almost every vimrc, like being able to backspace over anything and setting incremental search. We’ll get back to that later.

General Functionality

First I set the mapleader to comma so that it applies for all my mappings. I set hidden to allow for unsaved background buffers, and spell so that I don’t reveal my horrible spelling to the world. Turns out you can set new splits to open on the right, so I turn that on (we live in a world of widescreen monitors, how is this not the default?). I also turn on persistent undo files, but those settings show up in their own section below. The rest looks like this:

" Built-In Functionality
"" General
let mapleader = ','

set hidden " Allow background buffers without saving
set spell spelllang=en_us
set splitright " Split to right by default

Text-Wrapping

In general, I want things wrapped at 79 characters—enough, in fact, that it’s easier for me to turn it off when I don’t want it than turn it on when I do. I also like having a highlighted column at 80 characters as a visual guide. I always want hard wraps, so I turn off soft-wrapping.

"" Text Wrapping
set textwidth=79
set colorcolumn=80
set nowrap

Search and Substitutions

I find I want the g flag in my s/ commands far more often than I don’t, so I set it to be on by default. I use highlight searches because that’s half the point, and use the handy combination of ignorecase and smartcase to ignore case when I type in lowercase, but not when I type in capital letters. I also have my first mapping here: comma-space for clearing the highlighted searches. It makes a nice slapping noise, which I quite enjoy, as if to say “get that out of here.”

"" Search and Substitute
set gdefault " use global flag by default in s: commands
set hlsearch " highlight searches
set ignorecase 
set smartcase " don't ignore capitals in searches
nnoremap <leader><space> :nohls <enter>

Tabs

Because I am not a horrible human being who hates joy and love and light, I use four spaces instead of tabs whenever I can3. The following combo will do that, and should be required by law.

"" Tabs
set tabstop=4
set softtabstop=4
set shiftwidth=4
set expandtab

Backup, Swap, and Undo

The next section might be a little controversial. Backup files, swap files, and undo files are great features of vim, but I hate having them clutter up my actual work directories. This isn’t so bad on Linux, where hidden files are simple things, but on Windows, which will incomprehensibly ignore leading dots when doing file completion, it’s awful. So, after I turn on undo files for persistent undo across sessions, I set folders inside my vim folder to hold all of these (see the implementation notes at the end for how I make empty folders with Git).

"" Backup, Swap and Undo
set undofile " Persistent Undo
if has("win32")
    set directory=$HOME\vimfiles\swap,$TEMP
    set backupdir=$HOME\vimfiles\backup,$TEMP
    set undodir=$HOME\vimfiles\undo,$TEMP
else
    set directory=~/.vim/swap,/tmp
    set backupdir=~/.vim/backup,/tmp
    set undodir=~/.vim/undo,/tmp
endif

NetRW

Some folks won’t like this section either, because it’s about NetRW, vim’s file explorer. It gets more hate than it deserves, but I find it useful4. I set it to have the detail view with human-readable file sizes. The hiding behavior is a little odd, so I just tell the explorer to hide dotfiles, and to set them as hidden by default (this can be toggled with a). Then I turn off the banner. I also add a mapping to start the explorer; the exclamation point means that if the current buffer has unsaved changes, the Explorer will split vertically instead of horizontally.

""" NetRW
let g:netrw_liststyle = 1 " Detail View
let g:netrw_sizestyle = "H" " Human-readable file sizes
let g:netrw_list_hide = '\(^\|\s\s\)\zs\.\S\+' " hide dotfiles
let g:netrw_hide = 1 " hide dotfiles by default
let g:netrw_banner = 0 " Turn off banner
""" Explore in vertical split
nnoremap <Leader>e :Explore! <enter>

General Mappings

To wrap up the built-in functionality, I have my general mappings. Mapping a semicolon to the colon in normal mode is surprisingly useful. I use control-H and -L to cycle through my buffers, because that feels like moving left and right to me. I use comma-q to quit a buffer and comma-w to save. Finally, I use comma-x to access the system clipboard register, which allows me to copy and paste between vim and other programs. That last is the only mapping which I have set to work in all modes, and I use it all the time.

"" Mappings
nnoremap ; :
nnoremap <C-H> :bp <enter>
nnoremap <C-L> :bn <enter>
nnoremap <Leader>w :w <enter>
nnoremap <Leader>q :bd <enter>

noremap <Leader>x "+

Python Version

I use Python 3 more or less exclusively. Many of the libraries I use most are (finally) moving to require it, and I like it better anyway. So I have this little autocommand group to set my omnicompletion to Python 3:

"" Python Version
augroup python3
    au! BufEnter *.py setlocal omnifunc=python3complete#Complete
augroup END

Plugins

Plugins are a wonderful part of the Vim infrastructure, and they’re what let you really make the editor your own. That said, folks tend to go overboard; I see vimrc files floating around with dozens of plugins, and it’s just not necessary. When I started this project, I decided to only add plugins I didn’t want to live without, and I think I’ve kept it to a reasonable number. A fantastic resource has been Vim-Awesome, which makes it easy to find plugins by functionality, and also see which are popular, which are maintained, and so on. I knew about some of these, but others I didn’t, and so the site was a huge help.

Plugin Manager

Once upon a time, I installed plugins manually. Then I used my package manager and a script called vim-plugin-manager. Then Tim Pope wrote Pathogen, and like the rest of the world, I switched to it immediately. Then Vundle came along with its Git-driven management, and I happily used that until I started this project.

When I went to see what was out there, I found that Vundle was still a good option, but I was charmed by the simplicity of VimPlug, which didn’t need any rtp manipulation in my .vimrc and could do parallel installations and updates. I decided it was worth making the switch.

This breaks my rule a little bit about preferring built-in functionality; Vim 8 does have a built-in plugin manager of sorts. Unfortunately it would mean taking a step back: there’s no way to keep your plugins updated and I just don’t want to go back to doing it manually and fiddling with submodules in my vimfiles repository. So VimPlug it is! The entirety of my plugin installation section looks like this:

" Plugins 

"" Installation with VimPlug
if has("win32")
    call plug#begin('~/vimfiles/plugged')
else
    call plug#begin('~/.vim/plugged')
endif

""" Basics
Plug 'tpope/vim-sensible'
Plug 'sheerun/vim-polyglot'
Plug 'flazz/vim-colorschemes'

""" General Functionality
Plug 'lifepillar/vim-mucomplete'
Plug 'scrooloose/syntastic'
Plug 'sirver/ultisnips'
Plug 'honza/vim-snippets'
Plug 'tpope/vim-commentary'
Plug 'chiel92/vim-autoformat'

""" Particular Functionality
Plug 'junegunn/goyo.vim'
Plug 'junegunn/limelight.vim'
Plug 'vim-pandoc/vim-pandoc-syntax'
Plug 'godlygeek/tabular'

call plug#end()

I’ll walk through each of those in more detail and give the configuration I have for each. As you can see above, I group my plugins into three groups: basics, which include simple settings, filetype and syntax support, and color schemes; general functionality plugins, which add features that are generally useful when editing code or writing text; and particular functionality plugins, which are only useful in particular situations.

Basics

I’ve already mentioned Tim Pope’s excellent Vim-Sensible plugin, which there’s really no downside to installing. It just gives you a lot of sane defaults, and the code is perfectly readable if you want the details.

Vim-Polyglot and Vim-Colorschemes are both omnibus packages. Essentially, they’re curated lists. At first this seemed like overkill to me—why not just install the ones I want? But then I remembered just how many times I’ve switched to a new language and found that either vim didn’t have a filetype for it, or that the user community had a few fixes for the built-in version of indentation or something. Vim-Polyglot collects all of the best of those, and that just saves me having to do it later. Similarly, Vim-Colorschemes has at least one color scheme you will like, even if you’re as picky as I am5. I turn on gui-style colors for the terminal and use the Darth style:

"" Colors
set termguicolors
colorscheme darth

General Functionality Plugins

Autocompletion

Vim isn’t an IDE, and shouldn’t be, but autocompletion is really, really nice. That said, I lived without it for a long time because I didn’t like my options. YouCompleteMe is a pain on Windows. So is NeoComplete, and while its predecessor NeoComplCache is more easily cross-platform, it can be slow and frustrating and isn’t updated any more. VimCompletesMe isn’t bad, but has a few quirks I don’t like and is entirely tab-driven, when I would rather just have my options pop up for me.

MuComplete gives me what I want. It does omnicompletion, file completion, snippet completion (see below), pops up as I type and doesn’t get in my way. And it’s fast. Here’s the configuration to make it work:

"" Autocompletion
set completeopt=menuone,noinsert,noselect
set shortmess+=c " Turn off completion messages

inoremap <expr> <c-e> mucomplete#popup_exit("\<c-e>")
inoremap <expr> <c-y> mucomplete#popup_exit("\<c-y>")
inoremap <expr>  <cr> mucomplete#popup_exit("\<cr>")

let g:mucomplete#enable_auto_at_startup = 1 

Snippets

I’ve gone back and forth on snippets for years, but for the moment I’m pro. They save a lot of time writing HTML and encourage me to write docstrings6. Here, Ultisnips has been around for a long time, and while SnipMate also exists as a venerable option, Ultisnips still feels like the gold standard. A solid compilation of snippets is available with the Vim-Snippets plugin. You don’t need any configuration for either as far as I’m concerned, but you can configure MuComplete to take advantage of Ultisnips with this line:

call add(g:mucomplete#chains['default'], 'ulti')

Commenting

I’ve used NerdCommenter for a long time, but Commentary, another from Tim Pope, gives a minimalist yet powerful implementation. You use gc to toggle comments, and that’s about it. That’s all I want here.

Syntax Checking

Everything I write is perfect the first time, obviously, but sometimes I read other people’s code. Syntastic is a truly clever plugin for running syntax checking. Rather than write its own rules it uses external checkers, like flake8 and tidy, which is a very Unix way of approaching the problem. Hard to beat.

Autoformatting

Of course, if I can get a program to fix my—er, other people’s mistakes automatically, so much the better. The aptly named Vim-Autoformat solves this problem nicely. Like Syntastic, Autoformat uses external programs to format your code. It doesn’t have support for as many programs built-in as does Syntastic, but it’s very easy to define your own.

I set autoformat to run when I save a file, but not to do the default vim sequence of autoindenting, retabbing, and removing trailing spaces. Effectively, this means it only does anything if I have a formatter installed.

"" Autoformat
au! BufWrite * :Autoformat
let g:autoformat_autoindent = 0
let g:autoformat_retab = 0
let g:autoformat_remove_trailing_spaces = 0

One odd corner case I’ve had to be a bit clever to deal with is markdown. Generally when I write, I’m writing in Pandoc’s syntax, but in a few situations (primarily this blog), I’m using a slightly different form of the language. Now, I can use Pandoc to auto-format in either case, but I’ll need to vary the external call. The way I’ve solved this is to use an autocommand to set a default markdown flavor in a buffer-scoped variable, and then set a different flavor for specific matches—in this case when I open a file with a full .markdown extension that has a directory named “blog” somewhere in its path. Then I use the value of the flavor in the call to pandoc. That all looks like this:

augroup markdown_flavor
    au! BufNewFile,BufFilePre,BufRead *.md 
                \ let b:markdown_flavor="markdown"
    au! BufNewFile,BufFilePre,BufRead *.markdown
                \ let b:markdown_flavor="markdown"
    au! BufNewFile,BufFilePre,BufRead */blog/*.markdown
                \ let b:markdown_flavor="markdown_github".
                \"+footnotes".
                \"+yaml_metadata_block".
                \"-hard_line_blocks"
augroup END

let g:formatdef_pandoc =
            \'"pandoc  --standalone --atx-headers --columns=79'.
            \' -f markdown -t ".b:markdown_flavor'
let g:formatters_markdown_pandoc = ['pandoc']

Particular Plugin Functionality

Distraction-Free Writing

Distraction-free writing is an interesting concept that has been around for a while. The first implementation I remember hearing about was WriteRoom, and since then the concept has even made its way a bit into recent versions of Word. I don’t always use it when I’m writing, but sometimes it’s helpful. Goyo is about as good an implementation as you could ask for, especially when combined with the Limelight plugin to focus on individual paragraphs. You only need two lines to make them work together nicely:

"" Goyo & Limelight
autocmd! User GoyoEnter Limelight
autocmd! User GoyoLeave Limelight!

Pandoc Syntax

Vim doesn’t have a built-in Pandoc filetype or syntax file, and Pandoc really goes a long way beyond simple markdown. There’s a Vim-Pandoc plugin, but I found myself turning off an awful lot of the functionality because it was either in my way or a re-implementation of something I already had. Finally I decided just to use the syntax file, which is helpfully separated into its own plugin named Vim-Pandoc-Syntax.

To get files to use the correct syntax, you have to use an autocommand. The syntax plugin is also surprisingly powerful; I turn off the “conceal” functionality because I don’t like the way it looks, but I’m very impressed by the ability to use the syntax of the embedded language in fenced code blocks. Here’s my configuration:

"" Pandoc
augroup pandoc_syntax
    au! BufNewFile,BufFilePre,BufRead *.md set filetype=markdown.pandoc
    au! BufNewFile,BufFilePre,BufRead *.markdown set filetype=markdown.pandoc
augroup END

let g:pandoc#syntax#conceal#use = 0
let g:pandoc#syntax#codeblocks#embeds#langs = ['python', 'vim', 'make',
            \  'bash=sh', 'html', 'css', 'scss', 'javascript']

Table Formatting

Tabular is one of those wonderful little pieces of code that does one thing extremely well. Tabular makes tables. That’s it. When you don’t need a table, you don’t have to think about it. When you do, it saves you ten minutes of fiddling around. It plays very well with Pandoc.

Plugins I Didn’t Use

Obviously there are lots of plugins I didn’t install. Here are a few that I know are popular, and why I didn’t use them:

  • Fugitive: See, I don’t love everything Tim Pope does. I had this installed for a while, but never found myself using it. I’m happy on the command line.
  • Airline: These kinds of plugins just strike me as a way to throw lots of distracting information onto the screen. I don’t see the attraction.
  • NerdTree: I’m happy with NetRW the way I have it.
  • Tagbar: I don’t want to have to install ctags, and I can just search.
  • CtrlP: Apparently people have more trouble than I do finding things?
  • Multiple-Cursors: I almost went for this one, but Christoph Hermann’s article on it convinced me that it didn’t do anything you can’t just do with built-in functionality.

Gvim

My gvimrc is simple: I turn everything off, and set my fonts:

set guioptions-=m " Turn off menubar
set guioptions-=T " Turn off toolbar
set guioptions-=r " Turn off right-hand scrollbar
set guioptions-=R " Turn off right-hand scrollbar when split
set guioptions-=L " Turn off left-hand scrollbar
set guioptions-=l " Turn off left-hand scrollbar when split
set guicursor+=a:blinkon0 " Turn off blinking cursor

if has("win32")
    set guifont=Consolas:h11
else
    set guifont=Inconsolata\ 12
endif

Implementation Notes

Vim keeps its files in a .vim folder on Linux, and a vimfiles folder on Windows. Happily, in new versions of vim, the vimrc and gvimrc can live inside this folder, which makes keeping everything in git easier.

To ensure that my directories for undo, backup, and swap exist but aren’t versioned, I put a .gitignore in each with this content:

*
!.gitignore

I also have a .gitignore in the root directory, to ignore my netrw history, my spelling files, and my plugins.

.netrwhist
spell/*
plugged/*

I leave the autoload directory versioned, which means VimPlug itself gets versioned. This saves me a step when I move to a different machine, and it’s not too terrible to update the repository when I occasionally update VimPlug itself.

With all that done, all I need to do to get set up on a new box (with git installed) is clone my vimfiles repository, fire up vim or Gvim, run :PlugInstall, and restart, which takes almost no time at all.

Final Thoughts

I should disclose that this post has taken me an absurd amount of time to write. I’ve spent hours now comparing plugins, trying things out, reading through documentation, and changing my mind. I know Vim far better than I did when I started, no doubt about it. If your vimrc has gotten a bit stale, it might be a good time for you to do a Vim Zero experiment yourself. Or, feel free to build off my setup, which you can find on GitHub here. If you decide to give it a go, be sure to let me know on Twitter.

  1. Yes, I’ve heard of Sublime and PyCharm and Atom and all of them. No thank you. Too much noise. You have fun, though. 

  2. I mean, just look at this website. I like things clean and simple. 

  3. Makefiles are the exception, which is infuriating

  4. The only real problem I have with it is that if you try to edit a directory path, you get a NetRW buffer that doesn’t go away. So don’t do that. 

  5. My requirements here include: black background and nothing else, not too much yellow, not too much orange, no neon-type colors (the hot pink completion menu just ruins the Ocean family of schemes, and my old vimrc had code to manually replace it), &c, &c. 

  6. Oh, don’t judge, you don’t write them either. 

3 January 2017

I’m a Linux guy at heart, but like a lot of folks, I’m stuck on Windows at work. But even on Windows, I spend a lot of time in the command line. Now, the whole point of Windows was to get away from the command line, so the default command prompt, cmd.exe, has never been much more than a glorified DOS shell. The alternative for power users, PowerShell, is something I’ve always found to be confusing and terrible.

I want my Unix tools, dang it! And, open source being open source, there have been a few attempts to make that happen, most notably Cygwin and MinGW. But while those are certainly impressive projects, they’re not something I want to try to keep updated for my team, or document for the researchers I want using my scripts; the install and update tools are just too complicated for that.

Happily, Anaconda is there to save the day. Again.1 The Anaconda default channels include a suite of tools built with M2, a project descended from MinGW. That means that not only do you get to manage the tools you want with the excellent conda package manager, but you can also easily reproduce your environment elsewhere using conda requirement files. And no administrator privileges needed!

There’s only one drawback, which really isn’t a drawback: you can’t run m2-based programs in the default anaconda environment. Why isn’t that really a drawback? Because it keeps your path clean if you ever need to be using the Windows built-in tools instead of Unix ones that happen to have the same name.

To install m2 (assuming you have Anaconda already installed), just create a new conda environment or switch to one that already exists. Then install the m2-base package. I have one environment called “main” which I use when I’m not isolating requirements on a project for release. You can create one by running this command (replacing main with the desired name):

conda create -n main m2-base

Now, whenever you want to use your Unix tools, just run activate main (or whatever you called the environment).

Or, if you have an environment you want to use already, just activate it and conda install m2-base.

That will give you all the coreutils, as well as bash if you want to use that. And, while this gives you a perfectly good base system, there are plenty more tools available, like make. Just run conda search m2-.* to see them all.

Is it as good as using Linux? No. No it is not. But as far as I can tell, it’s the next best thing.

  1. If you aren’t using Anaconda to manage your Python environment on Windows, you really, really should be. Start here

8 November 2016

Back in January, I introduced the 430 Model for using polls to forecast elections. The name comes from the idea that you get 80 percent of your results from 20 percent of your work, and I only wanted to do about 20 percent of the work Nate Silver does with FiveThirtyEight.

Since then I’ve wanted to do a bigger and better version for the general election. But since I’m the kind of guy who only wants to do 20 percent of the work, most of that didn’t happen. But I do have a slightly bigger and better version! So that’s something.

As before, I’m using the poll data from the Pollster API, combining the results and weighting by age to create a super-poll. Then I’m running 10,000 simulations using the results of the super-poll to figure out how likely each candidate is to win.

The code for this version is in a Jupyter notebook on GitHub, so you can play with it if you want to pretend Jill Stein matters or something.
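
In stripped-down form, the simulation step looks something like this (a sketch using the final super-poll numbers from the table below, not the actual notebook code):

import numpy as np

# Draw 10,000 hypothetical outcomes from the super-poll estimates.
n_sims = 10000
clinton = np.random.normal(48.4, 0.2, n_sims)
trump = np.random.normal(42.7, 0.2, n_sims)
print((clinton > trump).mean())  # share of simulations Clinton wins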

Now to the results! Using the super-poll results since July, the race has looked like this:

Clinton is winning by a lot in the super-poll

Those lines are the 95% confidence bands. The story here is that, while things were pretty uncertain early in the race, Clinton began to separate after the first debate in September, and Trump has never been able to catch up. The recent revelations about the emails, for all the sturm und drang they caused, don’t really show up here. She’s just ahead. How ahead? This ahead:

Candidate          Mean    Standard Deviation
Hillary Clinton    48.4%   (0.2)
Donald Trump       42.7%   (0.2)
Other               8.9%   (0.1)

Well, she’s not over fifty. But five and a half points is a pretty solid lead with as many polls as we’ve had. In 10,000 trial simulations, she won every single one. That’s as comfortable as you get.

So, any caution? If, as many suspect, there’s a weird Trumpian version of the Bradley Effect, then he has a better chance than the model projects, which is none. But with a standard error of 0.2 percentage points, he needs a swing of better than two and a half points if he’s drawing them all from Clinton, and more if he’s getting them from the “Other” category. In the best case, that’s one secret supporter to every public one. That’s a tough row to hoe.

Still, this is a strictly poll-based model. If the polls are systematically wrong, then so is the model. We didn’t really see systematically wrong polling in the primaries, though we have seen it in votes abroad, such as Brexit. But that has to be the hope, if you’re a Trump supporter: that the polls are almost all wrong and in the same way.

My gut says no. Clinton’s got this one.

7 April 2016

Data work can take a long, long time. Once you’ve moved beyond small-scale projects, you just have to get used to doing something else while your machine chugs away at cleaning, preparing, or analyzing your data. And since most of the time you have to try a few things to see what comes out, you’ll have to sit and wait through multiple rounds of each step.

But a lot of people make it worse by doing one of two things: either they have one big script that they re-run from start to finish every time they make a change, or they run each step on demand (say, in a Jupyter notebook) and mentally keep track of what they need to do in what order.

The first is bad because every single run will take the whole run time, and your life will be swallowed up in waiting. The second is bad because you have to depend on yourself to keep what can be a pretty intricate map of your analysis flow in your head.

The Unix utility Make provides a better solution. Make is the old standard build system, designed to help with the compilation of large programs. Just like with data projects, compiling large programs takes forever, has a variety of smaller steps where you might need to go back and edit something, and has a web of dependent steps you might need to re-run depending on what you’ve changed. In other words: it’s a perfect fit.

What’s more, Make helps you keep your steps separate computationally, which helps you keep them clear conceptually. You don’t even have to use the same language for each step: I’ve had occasion to mix Python, Scala, and Stata in the same project,1 and Make made that much less painful.

And here’s the good news: if you’re on a Mac or Linux, it’s probably already installed. On Windows, just install Gow.

So how does it work? The basic idea is, you write a makefile filled with recipes that tell you how to make a target. Each target can have dependencies, and if a dependency has been modified more recently than the target, the recipe for the target gets re-run.

So let’s say I want to classify some text in a folder, using some trainers in another folder. Both have to be prepped for analysis, using a script in my scripts subdirectory. I might put the following recipe in my makefile:

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	python scripts/clean_text.py data/raw/trainers -o data/clean/trainers.csv

What’s going on here? The file before the colon is the target, the thing we’re trying to make. The files after it are the dependencies, things that need to be made before the target. The line beneath, which is indented by a tab—yes, a tab—is the command that makes the file. Now, if we were to run make data/clean/trainers.csv, make would check to see if either the clean_text.py script or the trainers directory2 had been modified more recently than the output file, and if so, it would run the script to create the file.

We can write this recipe a simpler way:

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	python $^ -o $@

In a makefile, $@ stands for the target and $^ stands for the list of dependencies, in order. That means that when the dependencies are a script followed by that script’s inputs, we can write the recipe with the automatic variables instead of spelling out the file names again. With the first recipe, for instance, $^ expands to scripts/clean_text.py data/raw/trainers and $@ to data/clean/trainers.csv, so Make runs exactly the command we wrote out by hand above.

Now let’s say we use the same script to clean the unlabeled input. We just need to add it as a new target:

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	python $^ -o $@

data/clean/unlabeled.csv: scripts/clean_text.py data/raw/unlabeled
	python $^ -o $@

Easy! Now if we update clean_text.py, Make knows we need to remake both those targets. But I hate repeating myself. Luckily, Make gives us canned recipes:

define pyscript
python $^ -o $@
endef

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	$(pyscript)

data/clean/unlabeled.csv: scripts/clean_text.py data/raw/unlabeled
	$(pyscript)

In fact, since I write all my scripts based off of the same boilerplate, let’s fill out what the whole project might look like:

define pyscript
python $^ -o $@
endef

data/results/output.csv: scripts/classify.py data/classifier.pickle data/clean/unlabeled.csv
	$(pyscript)

data/classifier.pickle: scripts/train_classifier.py data/clean/trainers.csv data/trainer_labels.csv
	$(pyscript)

data/results/analysis.csv: scripts/tune_classifier.py data/clean/trainers.csv data/clean/unlabeled.csv
	$(pyscript)

data/clean/trainers.csv: scripts/clean_text.py data/raw/trainers
	$(pyscript)

data/clean/unlabeled.csv: scripts/clean_text.py data/raw/unlabeled
	$(pyscript)

So that’s five recipes built on four scripts, each reasonably separated and able to be dropped into other projects with minimal modification. If we change any of them, either because of a bug or because we wanted to try something different, we can use make to rebuild only the parts of the chain that depend on the change. And if we just type the command make, it automatically makes the first recipe, so we can be sure that our output.csv reflects all the latest and greatest work we’ve put in.
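Since every recipe boils down to python $^ -o $@, the shared boilerplate just has to make each script take its inputs as positional arguments and its output file via -o. Here is a minimal, hypothetical clean_text.py along those lines (the “cleaning” itself is only a placeholder):

# scripts/clean_text.py -- hypothetical boilerplate matching `python $^ -o $@`
import argparse
import csv
from pathlib import Path

def main():
    parser = argparse.ArgumentParser(description="Clean raw text for analysis.")
    parser.add_argument("inputs", nargs="+", help="raw files or directories")
    parser.add_argument("-o", "--output", required=True, help="cleaned CSV to write")
    args = parser.parse_args()

    rows = []
    for path in map(Path, args.inputs):
        if path.is_dir():
            files = sorted(p for p in path.rglob("*") if p.is_file())
        else:
            files = [path]
        for f in files:
            text = f.read_text(errors="ignore")
            rows.append((f.name, " ".join(text.split()).lower()))  # placeholder cleaning

    with open(args.output, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["document", "text"])
        writer.writerows(rows)

if __name__ == "__main__":
    main()

With that convention in place, make data/clean/trainers.csv ends up running the script with data/raw/trainers as its input and the target file as its output.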

There’s a lot more to Make, and I’ll focus in on a few features, tips, and tricks in an occasional series here. It’s been a big help to me, and if you find it helpful too, I’d love to hear from you!

  1. I said I did it, not that I’m proud of it. 

  2. The directory itself, mind you, not the files inside it! 

9 February 2016

Just before the Iowa Caucuses, I showed how you can make an 80-percent-of-the-way-there poll-based election forecasting system with surprisingly little work. I called it the 430 Model, because 430 is about 80 percent of 538, and I was ripping off Nate Silver. My model did OK on the Democratic side, though (like the polls) it was off on the Republican side; Trump’s support was overstated, or rather the opposition to Trump was understated.

But, since people have been interested, I’ve updated the model a bit and made forecasts for New Hampshire! I’ve simplified the weighting decay function to halve the weighted sample size for a poll each day1. I’ve also limited my results to polls with likely voter screens, and set the window of polls included in the super-poll by number instead of by date. You can download and play around with the new model as a Jupyter notebook here.
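The decay itself is one line. As a sketch (the half-life parameter is just my generalization of “halve it each day”):

def effective_sample_size(sample_size, age_days, half_life_days=1.0):
    # Halve a poll's weighted sample size for each day since it was taken.
    return sample_size * 0.5 ** (age_days / half_life_days)

# A 900-respondent poll from three days ago counts like a 112-person poll today:
effective_sample_size(900, 3)   # 112.5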

So, let’s get to the fun!

This is what the New Hampshire race has looked like over the last three weeks on the GOP side, when polling has happened pretty steadily:

New Hampshire GOP Chart

The picture’s been pretty stable the whole time. There are three tiers: the down-and-outs (Christie, Fiorina, Carson), the hope-for-seconds (Rubio, Kasich, Cruz, Bush), and Trump. The super-poll results bear that out:

| Candidate | Estimate | Standard Error |
| --- | --- | --- |
| Donald Trump | 31.2% | 1.7% |
| Marco Rubio | 14.5% | 1.3% |
| John Kasich | 13.8% | 1.2% |
| Ted Cruz | 11.7% | 1.2% |
| Jeb Bush | 11.0% | 1.1% |
| Chris Christie | 5.8% | 0.8% |
| Carly Fiorina | 4.6% | 0.7% |
| Ben Carson | 2.6% | 0.6% |

Rubio’s bump has gotten him to the top of the second tier, but realistically those guys are all tied, and each of them (except Cruz) really, really wants a second place finish. First place seems utterly out of reach for all of them; in my 10,000-trial simulation, Trump won every single time.

On the Democratic side, things seem equally fatalistic:

New Hampshire Dem Chart

Not even a story there. Bernie Sanders is winning. So also says the super-poll:

| Candidate | Estimate | Standard Error |
| --- | --- | --- |
| Bernie Sanders | 55.1% | 1.9% |
| Hillary Clinton | 40.9% | 1.9% |

As Trump did on the GOP side, Sanders wins every time in a 10,000-trial simulation.

But.

Polls have historically been bad in New Hampshire. These projections are a great summary of what the polls are telling us, but if the polls are consistently off by a significant amount, we could still get some real surprises. If I were Sanders or (especially) Trump, I would actually wish my numbers were a little worse; in both cases, not dominating is going to look like a loss. It’s entirely possible that one of those second-tier GOP candidates is really going to break out, though I have no idea who it would be or why.

Or maybe they’ve finally learned how to poll New Hampshire and this is all effectively a done deal. We’ll find out tonight!

  1. This sounds more extreme than it is. A poll with 300 respondents has a margin of error of about 2.9 percent, one with 150 respondents about 4.1 percent, and one with 75 respondents about 5.8 percent. When you think about how many people can change their minds every day during an election, that looks reasonable to me.
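For reference, those figures match one standard error for a proportion at a 50/50 split, sqrt(0.5 * 0.5 / n); a quick check:

import math

def sampling_error(n, p=0.5):
    # One standard error for a proportion p estimated from n respondents.
    return math.sqrt(p * (1 - p) / n)

for n in (300, 150, 75):
    print(n, round(100 * sampling_error(n), 1))   # 2.9, 4.1, 5.8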