# For Data Projects, Write Programs not Scripts

12 Mar 2015

One thing I see a lot in data work is scripts that look something like this:

    """
    Some Data Analysis
    """

    import some_library

    # some_library and its functions are placeholders for whatever you use
    data = some_library.read_file("input.csv")
    data = data.select(["this", "that", "the other"])
    # Bunch of work goes here
    some_library.write_file(data, "output.csv")
    some_library.make_pretty_chart(data, "chart.png")

Sometimes you’ll get a few functions in there, and sometimes even a main function, but the central driving idea is that you write a script to do the specific task at hand.

In theory that sounds eminently sensible; in practice it makes your life harder.

The whole point of computers is that they do repetition better than we do, and there’s almost no step in data work that you will only want to do once.

A better solution is to write every script as if you intended it to be a stand-alone program. For example, we could make the above pseudo-script look like this:

    """
    Do some data analysis
    """
    import argparse

    import some_library


    def analyze(data):
        """Do analysis on data and return result"""
        # Data work goes here
        return data


    def write_output(data, output_file, output_img):
        """
        Write data file to output_file and
        write chart to output_img
        """
        some_library.write_file(data, output_file)
        some_library.make_pretty_chart(data, output_img)


    def create_output_from_file(input_file, output_file, output_img):
        """
        Read data from input_file and analyze it
        Write analyzed data to output_file
        Write chart to output_img
        """
        # read_file is a placeholder, like the rest of some_library
        data = some_library.read_file(input_file)
        data = analyze(data)
        write_output(data, output_file, output_img)


    def parse_args():
        """Read the three file paths from the command line"""
        parser = argparse.ArgumentParser(description=__doc__)
        parser.add_argument("input_file", help="data file to read")
        parser.add_argument("output_file", help="where to write the analyzed data")
        parser.add_argument("output_img", help="where to write the chart")
        return parser.parse_args()


    def main():
        args = parse_args()
        create_output_from_file(args.input_file,
                                args.output_file,
                                args.output_img)


    if __name__ == "__main__":
        main()

You would just run this script using python script_name.py input.csv output.csv output.png to get the same result as the original. The difference is, this version of the script is completely portable; you can take it and drop it into any project where you need the functionality.

Additionally, it’s import-safe, so you can pull your analyze or write_output function into another script.
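To see what import-safe means in practice, here is a small self-contained sketch. Everything in the embedded script (the CALLS list, the doubling "analysis") is a hypothetical stand-in; the only part that matters is the `if __name__ == "__main__"` guard. The sketch writes a tiny script to a temporary directory, then loads it once the way an import would and once the way `python script_name.py` would.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Hypothetical stand-in for script_name.py; only the guard at the bottom matters.
SCRIPT_SOURCE = '''
CALLS = []  # records whether main() ran


def analyze(data):
    """Stand-in analysis: double every value."""
    return [2 * value for value in data]


def main():
    CALLS.append("main ran")


if __name__ == "__main__":
    main()
'''


def load_like_import(path):
    """Load the file at path under the module name script_name, as an import would."""
    spec = importlib.util.spec_from_file_location("script_name", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = Path(tmp_dir) / "script_name.py"
    script_path.write_text(SCRIPT_SOURCE)

    # Importing defines the functions but skips main()
    imported = load_like_import(script_path)
    # Running the file as a program triggers the guard, so main() runs
    executed = runpy.run_path(str(script_path), run_name="__main__")

print(imported.CALLS)               # []
print(executed["CALLS"])            # ['main ran']
print(imported.analyze([1, 2, 3]))  # [2, 4, 6]
```

In a real project you would skip all the temp-file machinery and simply write `from script_name import analyze` in the other script.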

Now of course, you still want to put your specific filepaths for the specific project you’re working on somewhere, so that you don’t have to type them over and over. You could do that using a shell script, a master Python script, or even (and ideally) a makefile. But here again, the organization works to your advantage; you can separate your functionality into small, clean, reusable scripts and still have one file that concisely tells you what’s happening from beginning to end.
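As a rough sketch of the master Python script option, something like the following would do. The step names and file paths (clean_data.py, raw_data.csv and so on) are hypothetical, and printing the commands by default with an opt-in --execute flag is just one possible design, not the only sensible one.

```python
"""
Run every step of a (hypothetical) project from one place.
"""
import argparse
import subprocess
import sys

# Hypothetical file names: each project path is written down once, here.
STEPS = [
    ("clean_data.py", ["raw_data.csv", "input.csv"]),
    ("script_name.py", ["input.csv", "output.csv", "output.png"]),
]


def build_command(script, script_args):
    """Return the command line that runs one step with the current interpreter."""
    return [sys.executable, script, *script_args]


def run_steps(steps, execute=False):
    """Print each step's command; also run it when execute is True."""
    for script, script_args in steps:
        command = build_command(script, script_args)
        print(" ".join(command))
        if execute:
            # check=True stops the pipeline at the first failing step
            subprocess.run(command, check=True)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--execute", action="store_true",
                        help="run the steps instead of only printing them")
    return parser.parse_args(argv)


def main():
    args = parse_args()
    run_steps(STEPS, execute=args.execute)


if __name__ == "__main__":
    main()
```

Reading STEPS top to bottom tells you what happens from beginning to end, and each step remains a stand-alone program you can run by hand. A makefile gives you the same overview and also skips steps whose outputs are already up to date.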

This, of course, is nothing but bringing the Unix Philosophy to data work:

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

—Doug McIlroy

Well, OK, I didn’t get to text streams. But one thing at a time.

# Of Note from NBER

09 Mar 2015

Here’s what’s going into my to-read pile from today’s NBER paper drop:

• “A Retrospective Look at Rescuing and Restructuring General Motors and Chrysler” by Goolsbee and Krueger:

This paper takes a retrospective look at the U.S. government’s effort to rescue and restructure General Motors and Chrysler in the midst of the 2009 economic and financial crisis. The paper describes how two of the largest industrial companies in the world came to seek a bailout from the U.S. government, the analysis used to evaluate their request, and the steps taken by the government to rescue them. The paper also summarizes the performance of the U.S. auto industry since the bailout and draws some general lessons from the episode.

• Starting in the late 1990s, China undertook a dramatic transformation of the large number of firms under state control. Small state-owned firms were privatized or closed. Large state-owned firms were corporatized and merged into large industrial groups under the control of the Chinese state. The state also created many new and large firms. We use detailed firm-level data to show that from 1998 to 2007, (i) state-owned firms that were closed were smaller and had low labor and capital productivity; (ii) the labor productivity of state-owned firms converged to that of private firms; (iii) the capital productivity of state-owned firms remained significantly lower than that of private firms; and (iv) total factor productivity (TFP) growth of state-owned firms was faster than that of private firms. We find the reforms of the state sector were responsible for 20 percent of aggregate TFP growth from 1998 to 2007.

• Policymakers frequently propose to use capital tax reform to stimulate investment and increase labor earnings. This paper tests for such real impacts of the 2003 dividend tax cut—one of the largest reforms ever to a U.S. capital tax rate—using a quasi-experimental design and a large sample of U.S. corporate tax returns from years 1996-2008. I estimate that the tax cut caused zero change in corporate investment, with an upper bound elasticity with respect to one minus the top statutory tax rate of .08 and an upper bound effect size of .03 standard deviations. This null result is robust across specifications, samples, and investment measures. I similarly find no impact on employee compensation. The lack of detectable real effects contrasts with an immediate impact on financial payouts to shareholders. Economically, the findings challenge leading estimates of the cost-of-capital elasticity of investment, or undermine models in which dividend tax reforms affect the cost of capital. Either way, it may be difficult for policymakers to implement an alternative dividend tax cut that has substantially larger near-term effects.

# The Metric System vs. the Soul

26 Feb 2015

I found this article on Facebook[1], which argues that the Fahrenheit scale is better for everyday use than the Celsius scale because it corresponds to a human range of hot and cold, rather than to the scientific but arbitrary freezing and boiling points of water. I find this argument obviously correct.

But in fact, the Celsius scale is only the tip of the 32-degree iceberg. I hold that the whole metric system dehumanizes us, when we use it out of its proper context.

That’s not me being funny to make a point: I actually believe that using the metric system for everything cheapens the human experience. Some people use this map, with countries that use the metric system in green and those that do not in gray, to mock the United States as a hopeless yokel of a nation:[2]

To me, that map shows the US as a lone holdout of common sense and civilization.

The metric system was developed to accomplish a few specific goals. It simplifies scaling units up or down through its omnipresent powers of ten. It aligns, where possible, different kinds of measurement; a cubic centimeter of water is also one milliliter, and at four degrees Celsius it has a mass of one gram. In the laboratory, say, or in large-scale manufacturing, these properties are no doubt desirable, because the extreme precision required comes most easily when unencumbered by factors purely human.

And it is for the exact same reason that the metric system ruins the glory and splendor and even romance of everyday life.

A meter, for example, is the distance light travels in about one three-hundred-millionth of a second. That is a very precise definition, but to any person who does not go around noting the precise locations of photons, it is a useless definition. A foot, on the other hand, is about the length of a man’s foot when he wears a shoe.

A liter is the volume of a container 10 centimeters long, wide, and high. A cup is about as much as you get in a cup. A pint is two of those; the perfect size for a serving of beer.

The metric system has no connection to humanity as such. You can see this just by looking at the arts.

When Shylock demands a pound of flesh, we shudder; if he demanded a kilogram, we would laugh. When Falstaff says “Peace, good pint-pot” to the hostess, he is having a good time; if he said “Peace, good point-five-liter-pot,” he would be a pedant.

No-one would be much moved if Frost sighed “But I have promises to keep and kilometers to go before I sleep, and kilometers to go before I sleep.”

I do not say that no-one has ever written or will ever write a poem about a kilometer; only that I doubt that anyone has written or will write a good one.

It does no good to say that measurement has nothing to do with art. That answer proves my point; it loses that part of the human experience that sees the romance in the mile of a thousand steps, that perceives the relationship of man to the cosmos.

The imposition of the metric system on the public first occurred during the French Revolution. If it was the most minor atrocity of the Jacobins’ bloody and merciless rationalism, it was also the most lasting. It embodies the Revolution’s determination to cram the majestic complexity of the world into a human mechanical design.

When someone says that we should give up our old miles for kilometers or pounds for kilograms, what they are really saying is that our everyday life should be more like a machine, or a laboratory, or a mass production facility; that it should be less like humanity, and less like life.

I prefer humanity to machinery, and I value art over easy convertibility. And if I am the last man to measure my journeys in miles, I will probably be the man who enjoys them most along the way.

1. Hat tip to Nancy vander Veer

12 Jan 2015

Today I’m reading a few papers from NBER:

• “Cognitive Economics is the economics of what is in people’s minds. It is a vibrant area of research (much of it within Behavioral Economics, Labor Economics and the Economics of Education) that brings into play novel types of data—especially novel types of survey data. Such data highlight the importance of heterogeneity across individuals and highlight thorny issues for Welfare Economics. A key theme of Cognitive Economics is finite cognition (often misleadingly called “bounded rationality”), which poses theoretical challenges that call for versatile approaches. Cognitive Economics brings a rich toolbox to the task of understanding a complex world.”

• Austerity in 2009-2013, ungated version here:

“The conventional wisdom is (i) that fiscal austerity was the main culprit for the recessions experienced by many countries, especially in Europe, since 2010 and (ii) that this round of fiscal consolidation was much more costly than past ones. The contribution of this paper is a clarification of the first point and, if not a clear rejection, at least it raises doubts on the second.”

I’m hoping that this paper on austerity will be a little more illuminating than the fly-by analysis I was talking about earlier.

# Austerity Arguments are a Mess (Chart Fight!)

12 Jan 2015

Quick chart fight. A while back, Matt Yglesias posted this, saying that “2014 is the year American austerity came to an end”:

Econ blogger Angus argued that Yglesias is trying to re-define austerity because we’re now seeing some decent growth. He posted the nominal graph and quipped, “Either austerity means nominal cuts and we never had any of it, or austerity means cuts relative to trend and we are still savagely in its grasp”:

Kevin Drum says that’s bogus, because you have to look at real spending per capita, like so:

So here’s my entry. I’m going to add two economic indicators to that same chart: growth in real GDP per capita, and the prime-age employment-population ratio (which I like better than unemployment):

To put growth and the E-P ratio on the same scale, I’ve arbitrarily subtracted 79 percentage points from the E-P ratio, which is about its average over the period in question. It’s the trend, not the level, that matters.
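The rescaling is nothing more than subtracting a constant from one series so that both hover near zero. A minimal sketch, using made-up placeholder numbers rather than the data behind the chart (the recenter helper and the variable names are likewise hypothetical):

```python
def recenter(series, offset):
    """Subtract a constant offset from every value in a {year: value} series."""
    return {year: round(value - offset, 1) for year, value in series.items()}


# Placeholder values for illustration only; NOT the figures plotted in the chart.
ep_ratio_pct = {2007: 80.0, 2010: 75.0, 2014: 77.0}
OFFSET = 79.0  # roughly the average of the series, chosen by eye

ep_recentered = recenter(ep_ratio_pct, OFFSET)
print(ep_recentered)  # {2007: 1.0, 2010: -4.0, 2014: -2.0}

# Shifting by a constant changes the level but not the year-to-year movement.
change_before = ep_ratio_pct[2014] - ep_ratio_pct[2010]
change_after = ep_recentered[2014] - ep_recentered[2010]
print(change_before == change_after)  # True
```

Because the offset is constant, the shape of the line is untouched; only its position on the axis moves, which is why the choice of 79 can be arbitrary.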

The point, as I see it, is this: to make an argument about the “end of austerity” and what it means, you have to look at that graph and say that the 2014 part of that chart is meaningfully different from the 2009-2013 part. If you see that, you have better eyes than I do.

This is why people don’t trust economists or economics writers. It’s why they shouldn’t. You can’t tell anything from that graph, and claiming you can means you’re at best overstating your case, and at worst lying. It can be a data point[1], but only as part of a larger analysis, and I haven’t seen any that I’m particularly thrilled about or ready to bank on.

1. Paul Krugman, for what it’s worth, has taken this route; Scott Sumner responds to him and Simon Wren-Lewis here.