Real Data Science pt1: Review of Numbersense

So far, when I’ve written on Data Science topics I’ve written about the fun part: the statistical analysis, graphs, conclusions, insights, etc. For this next series of postings, I’m going to concentrate more on what we can call Real Data Science®: the less glamorous side of the job, where you have to beat your data and software into submission, where you don’t have access to the tools or data you need, and so on. In other words, where you spend the vast majority of your time as a Data Scientist.

I’ll start the series with a review of Kaiser Fung’s Numbersense, published in 2013. It’s not mainly about Real Data Science, but I’ll start with it because it’s a great book that illustrates several common data pitfalls. Also, in the epilogue Kaiser shares one of his own Real Data Science stories, and I found myself nodding my head and saying, “Yup, that’s how I spent several days in the last couple of weeks!”

Numbersense is a wonderful and accessible book that consists of a series of stories about data that illustrate how to think about the kinds of statistics you read about on a daily basis. The emphasis isn’t mathematical. It’s more about when you should think, “Hmmm… that doesn’t sound right,” when you hear some statistics thrown around. The summary on the jacket cover mentions Big Data, but none of the principles depend on Big Data, so they’re applicable in most any situation.

The Prologue throws out several short stories to illustrate how underlying assumptions can fool you: how Bill Gates was fooled about the efficacy of small schools versus larger schools, how airline on-time statistics can say totally different things depending on how you slice them, and how Mitt Romney’s pollsters allowed themselves to be blindsided by the elections. After that, each chapter revolves around a more in-depth story. These range from how Kaiser developed some metrics to help a friend in his Fantasy Football league, to how economists seasonally adjust data, to why economists are puzzled about consumer perceptions of inflation. (Turns out, seasonal adjustment is useful and important, but “core inflation” is misleading both from an economic policy viewpoint and from an understanding-consumers viewpoint.)

Kaiser’s a good storyteller, and he dives deeply into the whole context and environment of each story. This gives the book a great flavor, though I’ll have to warn you that, depending on your interests, you’ll probably find one or two stories less interesting than the rest and might need to go into skim mode for those. In my case, the Bureau of Labor Statistics (BLS) chapter was fascinating in all of its texture and flavor, but the Fantasy Football chapter got too deeply into personalities for my taste.

Each story is real-world in that you don’t always know where Kaiser’s heading when it starts, which is a lot like data exploration in the real world. It can be a little disorienting to suddenly realize that the economists who were the good guys in the last story are the bad guys in the current one, but that’s a small price to pay.

Remember not to overlook the Epilogue! It’s a Real Data Science story that illuminates the part of Data Science that you don’t usually read about.

One of my favorite topics in the book was counterfactuals: what are the alternatives to what actually happened, and how do you determine an appropriate baseline for comparing results before and after something changed? In a designed experiment, you establish treatment and control groups, and if the experiment is well-designed you’ll have a very good idea of the actual difference the treatment makes. But when you’re just handed some data, you only know what did happen, and you have to do some work to figure out what realistically would have happened if that change hadn’t been made.

For example, your company brings up a new website and at the end of the year they proclaim how much new business the site has brought in. The ROI is incredible… until you realize that they’re making the assumption that 100% of the business done through the website would not have occurred if not for the new version. A more realistic baseline would acknowledge that some of the website business is due to the new website but some of it would have occurred anyhow (over the phone, in person, via mail, via the older website, etc.), and the trick is to try to allocate the website’s business appropriately to find a realistic ROI. That’s counterfactuals.
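
To make the arithmetic concrete, here’s a minimal sketch in Python. Every number in it is made up for illustration, and the names (site_cost, site_revenue, baseline_revenue) are hypothetical, not anything from the book. The one real assumption is that you can somehow estimate how much of the website’s business would have come in anyway. That estimate is the hard part, and the code simply takes it as given.

```python
def roi(gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    return (gain - cost) / cost


# All figures are hypothetical, chosen only to illustrate the idea.
site_cost = 200_000.0       # what the new website cost to build and run
site_revenue = 1_000_000.0  # business that came in through the new website

# Counterfactual baseline (an assumption): business that would have
# arrived anyway by phone, in person, by mail, or via the older website.
baseline_revenue = 700_000.0

# Naive view: credit the new site with 100% of its business.
naive_roi = roi(site_revenue, site_cost)

# Counterfactual view: credit only the business above the baseline.
incremental_revenue = site_revenue - baseline_revenue
realistic_roi = roi(incremental_revenue, site_cost)

print(f"Naive ROI:     {naive_roi:.0%}")
print(f"Realistic ROI: {realistic_roi:.0%}")
```

With these invented numbers the naive ROI is 400% and the baseline-adjusted ROI is 50%. The site still pays for itself, but the story is a lot less incredible once the baseline is something other than zero.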

All in all, it’s a great book that’s entertaining and educational, that you can read in bite-sized chunks, and that you’ll find useful in our data-intensive society.