So far, when I’ve written on Data Science topics I’ve written about the fun part: the statistical analysis, graphs, conclusions, insights, etc. For this next series of postings, I’m going to concentrate more on what we can call Real Data Science®: the less glamorous side of the job, where you have to beat your data and software into submission, where you don’t have access to the tools or data you need, and so on. In other words, where you spend the vast majority of your time as a Data Scientist.
I’ll start the series with a review of Kaiser Fung’s Numbersense, published in 2013. It’s not mainly about Real Data Science, but I’ll start with it because it’s a great book that illustrate several common data pitfalls, and in the epilogue Kaiser shares one of his own Real Data Science stories and I found myself nodding my head and saying, “Yup, that’s how I spent several days in the last couple of weeks!”
Longitudinal Structural Equation Modeling, Todd D. Little, Guilford Press 2013.
Let me start by saying that this is one of the best textbooks I’ve ever read. It was written as if the author was our mentor, and I really get the feeling that he’s sharing his wisdom with us rather than trying to be pedagogically correct. The book is full of insights on how he thinks about building and applying SEMs, and the lessons he’s learned the hard way.
I’ve just discovered a unique app on the Mac App Store called Calca. It’s like a simple word-processor, except you can define variables and functions and do arithmetic with them, and it understands units and currencies and it handles matrices and vectors, and supports basic Markdown, and … it’s pretty amazing.
I’ve been working with some linear programming (LP) lately, and have looked at a bunch of non-commercial, non-Academic-use tools for LP, and in particular IP (integer programming). Open source solvers I’ve looked at include:
Gnu GLPK’s glpsol
I’m working on a scheduling project and have so far been using linear programming tools. As I look down the road, I’m beginning to think that I’ll need to supplement or move beyond LP, perhaps with something like constraint programming (CP) techniques. Unfortunately, the CP arena seems to be littered with tools that are highly-spoken of but are no longer (or barely) supported. Anyone out there who can address CP tools?
So far, I’m leaning towards OscaR, because it seems fairly complete and it’s in Scala which is the Official General-Purpose Language of Thinkinator. I’ve looked at a variety of CP tools like Zinc (seems dead) and MiniZinc (seems barely alive), Geocode (C++), JaCoP (java-based), Choco (java-based), Comet (seems dead), etc. It’s a bit worrisome when I compare the level of activity and the apparent long-term stability of these projects to linear programming (Symphony, glpsol, lpsolv, etc).
Any experiences or insights on CP tools? (I’d prefer standalone tools with high-level syntax or Scala/Java-based tools, and not C++/Prolog-based tools.) Thanks!
There are many free online training opportunities, and many of them are reasonable experiences. For example, you can watch videos from Stanford on Youtube, and I’m sure other services come to mind. Today I want to recommend Coursera, which I’m pretty impressed with.
When I first heard of SSA (Singular Spectrum Analysis) and the EMD (Empirical Mode Decomposition) I though surely I’ve found a couple of magical methods for decomposing a time series into component parts (trend, various seasonalities, various cycles, noise). And joy of joys, it turns out that each of these methods is implemented in R packages:
In this posting, I’m going to document some of my explorations of the two methods, to hopefully paint a more realistic picture of what the packages and the methods can actually do. (At least in the hands of a non-expert such as myself.)