I did a quick Google search on “Stata for R users” (both as separate words and as a quoted phrase) and there really isn’t much out there. At best, there are a couple of equivalence guides that show you how to do certain tasks in both programs. (Plus a whole lot of “R for (ex-) Stata users” articles.) I’m writing this post, as a long-term R user who recently bought Stata, because I believe that Stata is a good complement to R, and many R users should consider adding it to their toolbox.
I’m going to write this in two parts. Part One will describe why an R user might be interested in Stata, with various Stata examples. Part Two will give specific tips and warnings to R users who do decide to use Stata.
I’m a big fan of R, and it will be my primary tool for a long time, but I wanted to add another tool to my toolbox and decided on Stata. Stata 13 was just released (June 2013), and I have to say that it’s a very nice package.
Why would anyone pick Stata over R? R has many advantages, but here are some reasons that you might pick Stata:
In Part 4 of this series, I created a Bayesian model in Stan. A member of the Stan team, Bob Carpenter, has been so kind as to send me some comments via email, and has given permission to post them on the blog. This is a great opportunity to get some expert insights! I’ll put his comments as block quotes:
That’s a lot of iterations! Does the data fit the model well?
I hadn’t thought about this. Bob noticed in both the call and the summary results that I’d run Stan for 300,000 iterations, and it’s natural to wonder, “Why so many iterations? Was it having trouble converging? Is something wrong?” The stan command defaults to four chains, each with 2,000 iterations, and one of the strengths of Stan’s HMC algorithm is that, although each iteration is a bit slower than with other methods, it mixes much faster, so you don’t need nearly as many iterations. So 300,000 is a bit excessive. It turns out that if I run for 2,000 iterations, it takes 28 seconds on my laptop and mixes well. Most of the 28 seconds is taken up by compiling the model, since I can get 10,000 iterations in about 40 seconds.
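To make the arithmetic behind those defaults concrete, here is a minimal R sketch. It assumes rstan's usual behavior of discarding the first half of each chain as warmup and keeping every remaining draw; the function name post_warmup_draws is my own, not part of rstan.

```r
# Number of post-warmup draws a run would keep, assuming warmup defaults
# to half of each chain and no thinning (an assumption, see the text above).
# post_warmup_draws is a hypothetical helper, not an rstan function.
post_warmup_draws <- function(chains = 4, iter = 2000,
                              warmup = floor(iter / 2)) {
  chains * (iter - warmup)
}

# Default settings: four chains of 2,000 iterations each.
default_draws <- post_warmup_draws()

# The same calculation if each chain ran for 10,000 iterations.
longer_draws <- post_warmup_draws(iter = 10000)

print(default_draws)
print(longer_draws)
```

Under these assumptions the defaults already leave a few thousand draws for inference, which is why a very long run looks unusual to someone reading the summary.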
So why 300,000 iterations? For the silly reason that I wanted lots of samples to make the CIs in my plot as smooth as possible. Stan is pretty fast, so it only took a few minutes, but I hadn’t thought of the implication of appearing to need that many iterations to converge.
Last time, we modeled the Association’s electricity expenditure using Bayesian analysis. Besides the fact that MCMC and Bayesian are sexy and resume-worthy, what have we gained by using Stan? MCMC runs more slowly than alternatives, so it had better be superior in other ways, and in this posting, we’ll look at an example of how. I’d recommend pulling the previous posting up in another browser window or tab, and positioning the “Inference for Stan model” table so that you can quickly consult it in the following discussion.
If you look closely at the numbers, you may notice that the high season (warmer-high, ratetemp 3, beta) appears to have a lower slope than the mid season (warmer-low, ratetemp 2, beta), as was the case in an earlier model. This seems backwards: the high season should cost more per additional kWh, and thus should have a higher slope. This raises two questions: 1) is the apparent slope difference real, and 2) if it is real, is there some real-world basis for this counter-intuitive result?
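One way to approach the first question with MCMC output is to compare the two slopes draw by draw. The R sketch below is only an illustration: the draws are simulated from made-up normal distributions, not taken from the model in the previous posting, and the names beta_mid and beta_high are placeholders. With a fitted rstan model, the draws would instead come from something like rstan::extract().

```r
set.seed(42)
n_draws <- 4000

# HYPOTHETICAL posterior draws for the two slopes. These numbers are
# invented for illustration and are NOT results from the post's model.
beta_mid  <- rnorm(n_draws, mean = 1.00, sd = 0.10)  # mid season (ratetemp 2)
beta_high <- rnorm(n_draws, mean = 0.90, sd = 0.10)  # high season (ratetemp 3)

# Difference in slopes for each draw.
slope_diff <- beta_high - beta_mid

# Posterior probability that the high-season slope is lower.
prob_high_lower <- mean(slope_diff < 0)

# 95% interval for the difference.
ci_diff <- quantile(slope_diff, probs = c(0.025, 0.975))

print(prob_high_lower)
print(ci_diff)
```

If the interval for the difference comfortably straddles zero, the apparent ordering of the slopes may be nothing more than noise, and the second question never needs answering.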
This is the fourth article in the series, where the techiness builds to a crescendo. If this is too statistical/programming geeky for you, the next posting will return to a more investigative and analytical flavor. Last time, we looked at a fixed-effects model:
m.fe <- lm (dollars ~ 1 + regime + ratetemp * I(dca - 55))
which looks like a plausible model and whose parameters are all statistically significant. A question that might arise is: why not use a hierarchical (AKA multilevel, mixed-effects) model instead? While we’re at it, why not go full-on Bayesian as well? It just so happens that there is a great new tool called Stan which fits the bill and which also has an rstan package for
“Sometimes people think that if a coefficient estimate is not significant, then it should be excluded from the model. We disagree. It is fine to have nonsignificant coefficients in a model, as long as they make sense.” Gelman & Hill 2007, page 42
“Include all variables that, for substantive reasons, might be expected to be important.” Ibid, page 69.
When a field adopts a common word and uses it in a technical sense, it’s sometimes lucky and sometimes unlucky.
In a previous installment, we looked at modeling electricity usage for the infrastructure and common areas of a condo association, and a fairly simple model was reasonably accurate. This makes sense in a large system such as a mid-sized, high-rise condo building, which has a lot of electricity-usage inertia. The cost of that electricity has a lot more variability, however, because of rate changes (increases and decreases), refunds, high/low seasons, and other factors that affect the bottom line.