As you probably know, I’m a big fan of R’s
brms package, available from CRAN. In case you haven’t heard of it,
brms is an R package by Paul-Christian Buerkner that implements Bayesian regression of all types using an extension of R’s formula specification that will be familiar to users of
lmer. Under the hood, it translates the formula into Stan code, Stan translates this to C++, your system’s C++ compiler is used to compile the result and it’s run.
brms is impressive in its own right. But also impressive is how it continues to add capabilities and the breadth of Buerkner’s vision for it. I last posted something way back on version 0.8, when
brms gained the ability to do non-linear regression, but now we’re up to version 1.1, with 1.2 around the corner. What’s been added since 0.8, you may ask? Here are a few highlights:
There are several reasons why everyone isn’t using Bayesian methods for regression modeling. One reason is that Bayesian modeling requires more thought: you need pesky things like priors, and you can’t assume that if a procedure runs without throwing an error that the answers are valid. A second reason is that MCMC sampling — the bedrock of practical Bayesian modeling — can be slow compared to closed-form or MLE procedures. A third reason is that existing Bayesian solutions have either been highly-specialized (and thus inflexible), or have required knowing how to use a generalized tool like BUGS, JAGS, or Stan. This third reason has recently been shattered in the R world by not one but two packages:
rstanarm. Interestingly, both of these packages are elegant front ends to Stan, via
This article describes
rstanarm, how they help you, and how they differ.
In Part 4 of this series, I created a Bayesian model in Stan. A member of the Stan team, Bob Carpenter, has been so kind as to send me some comments via email, and has given permission to post them on the blog. This is a great opportunity to get some expert insights! I’ll put his comments as block quotes:
That’s a lot of iterations! Does the data fit the model well?
I hadn’t thought about this. Bob noticed in both the call and the summary results that I’d run Stan for 300,000 iterations, and it’s natural to wonder, “Why so many iterations? Was it having trouble converging? Is something wrong?” The
stan command defaults to four chains, each with 2,000 iterations, and one of the strengths of
Stan‘s HMC algorithm is that the iterations are a bit slower than other methods but it mixes much faster so you don’t need nearly as many iterations. So 300,000 is a bit excessive. It turns out that if I run for 2,000 iterations, it takes 28 seconds on my laptop and mixes well. Most of the 28 seconds is taken up by compiling the model, since I can get 10,000 iterations in about 40 seconds.
So why 300,000 iterations? For the silly reason that I wanted lots of samples to make the CI’s in my plot as smooth as possible. Stan is pretty fast, so it only took a few minutes, but I hadn’t thought of implication of appearing to need that many iterations to converge.
Last time, we modeled the Association’s electricity expenditure using Bayesian Analysis. Besides the fact that MCMC and Bayesian are sexy and resume-worthy, what have we gained by using
Stan? MCMC runs more slowly than alternatives, so it had better be superior in other ways, and in this posting, we’ll look at an example of how. I’d recommend pulling the previous posting up in another browser window or tab, and position the “Inference for Stan model” table so that you can quickly consult it in the following discussion.
If you look closely at the numbers, you may notice that the high season (warmer-high, ratetemp 3, beta) appears to have a lower slope than the mid season (warmer-low, ratetemp 2, beta), as was the case in an earlier model. This seems backwards: the high season should cost more per additional kWh, and thus should have a higher slope. This raises two questions: 1) is the apparent slope difference real, and 2) if it is real, is there some real-world basis for this counter-intuitive result?
This is the fourth article in the series, where the techiness builds to a crescendo. If this is too statistical/programming geeky for you, the next posting will return to a more investigative and analytical flavor. Last time, we looked at a fixed-effects model:
m.fe <- lm (dollars ~ 1 + regime + ratetemp * I(dca - 55))
which looks like a plausible model and whose parameters are all statistically significant. A question that might arise is: why not use a hierarchical (AKA multilevel, mixed-effects) model instead? While we’re at it, why not go full-on Bayesian as well? It just so happens that there is a great new tool called
Stan which fits the bill and which also has an
rstan package for
In the first posting of this series, I simply applied Bayes Rule repeatedly: Posterior Prior Likelihood. I didn’t have to know anything about conjugate priors, hyperparameters, sufficient statistics, parametric forms, or anything beyond the basics. I got a reasonable posterior for and used that to find the correct answer. Why go beyond this?
Well, first, there’s really no good way for me to communicate my distribution to anyone else. It’s a (long) vector of values, and that’s the weakness of a non-parametric system: there is no well-known function and no sufficient statistics to easily describe what I’ve discovered. (Of course, it’s possible that there is no well-know function that’s appropriate, in which case the simulation method is actually the only option.)
Second, my posterior density is discrete. This works reasonably well for the exercise I attempted, but it’s still a discrete approximation which is less precise and can suffer from simulation-related issues (number of samples, etc) that have nothing to do with my proposed model.
Third, if an analytical method can be used, it may be possible to directly calculate a final posterior without repeated applications of Bayes Rule. As I mentioned in the previous posting, Gelman got an answer of Gamma(238, 10) analytically, not through approximations and simulation. If we look in Wikipedia, we can find that the conjugate prior for , the parameter of a Poisson distribution, is the Gamma () distribution, and given our series of accident counts and an initial (posterior) and , the posterior density is Gamma ()
In the last post (Bayesian Data Analysis 1), I ran a Bayesian data analysis using a simple, first-principles approach. Armed with only the fact that a Poisson distribution is appropriate for modeling airplane accidents, Bayes Rule, and R, we got the correct answer to the problem through non-parametric simulation.
Before we get into precision and the other topics in the list of issues to explore, let’s tie up one loose end. I could have gotten the right answer for the wrong reason, so let’s look at the posterior distribution of as compared to the provided answer to make sure we’re close. In the answer, Gelman looks at the parametrical form for and considers conjugate priors and analytically determines that the correct parametric posterior distribution is a Gamma function with and . (More on this in the next posting.) So I simulated out 20,000 samples from that distribution:
rgam <- rgamma (20000, 238, 10)
And then we can compare our posterior distribution with the official, parametric distribution:
Looks reasonably close (though biased by about 0.25) and it didn’t take a lot of fancy machinery to pull off. So why not use this method as our default? Why talk of things like conjugate priors and hyper parameters? And did we just get lucky with numeric precision because we only had 10 accidents and hence only 10 applications of Bayes Rule? Let’s cover that in the next posting, and finish this one with a graph of our answer (not but the predictive distribution of accidents) with the 95% CI :