In Part 4 of this series, I created a Bayesian model in Stan. A member of the Stan team, Bob Carpenter, has been so kind as to send me some comments via email, and has given permission to post them on the blog. This is a great opportunity to get some expert insights! I’ll put his comments as block quotes:
That’s a lot of iterations! Does the data fit the model well?
I hadn’t thought about this. Bob noticed in both the call and the summary results that I’d run Stan for 300,000 iterations, and it’s natural to wonder, “Why so many iterations? Was it having trouble converging? Is something wrong?” The stan command defaults to four chains of 2,000 iterations each, and one of the strengths of Stan’s HMC algorithm is that, although each iteration is a bit slower than in other methods, it mixes much faster, so you don’t need nearly as many iterations. So 300,000 is a bit excessive. It turns out that if I run for 2,000 iterations, it takes 28 seconds on my laptop and mixes well, and most of that time is spent compiling the model, since 10,000 iterations take only about 40 seconds.
So why 300,000 iterations? For the silly reason that I wanted lots of samples to make the CIs in my plot as smooth as possible. Stan is pretty fast, so it only took a few minutes, but I hadn’t thought about the implication of appearing to need that many iterations to converge.
Nonetheless, Bob has some great advice for speeding things up if mixing were actually the problem:
It might mix a bit better with a variable transform on the normal priors (see the manual chapter that includes the funnel example). It might also be handy to scale the predictors so the scales of the alphas and betas aren’t so high.

The main loop could be vectorized a bit painfully by recognizing there are only three values of ratetemp[n]. Sorting each into its own vector at the get-go (it’s known in the transformed data block) and then vectorizing should speed up the code considerably.
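To make the first of those suggestions concrete, here is a minimal sketch of the non-centered reparameterization from the funnel chapter. It assumes the three slopes share a hierarchical normal prior, which may not match my Part 4 model exactly; the names mu_beta, tau_beta, and beta_raw and the hyperprior values are placeholders of my own, and the syntax is current Stan:

parameters {
  real mu_beta;                     // shared mean of the three slopes
  real<lower=0> tau_beta;           // shared scale of the three slopes
  vector[3] beta_raw;               // slopes on the unit scale
}
transformed parameters {
  // Shift-and-scale transform: beta still gets a normal(mu_beta, tau_beta)
  // distribution, but the sampler works on the unit-scale beta_raw.
  vector[3] beta = mu_beta + tau_beta * beta_raw;
}
model {
  beta_raw ~ normal(0, 1);
  mu_beta ~ normal(0, 5);           // placeholder hyperpriors
  tau_beta ~ cauchy(0, 2.5);
  // ... likelihood as before, written in terms of beta ...
}

The sampler then explores the unit-scale beta_raw rather than the funnel-shaped joint distribution of the slopes and their scale, which is where the mixing trouble usually comes from.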
As Bob also says, “Definitely better to start with clear code and then optimize later.” And now that I’ve started with (hopefully) clear code, further improvements can be implemented. The current model doesn’t need a dramatic speedup, but it’s better to practice the techniques on a straightforward model now, rather than waiting until I create a complex model that desperately needs optimization. I believe I can scale the predictors in the transformed data block, and then compensate for the difference it makes in the parameters in a generated quantities block, so I still see what I want to see. (I can do this in R, too, but would prefer to do it in Stan if possible, so that I hand Stan the raw data and get back results that are on the scale of the raw data.)
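Here is roughly what I have in mind, as a sketch rather than the actual Part 4 program: the data names temp, dollars, and ratetemp are my stand-ins, the priors are omitted, and the syntax is current Stan. The standardization lives in transformed data and the back-transform in generated quantities, so the alpha and beta that Stan reports are already on the raw temperature scale:

data {
  int<lower=1> N;
  vector[N] temp;                           // daily temperature
  vector[N] dollars;                        // daily electricity cost
  array[N] int<lower=1, upper=3> ratetemp;  // rate/temperature regime
}
transformed data {
  real temp_mean = mean(temp);
  real temp_sd = sd(temp);
  vector[N] temp_std = (temp - temp_mean) / temp_sd;  // centered, scaled predictor
}
parameters {
  vector[3] alpha_std;        // intercepts on the standardized scale
  vector[3] beta_std;         // slopes on the standardized scale
  real<lower=0> sigma;
}
model {
  // priors omitted in this sketch
  for (n in 1:N)
    dollars[n] ~ normal(alpha_std[ratetemp[n]] + beta_std[ratetemp[n]] * temp_std[n], sigma);
}
generated quantities {
  // Undo the standardization so the reported parameters are on the raw temperature scale
  vector[3] beta = beta_std / temp_sd;
  vector[3] alpha = alpha_std - beta_std * temp_mean / temp_sd;
}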
You could also tighten up the priors, which would probably speed everything up a bit.
This is good news, since I’d initially been a bit nervous that my priors were too tight and had spread them out a bit. However, when I replied that I’d actually enlarged some of the priors until there was little effect on the posterior, hoping for something like a minimally informative prior, he replied:
The priors should be realistic, not set up so as to have no effect.
Which has gotten me thinking: what should the priors be? For the intercepts, alpha[1] and alpha[2] will be dollars at around 55 degrees F, and alpha[3] will be dollars at around 70 degrees F. I’m thinking I might calculate tolerance intervals for the data and go from there. For the slopes (the betas), I’ll have to think a bit more.
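Just to illustrate what “realistic” might look like once I’ve done that homework, the priors could end up something like the following; every number here is a placeholder, not a value I’ve actually derived from the data:

model {
  // Intercepts: plausible daily cost, in dollars, at the reference temperatures
  alpha[1] ~ normal(3, 2);   // placeholder: regime 1 cost near 55 degrees F
  alpha[2] ~ normal(3, 2);   // placeholder: regime 2 cost near 55 degrees F
  alpha[3] ~ normal(5, 3);   // placeholder: regime 3 cost near 70 degrees F
  // Slopes: still to be thought through
  beta ~ normal(0, 1);       // placeholder until I settle on something realistic
  // ... likelihood as before ...
}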
I said that I thought a more-complete model would explicitly account for correlation between slopes and intercepts, to which Bob replied:
If you do full Bayesian posteriors, you get the appropriate covariance in the inferences. Or did you want a more complex prior that uses something like a covariance matrix for the two terms? Andrew discusses models like that in his regression book with Jennifer Hill.
Which is reassuring. Simple is better. In a follow-up, he mentioned how I might add covariance into the model, if I wanted:
You can do the covariance for slope and intercept in Stan. I think it’s more of a matter of how much data you have as to whether you’ll be able to estimate the covariance matrix. It’s very easy. If beta is a 2D vector, instead of saying:
beta[k,1] ~ normal(mu[1],sigma[1]);
beta[k,2] ~ normal(mu[2],sigma[2]);

or equivalently, with vectorization:

beta ~ normal(mu,sigma);

and with vectorized priors:

mu ~ normal(0,2);
sigma ~ cauchy(0,2.5);

you say (we don’t have multivariates vectorized yet):

for (k in 1:K)
  beta[k] ~ multi_normal(mu, sigma' * Omega * sigma);

and for the prior, I’d suggest a scaled version of the LKJ correlation prior:

sigma ~ cauchy(0,2.5);         // sd as before
Omega ~ lkj_correlation(4.0);  // plus correlation matrix

The prior for the correlation matrix Omega will favor the identity matrix.
This kind of model is how I finally came to understand multivariate normals.
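To fill in the declarations that sketch assumes (this assembly is mine, not code Bob sent, and it uses current Stan syntax), the pieces fit together roughly like this. quad_form_diag(Omega, sigma) builds the diag(sigma) * Omega * diag(sigma) covariance matrix that the email writes informally as sigma' * Omega * sigma, and the LKJ prior is spelled lkj_corr in current Stan:

data {
  int<lower=1> K;                    // number of groups
}
parameters {
  array[K] vector[2] beta;           // intercept and slope for each group
  vector[2] mu;                      // population means
  vector<lower=0>[2] sigma;          // population standard deviations
  corr_matrix[2] Omega;              // correlation between intercept and slope
}
model {
  mu ~ normal(0, 2);
  sigma ~ cauchy(0, 2.5);            // sd as before
  Omega ~ lkj_corr(4.0);             // favors the identity (little correlation)
  for (k in 1:K)
    beta[k] ~ multi_normal(mu, quad_form_diag(Omega, sigma));
  // ... likelihood linking beta[k] to the data goes here ...
}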
Last, Bob mentioned that the WordPress blog theme that I’m using — beautiful as it is — makes code and equations hard to read, and I have to agree. He gave me a few leads, and I’ll see what I can do with the free (no CSS, limited options) WordPress themes. I really, really like the overall appearance of the current theme (Reddle), so if I can implement a better code environment I will.
Bob also added:

Cool model and one that’s interesting to us application-wise. We’ve put in some pre-proposals to DoE to do some work like this, but they’ve shot them down in the pre-proposal stage.

I think the saying is something like “it’s easier to optimize debugged code than to debug optimized code”.

When I said you get the full covariance of slopes and intercepts, I just meant in terms of the correlations in the posterior inferences. If you want to build a hierarchical model with covariance priors, that’s a different model that directly accounts for the correlation of the parameters in their prior. Depending on the amount of data you have and the strength of the correlations, it may or may not make much of a difference.

@Bob: Very true. Thanks for your work on Stan and your insights here! I’ve been working with this electricity data for several years now, and it’s one of the data sets that I try new things on. (I recently applied for a job at Opower, a company that works with consumer electricity data, but it doesn’t look like that’s going to happen.)