In Part 4 of this series, I created a Bayesian model in Stan. A member of the Stan team, Bob Carpenter, has been so kind as to send me some comments via email, and has given permission to post them on the blog. This is a great opportunity to get some expert insights! I’ll put his comments as block quotes:
That’s a lot of iterations! Does the data fit the model well?
I hadn’t thought about this. Bob noticed in both the call and the summary results that I’d run Stan for 300,000 iterations, and it’s natural to wonder, “Why so many iterations? Was it having trouble converging? Is something wrong?” The stan command defaults to four chains of 2,000 iterations each, and one of the strengths of Stan’s HMC algorithm is that although each iteration is a bit slower than in other methods, the chains mix much faster, so you don’t need nearly as many iterations. So 300,000 is a bit excessive. It turns out that if I run for 2,000 iterations, it takes 28 seconds on my laptop and mixes well. Most of those 28 seconds go to compiling the model, since I can get 10,000 iterations in about 40 seconds.
So why 300,000 iterations? For the silly reason that I wanted lots of samples to make the CIs in my plot as smooth as possible. Stan is pretty fast, so it only took a few minutes, but I hadn’t thought about the implication of appearing to need that many iterations to converge.
It might mix a bit better with a variable transform on the normal priors (see the manual chapter that includes
the funnel example). It might also be handy to scale the predictors so the scales of the alphas and betas aren’t so high.
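For reference, I believe the transform Bob means is the “non-centered parameterization” described alongside the funnel example in the manual. A minimal sketch of the idea, with hypothetical parameter names:

```stan
parameters {
  real mu;                // hyperprior location
  real<lower=0> tau;      // hyperprior scale
  real alpha_raw;         // sampled on a fixed unit scale
}
transformed parameters {
  // alpha is distributed as normal(mu, tau), but the sampler
  // only ever explores the well-behaved alpha_raw
  real alpha = mu + tau * alpha_raw;
}
model {
  alpha_raw ~ normal(0, 1);
}
```

Sampling alpha_raw on a fixed unit scale avoids the funnel-shaped posterior that arises when a parameter’s scale is itself a parameter.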
The main loop could be vectorized a bit painfully by recognizing there are only three values of ratetemp[n]. Sorting each into its own vector at the get go (it’s known in the transformed data block) and then vectorizing should speed up the code considerably.
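If I follow the suggestion correctly, the sorted-and-vectorized version would look roughly like this (a sketch; the names, and the simplification of presorting the three groups outside Stan and passing them as data rather than sorting in the transformed data block, are mine):

```stan
data {
  int<lower=0> N1;
  int<lower=0> N2;
  int<lower=0> N3;
  vector[N1] x1;  vector[N1] y1;  // observations presorted by ratetemp
  vector[N2] x2;  vector[N2] y2;
  vector[N3] x3;  vector[N3] y3;
}
parameters {
  vector[3] alpha;
  vector[3] beta;
  real<lower=0> sigma;
}
model {
  // one vectorized sampling statement per ratetemp group,
  // replacing a conditional inside a loop over every observation
  y1 ~ normal(alpha[1] + beta[1] * x1, sigma);
  y2 ~ normal(alpha[2] + beta[2] * x2, sigma);
  y3 ~ normal(alpha[3] + beta[3] * x3, sigma);
}
```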
As Bob also says, “Definitely better to start with clear code and then optimize later.” And now that I’ve started with (hopefully) clear code, further improvements can be implemented. The current model doesn’t need a dramatic speedup, but it’s better to practice the techniques on a straightforward model now, rather than waiting until I create a complex model that desperately needs optimization. I believe I can scale the predictors in the transformed data section, and then compensate for the difference it makes in the parameters in a generated quantities section, so I still see what I want to see. (I could do this in R, too, but would prefer to do it in Stan if possible, so that I hand Stan the raw data and get back results on the scale of the raw data.)
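A sketch of what I have in mind (variable names hypothetical; assuming a single predictor x of length N): standardize the predictor in transformed data, fit on the standardized scale, then undo the transform in generated quantities so the reported parameters are on the raw scale.

```stan
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  real x_mean = mean(x);
  real x_sd = sd(x);
  vector[N] x_std = (x - x_mean) / x_sd;  // standardized predictor
}
parameters {
  real alpha_std;
  real beta_std;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha_std + beta_std * x_std, sigma);
}
generated quantities {
  // undo the standardization:
  // alpha_std + beta_std * (x - x_mean) / x_sd
  //   = (alpha_std - beta_std * x_mean / x_sd) + (beta_std / x_sd) * x
  real beta = beta_std / x_sd;
  real alpha = alpha_std - beta_std * x_mean / x_sd;
}
```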
You could also tighten up the priors, which would probably speed everything up a bit.
This is good news, since I’d initially been a bit nervous that my priors were too tight, and had spread them out a bit. However, when I replied that I’d actually enlarged some of the priors until there was little effect on the posterior, hoping for something like a minimally informative prior, he replied:
The priors should be realistic, not set up so as to have no effect.
Which has gotten me thinking: what should the priors be? For the intercepts, each alpha is a dollar amount: one at around 55 degrees F, and one at around 70 degrees F. I’m thinking I might calculate tolerance intervals for the data and go from there. For the slopes (the betas), I’ll have to think a bit more.
I said that I thought a more-complete model would explicitly account for correlation between slopes and intercepts, to which Bob replied:
If you do full Bayesian posteriors, you get the appropriate covariance in the inferences. Or did you want a more complex prior that uses something like a covariance matrix for the two terms? Andrew discusses models like that in his regression book with Jennifer Hill.
Which is reassuring. Simple is better. In a followup, he mentioned how I might add covariance into the model, if I wanted:
You can do the covariance for slope and intercept in Stan. I think it’s more of a matter of how much data you have as to whether you’ll be able to estimate the covariance matrix. It’s very easy. If beta is a 2D vector, instead of saying:
beta[k,1] ~ normal(mu,sigma);
beta[k,2] ~ normal(mu,sigma);
or equivalently, with vectorization:
beta ~ normal(mu,sigma);
and with vectorized priors:
mu ~ normal(0,2);
sigma ~ cauchy(0,2.5);
you say (we don’t have multivariates vectorized yet):
for (k in 1:K)
  beta[k] ~ multi_normal(mu, diag_matrix(sigma) * Omega * diag_matrix(sigma));
and for prior, I’d suggest a scaled version of the LKJ correlation prior,
sigma ~ cauchy(0,2.5); // sd as before
Omega ~ lkj_corr(4.0); // plus correlation matrix
The prior for the correlation matrix Omega will favor the identity matrix.
This kind of model is how I finally came to understand multivariate normals.
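Assembling Bob’s pieces with their declarations, the whole thing might look like this (my sketch, not his code; K and the names are assumed):

```stan
data {
  int<lower=1> K;
}
parameters {
  array[K] vector[2] beta;    // intercept and slope for each of K groups
  vector[2] mu;               // population means
  vector<lower=0>[2] sigma;   // population scales
  corr_matrix[2] Omega;       // correlation between intercept and slope
}
model {
  mu ~ normal(0, 2);
  sigma ~ cauchy(0, 2.5);
  Omega ~ lkj_corr(4.0);
  for (k in 1:K)
    // quad_form_diag(Omega, sigma) is diag_matrix(sigma) * Omega * diag_matrix(sigma)
    beta[k] ~ multi_normal(mu, quad_form_diag(Omega, sigma));
}
```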
Last, Bob mentioned that the WordPress blog theme that I’m using — beautiful as it is — makes code and equations hard to read, and I have to agree. He gave me a few leads, and I’ll see what I can do with the free (no CSS, limited options) WordPress themes. I really, really like the overall appearance of the current theme (Reddle), so if I can implement a better code environment I will.