Electricity Usage in a High-rise Condo Complex pt 4

This is the fourth article in the series, where the techiness builds to a crescendo. If this is too statistical/programming geeky for you, the next posting will return to a more investigative and analytical flavor. Last time, we looked at a fixed-effects model:

m.fe <- lm (dollars ~ 1 + regime + ratetemp * I(dca - 55))

which looks like a plausible model and whose parameters are all statistically significant. A question that might arise is: why not use a hierarchical (AKA multilevel, mixed-effects) model instead? While we're at it, why not go full-on Bayesian as well? It just so happens that there is a great new tool called Stan that fits the bill, and it comes with an rstan package for R.


Stan is developed by the great folks on the Stan development team and has been widely promoted by statistician Andrew Gelman. It was originally created to fit some models faster than BUGS or Martyn Plummer's JAGS, using Hamiltonian Monte Carlo (HMC) rather than Gibbs sampling (in particular, an algorithm called NUTS, the No-U-Turn Sampler).

So let’s get started! With Stan and rstan installed, we specify our model in R:
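(In outline it looks something like the following; the parameter and hyperparameter names, the flat hyperpriors, and the independent group-level intercepts and slopes are placeholders for this sketch rather than exact choices.)

stan.code <- "
data {
  int<lower=1> N;                    // number of monthly observations
  real cut;                          // temperature set point (55 degrees here)
  real<lower=0> nu;                  // d.f. for the Student-t alternative (unused in the normal fit)
  vector[N] dollars;                 // monthly electricity cost
  vector[N] dca;                     // average monthly temperature at DCA
  int<lower=1,upper=2> rate[N];      // seasonal rate
  int<lower=1,upper=5> regime[N];    // usage regime, R1 through R5
}
transformed data {
  vector[N] dcac;                    // temperature centered at the set point
  int<lower=1,upper=4> ratetemp[N];  // rate crossed with above/below the set point
  for (n in 1:N) {
    dcac[n] <- dca[n] - cut;
    if (dca[n] < cut)
      ratetemp[n] <- rate[n];
    else
      ratetemp[n] <- rate[n] + 2;
  }
}
parameters {
  real mu;                       // overall intercept
  vector[5] a_regime;            // per-regime offsets
  vector[4] b0;                  // per-ratetemp intercept adjustments
  vector[4] b1;                  // per-ratetemp temperature slopes
  real<lower=0> sigma;           // residual scale
  real<lower=0> sigma_regime;    // hierarchical scale of the regime offsets
  real<lower=0> sigma_b0;        // hierarchical scale of the intercept adjustments
  real mu_b1;                    // hierarchical mean of the slopes
  real<lower=0> sigma_b1;        // hierarchical scale of the slopes
}
model {
  // hierarchical priors; hyperpriors are left implicit (flat) in this sketch
  a_regime ~ normal(0, sigma_regime);
  b0 ~ normal(0, sigma_b0);
  b1 ~ normal(mu_b1, sigma_b1);
  for (n in 1:N) {
    dollars[n] ~ normal(mu + a_regime[regime[n]] + b0[ratetemp[n]]
                        + b1[ratetemp[n]] * dcac[n], sigma);
    // dollars[n] ~ student_t(nu, mu + a_regime[regime[n]] + b0[ratetemp[n]]
    //                        + b1[ratetemp[n]] * dcac[n], sigma);
  }
}
"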

which is a hierarchical version of our fixed-effects model. For those familiar with lmer's formulas, this would be specified as something like lmer (dollars ~ 1 + (1 | regime) + (1 + I(dca-cut) | ratetemp)).
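(A runnable version of that comparison, assuming the lme4 package and the same elect5 data frame with its ratetemp column, with 55 standing in for cut and m.hier as a placeholder name, would be something like this.)

library (lme4)
# approximate frequentist analogue of the hierarchical model, for comparison only
m.hier <- lmer (dollars ~ 1 + (1 | regime) + (1 + I(dca - 55) | ratetemp), data=elect5)
summary (m.hier)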

The data section specifies the data that will be passed to Stan. The transformed data section is where I center the temperature (dca) at our 55-degree set point (the set point is actually an input to Stan) and create ratetemp, which combines the seasonal rate with whether the average monthly temperature is above or below the set point. I could pre-center dca in R before passing it to Stan, but doing the centering in Stan keeps everything together in the model for clarity. Similarly, I already have a ratetemp variable in R, but if I ever modify the cut point, the temperature half of ratetemp would change too, so keeping that calculation in Stan as well is clearer and less error-prone.
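(For reference, the R-side alternative I'm avoiding would be something like the following, with 55 as the cut point and column/label names chosen just for illustration.)

# the R-side equivalent of the transformed data block (not what I actually do)
elect5$dcac <- elect5$dca - 55
elect5$ratetemp <- interaction (elect5$rate, ifelse (elect5$dca < 55, "cool", "warm"))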

The parameters section declares the modeled parameters, whose posterior distributions Stan estimates via MCMC sampling. Last, the model section specifies how the data and the parameters are related. Note that this is a Bayesian model, where the parameters have priors and we're calculating the posterior distributions of those parameters. It's also a hierarchical model, where some of the priors are themselves modeled, though not a nested hierarchical model, since regime is not nested in ratetemp or vice versa. Note that unlike BUGS and JAGS, the Stan code is actually compiled and executed; it's not just a kind of model specification. The C++-style comment in the for loop is a commented-out, more robust Student's-t-based alternative fit.

I started with a simple regression model, tested it, compared it to lm or lmer, and then expanded it until I reached the model I wanted. Hopefully, you can see how powerful modeling in Stan is. Now that the model is specified, we put the data into a list (in R):

stan.list <- list (cut=55, nu=9, N=length (elect5$dca), dollars=elect5$dollars, dca=elect5$dca, rate=as.integer (elect5$rate), regime=as.integer (elect5$regime))

which supplies all of the variables specified in the data section of the model. (Actually, it also supplies nu, which is not used in the current model, but Stan has no problem with unused variables.) To translate the model into C++, compile it, and run the sampler, we use stan:

library (rstan)
stan.model <- stan (model_code=stan.code, model_name=paste ("Electricity", stan.list$cut), data=stan.list, iter=300000)

which will print out updates as the code is compiled, and then at every 10% of the way through each chain's sampling. (The default is to run 4 chains, each starting at a different initial value.) The result is displayed with stan.model:

The values are shown to only one significant digit to keep the table compact; to display more digits, you can use print (stan.model, digits_summary=3). The Rhat values look like things converged, but we can run several diagnostic checks (there may be more efficient ways, but this worked for me):
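(Something along these lines, using rstan's traceplot plus the coda package, covers the usual checks; the parameter names follow the model sketch above.)

library (coda)
traceplot (stan.model, pars=c ("mu", "sigma"), inc_warmup=FALSE)  # chains should overlap and mix well
stan.mcmc <- As.mcmc.list (stan.model)        # convert the fit to coda's mcmc.list format
gelman.diag (stan.mcmc, multivariate=FALSE)   # potential scale reduction factors (want values near 1)
effectiveSize (stan.mcmc)                     # effective sample sizes per parameter
geweke.diag (stan.mcmc)                       # within-chain stationarity check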

All of these seem to confirm that the sampling converged properly and that the results above are meaningful. The plot at the top of this posting reflects the results of this model. I calculated the 95% predictive credible intervals based on the current regime (R5) with:
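(In outline, the simulation is roughly the following, with parameter names and the ratetemp coding as in the model sketch above, the winter rate shown, 55 as the cut point, and R5 assumed to be index 5.)

post <- extract (stan.model)          # posterior draws, one element per parameter
dca.grid <- seq (20, 90, by=1)        # average monthly temperatures to predict at (degrees F)

# simulate predicted bills for the current regime (R5), drawing fresh residual
# noise so the interval is predictive rather than just an interval for the mean
pred <- sapply (dca.grid, function (t) {
  rt <- if (t < 55) 1 else 3          # ratetemp index for this temperature (winter rate)
  mu.pred <- post$mu + post$a_regime[, 5] + post$b0[, rt] + post$b1[, rt] * (t - 55)
  sim <- rnorm (length (mu.pred), mu.pred, post$sigma)
  quantile (sim, probs=c (0.025, 0.5, 0.975))   # lower, median, upper of the 95% interval
})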

The lines may not look centered on the data, since the regime offsets varied from -77 (R1) to around 38 (R2 and R4), and R5's offset is about 6. (Which, by the way, says that the regimes have varied by about ±6% from the current regime.) Simulation like this is about the only way to get credible intervals that account for all of the uncertainties in the model. (It doesn't account for uncertainty in the data, such as the temperature, though.)

Hopefully this provides a solid baseline model and has shown you how powerful and flexible the Bayesian tools R and Stan can be. In the next post, we'll look a bit more closely at some of the results and see what we've gained by going Bayesian.
