One of the great new features in Stata 13 is a command called forecast. This is not just another version of predict; it’s more like a management and dependency tool for a whole forecast system. You give it one or more regressions and deterministic equations, and forecast takes your exogenous variables, pulls their values from your data set, and feeds them into the equations and regressions that use them. It then takes the resulting endogenous variables and feeds them into the equations and regressions that use those, chaining together a whole multi-part forecast. It also has tools for testing alternative scenarios, for inserting shocks and other modifications of endogenous variables, and for calculating confidence intervals of the system via simulation.
As an example, I have another series of postings where I analyzed the electricity usage and expenditures for a high-rise condo complex. (Here, I’ll gloss over details of variables and interactions, but you can find out more in the original series.) In order to forecast, my exogenous variables must all be available as forecasts somewhere, and since a major driver of electricity usage is temperature, that’s what my model is based on. To keep things simple, I originally modeled expenditures as a single hierarchical regression with exogenous variables: a month’s average temperature and a high/low-season indicator. This works out reasonably well, and you can read more about it in the other series.
But this regression doesn’t really break the expenditure down in a way that resembles the actual bill. The three major drivers of a bill are the daily usage (kWh), the month’s demand (kW), and the seasonal rate (Jun-Sep is high season). The daily usage and the demand are mainly driven by the temperature, so I’d like to create three regressions, with the daily and demand regressions feeding into the dollars regression:
sureg (daily temp#c.dca55) (demand temp#c.(dca55 L.daily)) if daily < .
estimates store dailydemand
reg dollars rate#c.(daily demand) if daily < .
estimates store dollars
Here I combined daily and demand into a seemingly-unrelated regression (SUR) because in the best case the calculations would benefit from daily and demand experiencing similar shocks and thus having correlated error terms. In the worst case, I wouldn’t gain anything over doing two separate regressions.
These two sets of estimates (dailydemand and dollars) are completely independent, and they both pull all of their right-hand-side variables from the data set. I can work with and evaluate either one until I’m satisfied with their individual results, at which point I save the estimates for forecast. Then we combine this all together into a forecast and identify the exogenous variables:
forecast create dollars
forecast estimates dailydemand
forecast estimates dollars
forecast exogenous dca55
forecast exogenous rate
forecast exogenous temp
Here the forecast is called dollars, in the forecast name space. (So we now have a dollars variable, a saved dollars regression estimate, and a dollars forecast, each of which could have had different names if we wanted.)
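Before solving, it can help to check what the model actually contains. This is a minimal sketch, assuming the dollars forecast model built above is the one currently in memory:

```stata
* Assumes the forecast model created above is the one currently in memory.
forecast describe      // summarize the components of the model
forecast list          // list the forecast commands that built the model
```

The output of forecast list is a convenient record of how the model was assembled, which is handy when you come back to it later.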
We could have included deterministic equations (e.g. calculating the log of the demand), but for now we’re just grouping these regressions together. To do a forecast starting with January 2011, saving all of the endogenous variable forecasts into new variables of the same name prefixed with “f_” (f_dollars, etc.), we do:
forecast solve, begin(tm(2011m1)) prefix(f_)
This may seem straightforward, but Stata has actually done several things that make this much more powerful than two separate regressions. Remember that the regressions mentioned six variables (daily, demand, dollars, temp, dca55, and rate), of which we designated three as exogenous (temp, dca55, and rate), leaving the other three as endogenous. In calculating the forecast, Stata read values of the exogenous variables from the data set, as necessary at each time step, but ignored any values of endogenous variables. The endogenous variables were instead forecast by the regressions.
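The deterministic equations mentioned earlier would be added with forecast identity. This is only a sketch, assuming the model above is in memory. The name lndemand is hypothetical and not part of my original model, and nothing downstream uses it here:

```stata
* lndemand is a hypothetical name used only for illustration.
* The generate option also creates lndemand in the data set.
forecast identity lndemand = ln(demand), generate

* Solve again under a different prefix so the earlier f_* results are kept.
forecast solve, begin(tm(2011m1)) prefix(g_)
```

An identity becomes one more endogenous variable: it is recomputed from the forecast demand at each time step rather than read from the data set.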
forecast keeps track of the variables it creates (the f_* variables in this case, but more in the next example), and if we want to drop them before another forecast, we can use forecast drop. If we want to forecast beyond our data, we would use tsappend to append future months, then enter forecasted exogenous variables (temperature and rate) and:
forecast solve, prefix(f_) begin(tm(2013m7)) simulate(betas errors, statistic(stddev, prefix(sd_)) reps(100))
which will do the forecast and also calculate standard deviations of the forecasted variables via 100 simulations.
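Put together, an out-of-sample run with an approximate band might look like the sketch below. It rests on several assumptions: the data are tsset on a monthly variable called date and end in 2013m6; future_exog.dta is a hypothetical file holding date, temp, dca55, and rate for the appended months; and the ±1.96 band treats the simulated forecasts as roughly normal.

```stata
* Assumes the forecast model above is in memory and the data are tsset on date.
capture forecast drop                // clear results of any earlier solve
tsappend, add(8)                     // append eight empty future months

* Exogenous variables must be filled in over the forecast horizon.
* future_exog.dta is a hypothetical file with date, temp, dca55, and rate.
merge 1:1 date using future_exog, update nogenerate
tsset date                           // restore the time-series sort order

forecast solve, prefix(f_) begin(tm(2013m7)) ///
    simulate(betas errors, statistic(stddev, prefix(sd_)) reps(100))

* Approximate 95% band, assuming roughly normal simulated forecasts
generate f_dollars_lb = f_dollars - 1.96*sd_dollars
generate f_dollars_ub = f_dollars + 1.96*sd_dollars

twoway (rarea f_dollars_lb f_dollars_ub date if tin(2013m7,), color(gs14)) ///
       (line dollars f_dollars date if tin(2012m1,)),                      ///
       legend(order(1 "95% band" 2 "Actual" 3 "Forecast"))
```

The band is drawn only over the forecast months, since that is where the simulated standard deviations are meaningful.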
You can obviously try out different scenarios by providing different values for the exogenous variables. But you can also override predictions of endogenous variables in a couple of ways. One way is to use the actuals option, which will override predictions of endogenous variables whenever their values are not missing (.) in the data set. You can also use forecast adjust to conditionally modify endogenous forecasts, adding shocks or other manipulations to see how they propagate through the system.
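As a rough sketch of both approaches, assuming the forecast model above is in memory and the time variable is named date (the shock of 10 to demand in 2012m7 is a made-up number for illustration only):

```stata
* Scenario 1: use recorded values of endogenous variables wherever they exist
forecast solve, prefix(a_) begin(tm(2011m1)) actuals

* Scenario 2: a hypothetical shock to demand in a single month
forecast adjust demand = demand + 10 if date == tm(2012m7)
forecast solve, prefix(s_) begin(tm(2011m1))
```

Using a different prefix for each scenario keeps the results side by side, so a_dollars and s_dollars can be compared directly against the baseline f_dollars.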
It’s very impressive, really. The only downside is that only about half of Stata’s regression commands can be used with forecast, though this includes the biggies like OLS, GLM, ARIMA, ARCH, GARCH, VAR, VEC, IV regression, SUR, and non-linear regression.
The graph at the top of this posting shows my forecast for the next eight months, based on NOAA’s long-term temperature forecast for the area.
Hey, great post! For in-sample forecasting the Stata option for the predict command is stdp which calculates the standard error. However, to calculate out of sample we are using forecast solve and the statistic option is standard deviation instead. Do you have any insight on why the change?
Just wondering whether you would be happy to share your stata code for the Electricity Expenditure graph. I have not been successful graphing my confidence intervals, and I don’t think my below attempt is correct:
twoway (tsline cases)(tsrline f_cases sd_cases, recast(rarea)) //plot forecast with 95% CI
I can’t find the exact graph I made, but I have done similar things with:
ucm dollars under55 over55a over55b ib5.regime if regime > 1, model(llevel) cformat(%4.2f) pformat(%4.2f)
predict pred8, rmse(pred8rmse) dynamic(tm(2013m11))
gen pred8ub = pred8 + 1.96 * pred8rmse
gen pred8lb = pred8 - 1.96 * pred8rmse
twoway rarea pred8lb pred8ub date if tin(2012m1,), color(gs14) ytitle(Average Dollars per Day) title(Average Daily Electricity Costs) ///
note(ucm prediction as of October 2013) legend(row(1) order (2 3 1) lab(1 "95% CI") lab(2 "Actual") lab (3 "Prediction")) ///
|| line dollars pred8 pred8t date if tin(2012m1,), name(pred8)
Thanks so much for sharing Wayne (sorry for the delay, I only just got around to redoing this!).