I do a lot of my work in R, and it turns out that not one but two R packages have recently been released that enable R users to use the famous Python-based deep learning package, Keras: keras and kerasR.
The first thing you’ll need to do is to make sure you have Python 3 on your system. Windows generally doesn’t have Python at all, while Macs and some Linux systems will include Python 2.7. The way I’d recommend to get Python 3 is to download and install Anaconda. (Few people do data science with a base-installed Python: they mostly use Anaconda, which pre-packages a lot of data science tools along with Python. Trying to install all of those tools yourself is … difficult.)
The usual place to install Anaconda is in your home directory. Then, on a Mac or Linux system, add ~/anaconda/bin: to the beginning of your PATH environment variable. For example, on my system, I edited the .profile file in my home directory, and the beginning of the PATH line looks like export PATH=~/anaconda/bin:/usr/local/bin:, so that when I type python the system will use the one provided by Anaconda.
One of the nice things about Anaconda is that it provides an enhanced package loading system, conda. This is similar to R’s CRAN in some sense. But to install Keras, we’ll use Python’s pip, which is similar to R’s devtools and is run from a command line (not from within Python):
pip install keras tensorflow pydot graphviz
should do the trick. The last two packages allow you to plot the model, though the plot is fairly boring.
Of the two packages (keras and kerasR), I’ve started using kerasR because it has some nice tutorials and it’s on CRAN, so that’s what I’ll use here. (keras is installed via devtools.)
In R, install packages kerasR and reticulate from CRAN. Loading kerasR requires an extra step beyond the usual library call:
reticulate::use_python ("~/anaconda/bin/python")
library (kerasR)
The use_python call tells R where to find the Python 3 that you’re using, the one where you also pip-installed keras. If the library call doesn’t think for a moment and then return “successfully loaded keras”, something is wrong, and you’ll get a bunch of Python error messages which will give you a clue, if you examine them carefully. If you forget the use_python call, R will look at the base-installed Python and won’t be able to import keras in Python. The things that can go wrong are myriad, and I can’t really be more specific, unfortunately.
If you got this far, you can now follow the kerasR
tutorial. At the moment (03 June), there is an error in the first model of the tutorial (Boston Housing) that causes it to perform poorly. I’ve submitted a trouble ticket, and the code I’d recommend is:
mod <- Sequential ()
mod$add (Dense (units=50, input_shape=13))
mod$add (Activation ("relu"))
mod$add (Dense (units=1))
# mod$add (Activation ("relu"))
keras_compile (mod, loss="mse", optimizer=RMSprop())
boston <- load_boston_housing ()
X_train <- scale (boston$X_train)
Y_train <- boston$Y_train
X_test <- scale (boston$X_test, center=attr (X_train, "scaled:center"), scale=attr (X_train, "scaled:scale"))
Y_test <- boston$Y_test
keras_fit (mod, X_train, Y_train, batch_size=32, epochs=200, verbose=1, validation_split=0.1)
pred <- keras_predict (mod, X_test)
sd (as.numeric (pred) - Y_test) / sd (Y_test)
plot (Y_test, pred)
abline (0, 1)
Keras works a bit differently from the way R usually works in that mod$add modifies the model mod directly, in place. Sequential () first creates an empty model, and the mod$add calls then add a layer with 50 hidden units, a “relu” activation function, and finally the 1-unit output layer.
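To make the in-place behavior concrete, here is a minimal base-R sketch. It does not use Keras at all: an R environment stands in for the Python model object, and the helper names (add_layer_copy, add_layer_inplace) are hypothetical, made up for this illustration.

```r
# Ordinary R objects are copied when a function modifies them...
add_layer_copy <- function (layers, name) {
  layers <- c(layers, name)        # changes a local copy only
  invisible (layers)
}
layers <- character (0)
add_layer_copy (layers, "dense")
length (layers)                    # still 0: the caller's object is unchanged

# ...while an environment is modified in place, much as mod$add changes mod.
model <- new.env ()
model$layers <- character (0)
add_layer_inplace <- function (m, name) {
  m$layers <- c(m$layers, name)    # changes the caller's object
  invisible (NULL)
}
add_layer_inplace (model, "dense")
add_layer_inplace (model, "relu")
model$layers                       # both layers were recorded
```

This is why the mod$add lines above are not assigned back to mod: the model object itself is changed by each call.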
This is pretty much a simple Hello World model with 13 input variables and one hidden layer with 50 units. You could have made the same model in older R neural net packages, but Keras has many advantages.
In the tutorial (as of 03 June), the R scale
‘s of the X training and testing data were independent and not linked. In this case, I scale the training data and then use the same center
and scale
for the testing data, just as you would when you deploy a model: training represents the data we already have, while testing represents new data arriving in the future. (This is a pet peeve on my part, and not generally important.)
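A small base-R sketch of the difference, using simulated data rather than the Boston Housing set (the matrices and their means are hypothetical): when the test data are scaled independently, any shift between training and test data is silently erased.

```r
set.seed (1)
# Hypothetical stand-ins for training data and later-arriving test data
X_train <- matrix (rnorm (200, mean=10, sd=3), ncol=2)
X_test  <- matrix (rnorm (40,  mean=15, sd=3), ncol=2)

X_train_s <- scale (X_train)
ctr <- attr (X_train_s, "scaled:center")
scl <- attr (X_train_s, "scaled:scale")

# Linked: reuse the training center and scale, as in the code above
X_test_linked <- scale (X_test, center=ctr, scale=scl)
# Independent: the test data get their own center and scale
X_test_indep  <- scale (X_test)

colMeans (X_test_linked)   # clearly positive: the shift relative to training survives
colMeans (X_test_indep)    # essentially 0: the shift has been erased
```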
More importantly, the tutorial also accidentally applied a second normalization to the data in the prediction step, which would drive it in a different direction from the training data. The version, above, has results that look pretty reasonable:
This example isn’t Deep Learning(tm) and you could’ve done this with older R neural net packages, but it’s just the start of Keras exploration. Follow the kerasR
tutorials and the links they recommend. For more details on what the various kerasR
functions do, check out the Keras pages. (Remembering that kerasR
doesn’t implement everything in Keras itself.)
For a very readable explanation of Deep Learning architectures, first read Neural Network Zoo Prequel, and then Neural Network Zoo, by the Asimov Institute.
One of the advantages of Keras is that it’s built on TensorFlow, which takes full advantage of clusters of machines, GPUs, and all those other things that make Google Google. Unfortunately, by “GPU” we mean Nvidia GPUs (i.e. GPUs that support CUDA). My Mac laptop has an AMD graphics chip, so I can’t use GPUs, though I can still develop things on my laptop and then someday spin up things on Amazon Web Services and use GPU-based instances.
brms
package, available from CRAN. In case you haven’t heard of it, brms
is an R package by Paul-Christian Buerkner that implements Bayesian regression of all types using an extension of R’s formula specification that will be familiar to users of lm
, glm
, and lmer
. Under the hood, it translates the formula into Stan code, Stan translates this to C++, your system’s C++ compiler is used to compile the result and it’s run.
brms
is impressive in its own right. But also impressive is how it continues to add capabilities and the breadth of Buerkner’s vision for it. I last posted something way back on version 0.8, when brms
gained the ability to do non-linear regression, but now we’re up to version 1.1, with 1.2 around the corner. What’s been added since 0.8, you may ask? Here are a few highlights:
You can now define functions in Stan, using stan_funs, and can use them in your formula or use them like a native R function (via expose_functions), which is pretty exciting. A huge addition is gam-style smoothers, allowing us to implement GAMMs (Generalized Additive Mixed Models), including spatial GAMMs (via MRF smoothers). Interval censoring has been added to the already-existing left and right censoring. You can use monotonic effects in ordinal regression, and both monotonic and cse (category-specific) effects can be used at the individual or group levels. It now supports distributional regression models, allowing modeling of things like the heterogeneity of variances. There are many efficiency improvements, and you can now use IDs to specify that multiple groups share the same group-level effects across formulas. The von_mises family has been added to allow for circular regression. (I’ve had some difficulty making it work, I have to admit.) Last, the graphing gets better and better (it was also updated to work with ggplot2 version 2.2), and many of the plots will soon (brms 1.2) use package bayesplot.
Buerkner’s To Do list includes items like CAR (Conditional Auto Regression) models which are state-of-the-art for spatial regression, mixture models, and errors-in-variables models. Yeah, he doesn’t rest on his laurels.
The flexibility of brms
formulas allows you to create sophisticated models easily. For example, using cens
and the Weibull family allows you to create (AFT) survival models. Add in a random effect and you have a frailty model — no need to learn something dramatically new. If you reach brms
limits, you may be able to dip your toe into Stan and include a Stan function via stan_funs
. If that’s not good enough, you can extract the Stan model, via stan_model
and then modify it and use the rstan
package and go full-fledged Stan. (brms
implements much simpler, more human-like Stan models than rstanarm
, which makes it much more practical to build on a reliable base of code.)
But you will use brms
for a long time before you need to delve into Stan. You can do lm
-style models, you can add a family to do GLMs (including logistic regression, categorical or ordinal logistic regression, Poisson, Lognormal, zero-inflated Negative Binomial, etc), you can add in random effects to create a mixed-effects model, you can add in smoothers to create GAMs, you can add smoothers and random effects to create GAMMs, you can handle censored variables and use the appropriate family (Weibull, etc) to create survival models, and you could take your survival model and add a random effect to get a frailty model. All without doing anything radically different from specifying an lmer
model.
Check out brms
on CRAN!
It’s full of great quotes like this:
“… In those less-hyped times, the skills being touted today were unnecessary. Instead, scientists developed skills to solve the problem they were really interested in, using elegant mathematics and powerful quantitative programming environments modeled on that math. Those environments were the result of 50 or more years of continual refinement, moving ever closer towards the ideal of enabling immediate translation of clear abstract thinking to computational results.
“The new skills attracting so much media attention are not skills for better solving the real problem of inference from data; they are coping skills for dealing with organizational artifacts of large-scale cluster computing. …”
Great stuff.
brms 0.8
, they’ve added non-linear regression. Non-linear regression is fraught with peril, and when venturing into that realm you have to worry about many more issues than with linear regression. It’s not unusual to hit roadblocks that prevent you from getting answers. (Read the Wikipedia links Non-linear regression and Non-linear least squares to get an idea.)
Given that you intend to venture forth, brms 0.8
can now help you. If you go to the Non-linear regression Wikipedia page, above, the first section, General describes the Michaelis-Menten kinetics curve, and there is a graph to the right. I eyeballed the graph and created the points of the curve with:
mmk <- data.frame (R=c(0, 0.04, 0.07, 0.11, 0.165, 0.225, 0.27, 0.31, 0.315),
                   S=c(0, 30, 70, 140, 300, 600, 1200, 2600, 3800))
plot (R ~ S, data=mmk)
According to the article, the formula is R = b1 * S / (b2 + S), with rate R, substrate concentration S, and parameters b1 and b2 to be estimated, and the way to implement this in brms is:
library (brms)
# Fit the Michaelis-Menten curve to the eyeballed points in mmk
fit2 <- brm (bf (R ~ b1 * S / (b2 + S), b1 + b2 ~ 1, nl=TRUE),
             data=mmk,
             prior = c(prior (gamma (1.5, 0.8), lb=0.001, nlpar = b1),
                       prior (gamma (1.5, 0.002), lb=0.001, nlpar = b2)))
summary (fit2)
The formula is just like the Michaelis-Menten formula, which is a little different from most regression formulas. The b1 + b2 ~ 1 part, together with nl=TRUE, specifies the modeling of the non-linear parameters. Priors must be placed on all non-linear parameters. If you’re wondering about what these priors look like, they’re quite broad and mainly serve to keep things positive. To see them:
curve (dgamma (x, 1.5, 0.002), -0.1, 3000, n=1000)
abline (h=0, col="grey")
abline (v=0.34, lty=3)
curve (dgamma (x, 1.5, 0.0005), -0.1, 10000, n=1000)
abline (h=0, col="grey")
abline (v=300, lty=3)
where the vertical lines indicate the values I eyeballed from the wikipedia graph. The model takes a bit of time to compile, but runs four chains rapidly, and the results from the summary
are:
 Family: gaussian (identity)
Formula: R ~ b1 * S/(b2 + S)
   Data: mmk (Number of observations: 9)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000
   WAIC: Not computed

Fixed Effects:
             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
b1_Intercept     0.34      0.01     0.33     0.35       1037    1
b2_Intercept   301.21     20.27   263.30   343.05       1084    1

Family Specific Parameters:
         Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma(R)     0.01         0        0     0.01        835 1.01
And the results for b1 and b2 look to be close to what I’d eyeballed, and we got a reasonable effective sample size and Rhat. We can plot our results with the new (in brms 0.8) marginal_effects function, and also plot the MCMC chains with plot (fit2). (The latter graph is included at the top of this posting.)
plot (marginal_effects (fit2), ask = FALSE)
plot (fit2)
The trace plot is a bit clumpier than trace plots we’ve looked at before, but then again we only have 9 data points. There are many details about non-linear regression that I don’t know much about (biases of sigma, perhaps, and others), so this isn’t a tutorial, but hopefully it points you in the right direction if non-linear regression is a path you want to tread.
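As a quick sanity check on the Bayesian estimates, the same curve can be fit by ordinary non-linear least squares with base R's nls. This is a different technique from the brms fit (no priors, no sampling), and the starting values below are simply the eyeballed guesses, so treat it as a rough cross-check rather than a replacement.

```r
mmk <- data.frame (R=c(0, 0.04, 0.07, 0.11, 0.165, 0.225, 0.27, 0.31, 0.315),
                   S=c(0, 30, 70, 140, 300, 600, 1200, 2600, 3800))

# Least-squares fit of R = b1 * S / (b2 + S); start values are eyeballed guesses
fit_nls <- nls (R ~ b1 * S / (b2 + S), data=mmk, start=list (b1=0.3, b2=300))
coef (fit_nls)

# Fitted values at the observed S, for comparison with the data
cbind (mmk, fitted=round (fitted (fit_nls), 3))
```

If the least-squares estimates land near the posterior means reported above, that is some reassurance that the priors are not driving the result.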
Bayesian modeling is flexible because it’s really a generalized mechanism for probabilistic inference, so we can create almost any model you can imagine in a flexible tool like Stan
. And now brms
provides easy access to a very different kind of model than most of us are used to.
brms
and rstanarm
. Interestingly, both of these packages are elegant front ends to Stan, via rstan
and shinystan
.
This article describes brms
and rstanarm
, how they help you, and how they differ.
You can install both packages from CRAN, making sure to install dependencies so you get rstan
, Rcpp
, and shinystan
as well. [EDIT: Note that you will also need a C++ compiler to make this work, as Charles and Paul Bürkner (@paulbuerkner) have found. On Windows that means you’ll need to install Rtools
, and on the Mac you may have to install Xcode (which is also free). brms
‘s help refers to the RStan Getting Started, which is very helpful.] If you like having the latest development versions — which may have a few bug fixes that the CRAN versions don’t yet have — you can use devtools
to install them following instructions at the brms
github site or the rstanarm
github site.
brms package
Let’s start with a quick multinomial logistic regression with the famous Iris dataset, using brms. You may want to skip the actual brm call, below, because it’s so slow (we’ll fix that in the next step):
library (brms)
rstan_options (auto_write=TRUE)
options (mc.cores=parallel::detectCores ()) # Run on multiple cores
set.seed (3875)
ir <- data.frame (scale (iris[, -5]), Species=iris[, 5])
### With improper prior it takes about 12 minutes, with about 40% CPU utilization and fans running,
### so you probably don't want to casually run the next line...
system.time (b1 <- brm (Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width, data=ir,
                        family="categorical", n.chains=3, n.iter=3000, n.warmup=600))
First, note that the brm
call looks like glm
or other standard regression functions. Second, I advised you not to run the brm
because on my couple-of-year-old Macbook Pro, it takes about 12 minutes to run. Why so long? Let’s look at some of the results of running it:
b1 # ===> Result, below
 Family: categorical (logit)
Formula: Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width
   Data: ir (Number of observations: 150)
Samples: 3 chains, each with n.iter = 3000; n.warmup = 600; n.thin = 1;
         total post-warmup samples = 7200
   WAIC: NaN

Fixed Effects:
                Estimate Est.Error  l-95% CI u-95% CI Eff.Sample Rhat
Intercept[1]     1811.49   1282.51   -669.27  4171.99         16 1.26
Intercept[2]     1773.48   1282.87   -707.92  4129.22         16 1.26
Petal.Length[1]  3814.69   6080.50  -8398.92 15011.29          2 2.32
Petal.Length[2]  3848.02   6080.70  -8353.52 15032.85          2 2.32
Petal.Width[1]  14769.65  18021.35  -2921.08 54798.11          2 3.36
Petal.Width[2]  14794.32  18021.10  -2902.81 54829.05          2 3.36
Sepal.Length[1]  1519.97   1897.12  -2270.30  5334.05          7 1.43
Sepal.Length[2]  1515.83   1897.17  -2274.31  5332.95          7 1.43
Sepal.Width[1]  -7371.98   5370.24 -18512.35  -935.85          2 2.51
Sepal.Width[2]  -7377.22   5370.22 -18515.78  -941.65          2 2.51
A multinomial logistic regression involves multiple pair-wise logistic regressions, and the default is a baseline level versus the other levels. In this case, the last level (virginica
) is the baseline, so we see results for 1) setosa
v virginica
, and 2) versicolor
v virginica
. (brms
provides three other options for ordinal regressions, too.)
The first “diagnostic” we might notice is that it took way longer to run than we might’ve expected (12 minutes) for such a small dataset. Turning to the formal results above, we see huge estimated coefficients, huge error margins, a tiny effective sample size (2-16 effective samples out of 7200 actual samples), and an Rhat
significantly different from 1. So we can officially say something (everything, actually) is very wrong.
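For intuition about what Rhat measures, here is a minimal base-R sketch of the basic Gelman-Rubin statistic on simulated chains. The function name rhat_basic is made up for this illustration, and the calculation is the simple textbook version; the value Stan reports is computed in a more refined way, so the numbers will not match Stan's exactly.

```r
# Basic Gelman-Rubin Rhat for a matrix with one chain per column
rhat_basic <- function (chains) {
  n <- nrow (chains)
  W <- mean (apply (chains, 2, var))     # mean within-chain variance
  B <- n * var (colMeans (chains))       # between-chain variance
  var_hat <- (n - 1) / n * W + B / n     # pooled estimate of the posterior variance
  sqrt (var_hat / W)
}

set.seed (42)
mixed <- matrix (rnorm (3000), ncol=3)          # 3 chains sampling the same distribution
stuck <- sweep (mixed, 2, c(-5, 0, 5), "+")     # 3 chains sitting in different places

rhat_basic (mixed)   # close to 1
rhat_basic (stuck)   # far above 1
```

When the chains disagree about where the posterior is, the between-chain variance dwarfs the within-chain variance and Rhat moves well away from 1, which is the pattern in the results above.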
If we were coding in Stan
ourselves, we’d have to think about bugs we might’ve introduced, but with brms
, we can assume for now that the code is correct. So the first thing that comes to mind is that the default flat (improper) priors are so broad that the sampler is wandering aimlessly, which gives poor results and takes a long time because of many rejections. The first graph in this posting was generated by plot (b1)
, and it clearly shows non-convergence of Petal.Length[1]
(setosa
v virginica
). This is a good reason to run multiple chains, since you can see how poor the mixing is and how different the densities are. (If we want a nice interactive interface to all of the results, we could launch_shiny (b1)
.)
Let’s try again, with more reasonable priors. In the case of a logistic regression, the exponentiated coefficients reflect the multiplicative change in the odds for a unit increase of the variable, so let’s try using a normal (0, 8) prior (its 95% interval is roughly ±16 on the log-odds scale, which easily covers reasonable odds):
system.time (b2 <- brm (Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width, data=ir,
                        family="categorical", n.chains=3, n.iter=3000, n.warmup=600,
                        prior=c(set_prior ("normal (0, 8)"))))
This only takes about a minute to run — about half of which involves compiling the model in C++ — which is a more reasonable time, and the results are much better:
 Family: categorical (logit)
Formula: Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width
   Data: ir (Number of observations: 150)
Samples: 3 chains, each with iter = 3000; warmup = 600; thin = 1;
         total post-warmup samples = 7200
   WAIC: 21.87

Fixed Effects:
                Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept[1]        6.90      3.21     1.52    14.06       2869    1
Intercept[2]       -5.85      4.18   -13.97     2.50       2865    1
Petal.Length[1]     4.15      4.45    -4.26    12.76       2835    1
Petal.Length[2]    15.48      5.06     5.88    25.62       3399    1
Petal.Width[1]      4.32      4.50    -4.50    13.11       2760    1
Petal.Width[2]     13.29      4.91     3.76    22.90       2800    1
Sepal.Length[1]     4.92      4.11    -2.85    13.29       2360    1
Sepal.Length[2]     3.19      4.16    -4.52    11.69       2546    1
Sepal.Width[1]     -4.00      2.37    -9.13     0.19       2187    1
Sepal.Width[2]     -5.69      2.57   -11.16    -1.06       2262    1
Much more reasonable estimates, errors, effective samples, and Rhat
. And let’s compare a plot for Petal.Length[1]
:
Ahhh, that looks a lot better. The chains mix well and seem to randomly explore the space, and the densities closely agree. Go back to the top graph and compare, to see how much of an influence poor priors can have.
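To see how much room the normal (0, 8) prior used for b2 actually leaves, a few lines of base R convert its central 95% interval from the log-odds scale to odds ratios. This is only arithmetic on the prior, not anything produced by brms.

```r
# Central 95% interval of a normal (0, 8) prior, on the log-odds scale
log_odds <- qnorm (c(0.025, 0.975), mean=0, sd=8)
log_odds          # about -15.7 and 15.7

# The same interval expressed as odds ratios
exp (log_odds)    # from about 1.55e-07 up to about 6.5e+06
```

So the prior is still very permissive. It rules out the absurd coefficients in the thousands from the first fit while barely constraining anything plausible.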
We have a wide choice of priors, including Normal, Student’s t, Cauchy, Laplace (double exponential), and many others. The formula argument is a lot like lmer
‘s (from package lme4
, an excellent place to start a hierarchical/mixed model to check things before moving to Bayesian solutions) with an addition:
response | addition ~ fixed + (random | group)
where addition
can be replaced with function calls se
, weights
, trials
, cat
, cens
, or trunc
, to specify SE of the observations (for meta-analysis), weighted regression, to specify the number of trials underlying each observation, the number of categories, and censoring or truncation, respectively. (See details of brm
for which families these apply to, and how they are used.) You can do zero-inflated and hurdle models, specify multiplicative effects, and of course do the usual hierarchical/mixed effects (random
and group
) as well.
Families include: gaussian
, student
, cauchy
, binomial
, bernoulli
, beta
, categorical
, poisson
, negbinomial
, geometric
, gamma
, inverse.gaussian
, exponential
, weibull
, cumulative
, cratio
, sratio
, acat
, hurdle_poisson
, hurdle_negbinomial
, hurdle_gamma
, zero_inflated_poisson
, and zero_inflated_negbinomial
. (The cratio
, sratio
, and acat
options provide different options than the default baseline model for (ordinal) categorical models.)
All in one function! Oh, did I mention that you can specify AR, MA, and ARMA correlation structures?
The only downside to brms
could be that it generates Stan code on the fly and passes it to Stan via rstan
, which will result in it being compiled. For larger models, the 40-60 seconds for compilation won’t matter much, but for small models like this Iris model it dominates the run time. This technique provides great flexibility, as listed above, so I think it’s worth it.
Another example of brms
‘ emphasis on flexibility is that the Stan code it generates is simple, and could be straightforwardly used to learn Stan coding or as a basis for a more-complex Stan model that you’d modify and then run directly via rstan
. In keeping with this, brms
provides the make_stancode
and make_standata
functions to allow direct access to this functionality.
rstanarm package
The rstanarm package takes a similar, but different approach. Three differences stand out: First, instead of a single main function call, rstanarm has several calls that are meant to be similar to pre-existing functions you are probably already using. Just prepend with “stan_”: stan_lm, stan_aov, stan_glm, stan_glmer, stan_gamm4 (GAMMs), and stan_polr (Ordinal Logistic). Oh, and you’ll probably want to provide some priors, too.
Second, rstanarm
pre-compiles the models it supports when it’s installed, so it skips the compilation step when you use it. You’ll notice that it immediately jumps to running the sampler rather than having a “Compiling C++” step. The code it generates is considerably more extensive and involved than the code brms
generates, which seems to allow it to sample faster but which would also make it more difficult to take and adapt it by hand to go beyond what the rstanarm/brms
approach offers explicitly. In keeping with this philosophy, there is no explicit function for seeing the generated code, though you can always look into the fitted model to see it. For example, if the model were br2
, you could look at br2$stanfit@stanmodel@model_code
.
Third, as its excellent vignettes emphasize, Bayesian modeling is a series of steps that include posterior checks, and rstanarm
provides a couple of functions to help you, including pp_check
. (This difference is not as hard-wired as the first two, and brms
could/should someday include similar functions, but it’s a statement that’s consistent with the Stan team’s emphasis.)
As a quick example of rstanarm
use, let’s build a (poor, toy) model on the mtcars
data set:
library (rstanarm)
mm <- stan_glm (mpg ~ ., data=mtcars, prior=normal (0, 8))
mm # ===> Results
stan_glm(formula = mpg ~ ., data = mtcars, prior = normal(0, 8))

Estimates:
            Median MAD_SD
(Intercept) 11.7   19.1
cyl         -0.1    1.1
disp         0.0    0.0
hp           0.0    0.0
drat         0.8    1.7
wt          -3.7    2.0
qsec         0.8    0.8
vs           0.3    2.1
am           2.5    2.2
gear         0.7    1.5
carb        -0.2    0.9
sigma        2.7    0.4

Sample avg. posterior predictive distribution of y (X = xbar):
         Median MAD_SD
mean_PPD 20.1    0.7
Note the more sparse output, which Gelman promotes. You can get more detail with summary (mm), and you can also use shinystan to look at most everything that a Bayesian regression can give you. We can look at the values and CIs of the coefficients with plot (mm), and we can compare posterior sample distributions with the actual distribution with pp_check (mm, "dist", nreps=30):
I could go into more detail, but this is getting a bit long, and rstanarm
is a very nice package, too, so let’s wrap things up with a comparison of the two packages and some tips.
Both packages support a wide variety of regression models — pretty much everything you’ll ever need. Both packages use Stan, via rstan
and shinystan
, which means you can also use rstan
capabilities as well, and you get parallel execution support — mainly useful for multiple chains, which you should always do. Both packages support sparse solutions, brms
via Laplace or Horseshoe priors, and rstanarm
via Hierarchical Shrinkage Family priors. Both packages support Stan 2.9’s new Variational Bayes methods, which are much faster than MCMC sampling (an order of magnitude or more), but approximate and only valid for initial explorations, not final results.
Because of its pre-compiled-model approach, rstanarm
is faster in starting to sample for small models, and is slightly faster overall, though a bit less flexible with things like priors. brms
supports (non-ordinal) multinomial logistic regression, several ordinal logistic regression types, and time-series correlation structures. rstanarm
supports GAMMs (via stan_gamm4
). rstanarm
is done by the Stan/rstan
folks. brms
‘s make_stancode
makes Stan less of a black box and allows you to go beyond pre-packaged capabilities, while rstanarm
‘s pp_check
provides a useful tool for the important step of posterior checking.
Bayesian modeling is a general machine that can model any kind of regression you can think of. Until recently, if you wanted to take advantage of this general machinery, you’d have to learn a general tool and its language. If you simply wanted to use Bayesian methods, you were often forced to use very-specialized functions that weren’t flexible. With the advent of brms
and rstanarm
, R users can now use extremely flexible functions from within the familiar and powerful R framework. Perhaps we won’t all become Bayesians now, but we now have significantly fewer excuses for not doing so. This is very exciting!
First, Stan’s HMC/NUTS sampler is slower per sample, but better explores the probability space, so you should be able to use fewer samples than you might’ve come to expect with other samplers. (Probably an order of magnitude fewer.) Second, Stan transforms code to C++ and then compiles the C++, which introduces an initial delay at the start of sampling. (This is bypassed in rstanarm
.) Third, don’t forget the rstan_options
and options
statements I started with: you really need to run multiple chains and the fastest way to do that is by having Stan run multiple processes/threads.
Remember that the results of the stan_
plots, such as stan_dens
or the results of rstanarm
‘s plot (mod, "dens")
are ggplot2
objects and can be modified with additional geoms
. For example, if you want to zoom in on a density plot:
stan_plot (b2$fit, show_density=TRUE) + coord_cartesian (xlim=c(-15, 15))
(Note: you want to use coord_cartesian
rather than xlim
which eliminates points and screws up your plot.) If you want to jitter and adjust the opacity of pp_check
points:
pp_check (mm, check="scatter", nreps=12) + geom_jitter (width=0.1, height=0.1, color=rgb (0, 0, 0, 0.2))
Carlos A. Furuti has an excellent website with many projections and clear explanations of the tradeoffs of each. The main projection page has links to all types, including two of my favorites: Other Interesting Projections, and Projections on 3D Polyhedra. Enjoy!
In R, the packages maps and mapproj are your entrée to this world. I created the above map (a Mollweide projection, which is a useful favorite), with:
library (maps)
library (mapproj)
map ("world", projection="mollweide", regions="", wrap=TRUE, fill=TRUE, col="green")
map.grid (labels=FALSE, nx=36, ny=18)
I’ll start the series with a review of Kaiser Fung’s Numbersense, published in 2013. It’s not mainly about Real Data Science, but I’ll start with it because it’s a great book that illustrates several common data pitfalls, and in the epilogue Kaiser shares one of his own Real Data Science stories and I found myself nodding my head and saying, “Yup, that’s how I spent several days in the last couple of weeks!”
Numbersense is a wonderful and accessible book that consists of a series of stories about data that illustrate how to think about the kinds of statistics you read about on a daily basis. The emphasis isn’t mathematical, it’s more about when you should think, “Hmmm… that doesn’t sound right”, when you hear some statistics thrown around. The summary on the jacket cover mentions Big Data, but none of the principles depend on Big Data so they’re applicable in most any situation.
The Prologue throws out several short stories to illustrate how underlying assumptions can fool you, using situations like how Bill Gates was fooled about the efficacy of small schools versus larger schools, how airline on-time statistics can say totally different things depending on which direction you slice it, and how Mitt Romney’s pollsters allowed themselves to be blindsided by the elections. After that, each chapter revolves around a more in-depth story and can range from how Kaiser developed some metrics to help a friend in his Fantasy Football league, to looking at how economists seasonally adjust data or why economists are puzzled about consumer perceptions of inflation. (Turns out, seasonal adjustment is useful and important, but “core inflation” is misleading both from an economic policy viewpoint and from an understanding-consumers viewpoint.)
Kaiser’s a good story teller, and he dives deeply into the whole context and environment of each story. This gives the book a great flavor, though I’ll have to warn you that, depending on your interests, you’ll probably find one or two stories less interesting than the rest and might need to go into skim mode for those. In my case, the Bureau of Labor Statistics (BLS) chapter was fascinating to me in all of its texture and flavor, but the Fantasy Football chapter got too deeply into personalities for my taste.
Each story is real-world in the way that you don’t always know where Kaiser’s heading when it starts, which is a lot like data exploration in the real world. It can be a little disorienting to suddenly realize that economists were the good guys in the last story, but they’re the bad guys in the current story, but that’s a small price to pay.
Remember not to overlook the Epilogue! It’s a Real Data Science story, that illuminates the part of Data Science that you don’t usually read about.
One of my favorite topics in the book was counterfactuals. That is, what are the alternatives to what actually happened, and how do you determine an appropriate baseline for comparison of results before and after something changed? In a designed experiment, you establish treatment and control groups, and if the experiment is well-designed you’ll have a very good idea of the actual difference the treatment makes. But when you’re just handed some data, you only know what did happen and have to do some work to figure out what realistically would have happened if something hadn’t changed.
For example, your company brings up a new website and at the end of the year they proclaim how much new business the site has brought in. The ROI is incredible… until you realize that they’re making the assumption that 100% of the business done through the website would not have occurred if not for the new version. A more realistic baseline would acknowledge that some of the website business is due to the new website but some of it would have occurred anyhow — over the phone, in person, via mail, via the older website, etc — and the trick is to try to allocate the website’s business appropriately to find a realistic ROI. That’s counterfactuals.
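A toy calculation makes the point. Every number below is hypothetical, invented purely for illustration, and ROI is simplified to (gain - cost) / cost on revenue.

```r
# All figures are hypothetical, for illustration only
site_cost         <- 100000   # cost of building the new website
site_revenue      <- 500000   # business booked through the new website
would_have_anyway <- 0.7      # assumed share that would have arrived by phone, mail, old site, etc.

roi <- function (gain, cost) (gain - cost) / cost

# Naive baseline: none of the business would have happened without the site
roi_naive <- roi (site_revenue, site_cost)

# Counterfactual baseline: credit the site only with the incremental business
roi_adjusted <- roi (site_revenue * (1 - would_have_anyway), site_cost)

c(naive=roi_naive, adjusted=roi_adjusted)   # 4 versus 0.5
```

The hard part in practice is estimating the would_have_anyway share, which is exactly the counterfactual question.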
A great book that’s entertaining and educational, that you can read in bite-sized chunks, and that you’ll find useful in our data-intensive society.
This discussion on the Math Stack Exchange site addresses that issue. I highly recommend you read it.
Two things: 1) the two-person version I mentioned is logically fair but not fair in practice: make sure to find the comment that addresses that. And 2) there are stochastic and deterministic solutions, and most of them address problems that are harder than “fair” division: that no one involved in the process feels “deceived”, that everyone in the process feels that their piece is the best possible piece (and doesn’t envy the others), that no group can conspire to get better pieces, etc. Of course, there’s geek humor in abundance, too.
Let me start by saying that this is one of the best textbooks I’ve ever read. It was written as if the author was our mentor, and I really get the feeling that he’s sharing his wisdom with us rather than trying to be pedagogically correct. The book is full of insights on how he thinks about building and applying SEMs, and the lessons he’s learned the hard way.
It’s a little risky to endorse a highly-targeted book like this: it’s dedicated to longitudinal SEMs that don’t include categorical variables. But the discussion and advice that it has is so valuable that I’d recommend it to anyone who is interested in SEMs.
For example, take his discussion of scale setting: you have to nail down something regarding a latent variable (“construct”) in order to provide a defined scale. In addition to the way that most SEM software does it (fixing one of the factor loadings to 1), there are two other ways to do it, both of which provide several important advantages. (The book’s website provides example LISREL, Mplus, and R lavaan code illustrating select chapters, including the one that illustrates scaling.)
As another example, he talks about phantom constructs (latent variables), which is an expert’s trick for modifying your model to change coefficients into more interpretable forms. As one example, you can convert the SEM’s native covariance between latent variables into correlations, post-estimation, by simple algebra. But you couldn’t take two of these correlations (from different times in a longitudinal study, say) and use the model to do a chi-squared test of the significance of the difference between them. Using phantom constructs, with certain constraints, the model can yield correlations directly, so you could properly test their significance.
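The “simple algebra” for the post-estimation conversion is just dividing a covariance by the product of the two standard deviations. A minimal sketch in base R, using a made-up 2x2 latent covariance matrix (the names eta1 and eta2 and all values are hypothetical, not from the book):

```r
# Hypothetical covariance matrix for two latent variables (made-up values).
S <- matrix(c(4, 3,
              3, 9), nrow = 2, byrow = TRUE,
            dimnames = list(c("eta1", "eta2"), c("eta1", "eta2")))

# Correlation = covariance / (sd1 * sd2).
r_manual <- S["eta1", "eta2"] / sqrt(S["eta1", "eta1"] * S["eta2", "eta2"])

# Base R's cov2cor() applies the same rescaling to the whole matrix.
R <- cov2cor(S)

print(r_manual)
print(R)
```

This only gives a point estimate. It does not put the correlation inside the model, which is why the phantom-construct trick is needed for a proper test.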
I could go on, but the bottom line is the book is well-written, entertaining, enlightening, well-illustrated, insightful, and it covers many areas of basic SEM as well as the specifics of the more complicated longitudinal SEM. It’s $60 for a 380-page hardback, and worth every penny.
Notice in the above screenshot that variables can naturally contain spaces in their names, without the need to put quotation marks around them. To display a value or calculation, put “=>” after it, and Calca displays the resulting value in a grey box. Type unindented text and Calca figures out that it’s not calculations and formats it as text.
Of course, everything is dynamic, so if you change a variable, all the results throughout the document that depend on the variable also change. You can do these things using something like R with Markdown, and even more things like including graphs and using the full power of R, but for some things Calca is the way to go. (I’ve also found it helpful in conjunction with Stata, which doesn’t have the Markdown/reproducible tools that R has.)
It’s a great utility tool, and it’s inexpensive. The Mac App store makes it particularly easy, since it will be auto-updated as well. Highly recommended.
My wife and I have started listening to books on audio during car trips or in the evening, and we’ve discovered that there are some absolute gems in the children’s literature section of the library. Yep, kids’ stories aren’t just for kids anymore. (And I wish I’d had stories like this when I was growing up!) In particular, we’ve been listening to audio CDs and several of them have superb voice acting that really enhances the story. These stories show incredible imagination, and in this posting I’d like to highly recommend two series: The Larklight Trilogy and the Bartimaeus Sequence, especially the audio CDs.
The Larklight Trilogy[1] is a steam-punkish sci-fi/fantasy work set in the Victorian Era of an alternative reality where Sir Isaac Newton discovered alchemical secrets which enabled space travel. Space is filled with ether, the space ships are winged versions of sailing ships – many of which can fly faster than light – and most of the planets in our solar system have fascinating societies, inhabitants, and histories.
I really can’t do justice to the creativity of the series, without spoiling things. (I’d highly recommend not reading the Wikipedia entry or other spoilers.) The voice acting on the audio CD’s is first-rate and I’d highly recommend listening to it if you can. The story has a stunning range of imagination, and a well-thought-out arc across the trilogy.
The Bartimaeus Sequence[2] is a fantasy work that is, by coincidence, also set in the Victorian Era of an alternative reality where magicians rule the world, though magic does not work in the way you might imagine.
As with Larklight, I really can’t do justice to the series without spoiling it for you. And the voice acting on the audio CD is also first-rate. (I’d recommend starting with the first book, The Amulet of Samarkand, and save reading the prequel, The Ring of Solomon for last.)
If you like good sci-fi/fantasy and you’ve missed these children’s books, you simply must get these on audio CD and sit back and enjoy their wonderful worlds!
I highly recommend not reading spoilers, but for completeness, Wikipedia has an entry on the Larklight Trilogy. ↩
I highly recommend not reading spoilers, but for completeness, Wikipedia has an entry on the Bartimaeus Sequence. ↩
Here’s a screen capture of the Activity Monitor’s Energy tab (click on it to see it full-sized):
First, note that this displays (by default) processes that have run in the last 8 hours, not just currently-running processes. This is a new option for the Activity Monitor, and is necessary if you’re really going to see energy usage, even for processes that ran and stopped while you weren’t looking.
Second, note that it displays Energy Impact and Average Energy Impact values. If you have the battery status icon enabled in the menu bar, you can also pull it down and see the top “Apps Using Significant Energy”, which reflects the Energy Impact value. No better way to get app developers to take advantage of energy-saving features — and avoid energy-wasters like active polling — than to post them up.
Third, App Nap is an OS feature that apps can take advantage of to use less CPU, and Requires High Performance GPU indicates which apps have switched the system to the dedicated GPU instead of using the more energy-efficient Intel graphics. In conjunction with App Nap, the new OS also clumps process awake-times together, to make longer stints of idle time possible, further saving power. Pretty cool predictive stuff here.
Fourth, it turns out that the new OS compresses RAM, which actually saves CPU cycles and also energy-inefficient access to external storage. (It also makes the system more responsive, as a side benefit.)
Which all adds up to an OS upgrade that can extend your battery’s life by up to 25%. I find that astounding. Add in new CPUs, like Intel’s Haswell, and the industry’s making amazing battery-life strides.
COIN-OR’s CBC
COIN-OR’s CLP
COIN-OR’s SYMPHONY
GNU GLPK’s glpsol
lp_solve
Some considerations in choosing an open source solver:
1. Higher-level modeling language. If you’re going to do any serious linear programming, you’ll want to use a higher-level mathematical language like AMPL, GAMS, GMPL, or ZIMPL*. The alternative, working at the low-level rows/columns level, is insane for anything other than a small system of equations.
AMPL and GAMS are commercial products, which ruled them out for this project. ZIMPL is open-source, but GMPL is also open source and has two important advantages: 1) it’s directly read by several solvers, and 2) it’s a subset of AMPL so if you ever get to the point that you’re moving up to commercial products you’re ahead of the curve. With ZIMPL, you’d need to run your model through the ZIMPL processor to output MPS (a primitive, row/column format), which is implemented in most every solver, creating a two-step process, and possibly losing some accounting information along the way.
2. Ability to solve solvable problems. I found a couple of comparisons of commercial and non-commercial solvers, where they threw a suite of test problems at each product. The commercial solvers couldn’t solve about 20% of the test suite, but several of the open source solvers couldn’t solve 80%+ of the test suite. The open source CBC was more comparable to the commercial solvers.
3. Speed. Open source solvers lag far, far behind commercial solvers. Of the non-commercial solvers, SCIP was far and away the fastest, but SCIP must be licensed if used for non-academic purposes, so that ruled it out for my purposes. Of the remaining open source solvers, CBC stood out for its speed, being twice as fast as glpsol, lp_solve, etc.
4. Standalone? Do you intend to use the solver via an API, or do you want to use it as a standalone solver? Only GNU’s glpsol supports all of the I/O features of GMPL, including the display and printf statements, which means you can make fairly nice reports and can more easily read/write CSV or databases. CBC has some output options. And lp_solve seems to have I/O options with its XLI plugins, but I couldn’t get them to work.
CBC is the fastest and most powerful open source solver that might be used in a non-academic setting, it directly reads GMPL, and its output capabilities are acceptable, so it’s my winner. (It can also be compiled to take advantage of multi-core CPUs, though I’ve not hit it hard enough yet to see how well it scales.)
One trick that’s totally non-obvious is how you get the CBC standalone solver to read a GMPL file. The obscure answer is that in the command line, you add a percent sign to the end of the model file’s name:
cbc mymodel.mod% -threads 4 -printi csv -solve -solu mymodel.csv
which runs it on 4 cores, outputting the results in a CSV-formatted file called mymodel.csv, using the GMPL model file mymodel.mod. The trailing percent sign is not part of the file’s name, it’s just a flag. I can see a reason why they might’ve done it this obscure way, but it’s not clearly spelled out in the documentation.
You can also obtain glpsol and SYMPHONY, which both also read GMPL models, so you can triple-check results if you want. Hope this was helpful!
* In a more Seussian mode, this sentence should be, “AMPLE, GAMS, GMPL, and ZIMPL; ROFL, GUZL, TRMPL, and ZOOFL.” Or something like that. The last four “products” don’t exist, as far as I know. Though of course, there are actual (non-linear-programming) products called Hadoop, Mahout, Scalding, Pootle, and Django.
So far, I’m leaning towards OscaR, because it seems fairly complete and it’s in Scala which is the Official General-Purpose Language of Thinkinator. I’ve looked at a variety of CP tools like Zinc (seems dead) and MiniZinc (seems barely alive), Gecode (C++), JaCoP (Java-based), Choco (Java-based), Comet (seems dead), etc. It’s a bit worrisome when I compare the level of activity and the apparent long-term stability of these projects to linear programming (SYMPHONY, glpsol, lp_solve, etc).
Any experiences or insights on CP tools? (I’d prefer standalone tools with high-level syntax or Scala/Java-based tools, and not C++/Prolog-based tools.) Thanks!
I whipped out the Ashley Book and found the perfect bend, #1463, which is good for joining two cords that have different thickness. I’ve shown my results below, but note that the actual knot is in the upper left and the rest of it is just my quick work to make sure that the thicker piece stays bent and doesn’t straighten out and try to pull through. I don’t think Ashley would approve of that part.
The Ashley Book costs about $50 at Amazon, but it really is a delightful book. First, it shows 4,000 some knots, with some discussion of what knots are useful under what conditions. For example, the square knot (aka reef knot) is something that most of us learned as kids. Turns out that it’s a great knot for binding something, but one of its strengths is that you can yank on one of the loose ends and quickly loosen it — I didn’t know that — which renders it totally unsuitable in the situation I encountered: in bending two cords together. Us city-slickers don’t deal much with knots, but it turns out that this could be a life-threatening mistake if you thought a square knot was secure enough, say, to use for climbing out a window.
The Ashley Book’s negatives? It isn’t really a casual book, and is organized logically for a knot aficionado but a little confusingly for us landlubbers. Also, since it was published in 1944, it doesn’t include knots that might be more appropriate for modern synthetic ropes. It’s also a bit expensive for casual use, but I think the positives may well outweigh the negatives for many readers.
The Knot Book (the mathematical one) is okay, but ultimately lost my interest. The Ashley book is a fun read, which is full of a lot of history of commercial sailing. The author published the book in 1944, when he was 60, and as a youth he had actually gone to sea on an old-school whaling ship. He obviously talks a lot about knots used to secure things when sailing, but knots cover a lot of ground, including the balls and other shapes that keep the end of a rope from fraying or provide handholds, and ornamental knots. Turns out that illiterate sailors at sea needed to do something to keep from going crazy and there was lots of rope, so fancy knots made a lot of sense. Hard to imagine a tough sailor doing macrame, but there you have it.
For the real geek, it’s also fascinating to read the glossary and to see where phrases come from. For example, “the bitter end” is a rope/sailing term, as is “overhaul”, “carried away”, and a variety of other phrases. A phrase that was particularly illuminated by the book was “learning the ropes”. I’d always thought it meant learning which ropes to pull on, or learning how to climb in the rigging, but that’s only part of it. At a more basic level, tying the wrong knot or even tying the right knot but failing to dress it properly would result in the knot failing and someone dying or even the ship being lost. At an even more fundamental level, there were approximately 60 different kinds of rope used on sailing ships (probably only about 20 on any particular ship), so you literally had to know what kind of rope to use for what kind of task, as well as how to stow ropes properly, how to fix the ends so they wouldn’t fray, how to break them in and care for them, how to provide protection from friction, and how to know when they were likely to fail.
It’s funny how us users of R, Hadoop, and AWS are tempted to think we’re so so much smarter than those people who lived long ago. But think of the engineering sophistication of sixty different kinds of ropes for different purposes, along with dozens of different knots, and we haven’t even gotten to sails and their materials. Sort of like Conestoga Wagons, which used many different kinds of wood, each better suited for a particular part of the wagon. They were very smart people, they just didn’t have the materials that we have today.
Oh, it turns out that “Avast”, as in the pirate phrase “Avast ye landlubbers!” simply means to stop.
I’m currently taking a class called Functional Programming Principles in Scala, which is actually taught by the father of Scala, Martin Odersky. It’s a seven-week course where each Monday they open up a new set of video lectures and exercises. The lectures are broken down into 5-6 video segments totaling about 90 minutes, with pause points where they want you to work something out before it’s shown to you, plus quizzes. I believe this is fairly standard at Coursera. I’m not sure if it’s unique to this particular class, but they have a very nice slide system during the lecture where you see the text, but it’s nicely superimposed over a whiteboard where the instructor can also draw.
The weekly programming assignments are well-done and illustrate many principles at once. You’re not only covering the topic of the week, but you’re learning things like how to use sbt (Simple Build Tool) and scalaTest, which are standard build and unit-testing tools in the Scala world. Submission of your programming assignment is actually done via a single command in sbt, and the program is automatically tested and your score posted within about 5 minutes.
They also have discussion forums, which are frequented by moderators. (You couldn’t expect feedback from Martin himself, considering tens of thousands of people may be taking the course at once.)
I’m not sure if other courses live up to this standard, since this is my first, but it’s a great first impression and I can highly recommend it. I’ve already signed up for Machine Learning, which is by Stanford’s Andrew Ng (!), as well as Data Analysis, Principles of Reactive Programming (Scala and Martin Odersky again), and Discrete Optimization.
Obviously, I have a programming/data focus, but the courses cover a wide variety of topics from social networks and economics to biology and the arts. I’m impressed so far, and recommend that you check it out if you haven’t already.
Rssa and EMD.
In this posting, I’m going to document some of my explorations of the two methods, to hopefully paint a more realistic picture of what the packages and the methods can actually do. (At least in the hands of a non-expert such as myself.)
EMD (Empirical Mode Decomposition) is, as the name states, empirical. It makes intuitive sense and it works well, but there isn’t as yet any strong theoretical foundation for it. EMD works by finding intrinsic mode functions (IMFs), which are oscillatory curves that have (almost) the same number of zero crossings as extrema, and where the average maxima and minima cancel to zero. In my mind, they’re oscillating curves that swing back and forth across the X axis, spending an equal amount of time above and below the axis, but not having any fixed frequency or symmetry.
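As a rough illustration of the first IMF condition, the sketch below counts zero crossings and local extrema for a toy signal in base R. The helper names count_zero_crossings and count_extrema are my own, not part of the EMD or hht packages, and the counting is naive (it assumes no exact zeros or flat spots in the series):

```r
# Naive counters; they assume no exact zeros and no plateaus in the series.
count_zero_crossings <- function(x) sum(diff(sign(x)) != 0)
count_extrema        <- function(x) sum(diff(sign(diff(x))) != 0)

t <- seq(0, 1, length.out = 1000)
wave    <- sin(2 * pi * 5 * t + 0.3)  # swings evenly about zero
shifted <- wave + 2                   # same shape, but never crosses zero

zc_wave    <- count_zero_crossings(wave)
ex_wave    <- count_extrema(wave)
zc_shifted <- count_zero_crossings(shifted)
ex_shifted <- count_extrema(shifted)

# An IMF candidate needs the two counts to differ by at most one.
print(c(crossings = zc_wave, extrema = ex_wave))
print(c(crossings = zc_shifted, extrema = ex_shifted))
```

The plain sine passes the count test, while the shifted copy has plenty of extrema but no zero crossings, so it cannot be an IMF until the offset (a trend, in EMD terms) is removed.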
EMD is an iterative process, which pulls out IMFs starting with higher frequencies and leaving behind a low-passed time series for the next iteration, finally ending when the remaining time series cannot contain any more IMFs, this remainder being the trend. Each step of the iteration begins with fitting curves to the maxima and minima of the remaining time series, creating an envelope. The envelope is then averaged and the average is subtracted from the series, resulting in a proto-IMF which is iteratively refined in a “sifting” process. There is a choice of stopping criteria for the overall iterations and for the sifting iterations. Since the IMFs are locally adaptive, EMD has no problems with non-stationary and non-linear time series.
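A single sifting pass can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: the function names are hypothetical, the envelopes are natural cubic splines, and the boundary rule (pinning both envelopes to the end points) is cruder than anything the EMD or hht packages do:

```r
# Indices of strict local maxima and minima (assumes no plateaus).
local_extrema <- function(x) {
  d <- diff(sign(diff(x)))
  list(max = which(d < 0) + 1, min = which(d > 0) + 1)
}

# One sifting pass: subtract the mean of the upper and lower envelopes.
sift_once <- function(x, t = seq_along(x)) {
  ext  <- local_extrema(x)
  imax <- unique(c(1, ext$max, length(x)))  # crude boundary handling
  imin <- unique(c(1, ext$min, length(x)))
  upper <- spline(t[imax], x[imax], xout = t, method = "natural")$y
  lower <- spline(t[imin], x[imin], xout = t, method = "natural")$y
  x - (upper + lower) / 2
}

# Toy series: a fast oscillation riding on a slow linear trend.
t <- seq(0, 10, length.out = 1001)
x <- sin(2 * pi * 2 * t) + 0.5 * t
h <- sift_once(x, t)

print(c(mean_before = mean(x), mean_after = mean(h)))
```

On this toy series one pass already strips out most of the trend, leaving something close to the fast oscillation. A real implementation repeats the pass until a stopping rule is met.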
The magic of IMFs is that, being adaptive they tend to be interpretable, unlike non-adaptive bases which you might get from a Fourier or wavelet analysis. At least that’s the claim. The fly in the ointment is mode mixing: when one IMF contains signals of very different scales, or one signal is found in two different IMFs. The best solution to mode mixing is the EEMD (Ensemble EMD), which calculates an ensemble of results by repeatedly adding small but significant white noise to the original signal and then processing each noise-augmented signal via EMD. The results are then averaged (and ideally subjected to one last sifting process, since the average of IMFs is not guaranteed to be an IMF). In my mind, this works because the white noise cancels out in the end, but it tends to drive the signal away from problematic combinations of maxima and minima that may cause mode mixing. (Mode mixing often occurs in the presence of an intermittent component to the signal.)
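The “white noise cancels out in the end” intuition is easy to check on its own, leaving EMD out entirely. The sketch below (base R, toy signal, arbitrary noise level and ensemble size chosen by me) averages an ensemble of noise-augmented copies of a signal and compares the leftover noise with that of a single copy:

```r
set.seed(1)
n        <- 500
n_trials <- 100                      # ensemble size (arbitrary choice)
signal   <- sin(2 * pi * (1:n) / 50) # toy signal

# Each column is the signal plus an independent white-noise realization.
noisy <- replicate(n_trials, signal + rnorm(n, sd = 0.5))

ens_mean <- rowMeans(noisy)

sd_single   <- sd(noisy[, 1] - signal)  # noise left in one ensemble member
sd_ensemble <- sd(ens_mean - signal)    # noise left after averaging

print(c(single = sd_single, ensemble = sd_ensemble))
```

The residual noise shrinks roughly with the square root of the ensemble size. In EEMD the averaging is applied to the IMFs from each member rather than to the raw series, but the cancellation argument is the same.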
The R package EMD implements basic EMD, and the R package hht implements EEMD, so you’ll want to install both of them. (Note that EMD is part of the Hilbert-Huang method for calculating instantaneous frequencies, a super-FFT if you will, so these packages support more than just EMD/EEMD.)
As the Wikipedia page says, almost every conceivable use of EMD has been patented in the US. EMD itself is patented by NOAA scientists, and thus the US government.
SSA (Singular Spectrum Analysis) is a bit less empirical than EMD, being related to EOF (Empirical Orthogonal Function analysis) and PCA (Principal Component Analysis).
SSA is a subspace-based method which works in four steps. First, the user selects a maximum lag L (1 < L < N, where N is the number of data points), and SSA creates a trajectory matrix with L columns (lags 0 to L-1) and N – L + 1 rows. Second, SSA calculates the SVD of the trajectory matrix. Third, the user uses various diagnostics to determine what eigenvectors are grouped to form bases for projection. And fourth, SSA calculates an elementary reconstructed series for each group of eigenvectors.
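The four steps can be sketched in base R without any package. This is a bare-bones illustration, not how Rssa implements it: the function names are hypothetical, the trajectory matrix follows the layout described above (N - L + 1 rows, L columns), and the grouping is passed in by hand:

```r
# Steps 1 and 2: build the trajectory matrix and take its SVD.
ssa_sketch <- function(x, L) {
  N <- length(x)
  K <- N - L + 1
  # Row i holds x[i], x[i+1], ..., x[i+L-1]; column j is lag j - 1.
  X <- outer(1:K, 1:L, function(i, j) x[i + j - 1])
  list(X = X, svd = svd(X), N = N)
}

# Steps 3 and 4: project onto a chosen group of eigenvectors, then
# average along anti-diagonals to get back a series of length N.
reconstruct_sketch <- function(fit, group) {
  s  <- fit$svd
  Xg <- s$u[, group, drop = FALSE] %*%
        diag(s$d[group], length(group)) %*%
        t(s$v[, group, drop = FALSE])
  idx <- row(Xg) + col(Xg) - 1   # entries with equal idx share a time point
  as.numeric(tapply(as.vector(Xg), as.vector(idx), mean))
}

# Toy series: linear trend plus a 12-point cycle.
x   <- 0.05 * (1:120) + sin(2 * pi * (1:120) / 12)
fit <- ssa_sketch(x, L = 48)

# Using every component must give back the original series.
full <- reconstruct_sketch(fit, seq_along(fit$svd$d))

# A hand-picked group (here simply the two leading components).
lead <- reconstruct_sketch(fit, 1:2)
print(round(fit$svd$d[1:6], 3))
```

Reconstructions from disjoint groups add up to the original series, which is what makes the decomposition usable for separating trend, cycles, and noise.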
The ideal grouping of eigenvectors is in pairs, where each pair has a similar eigenvalue but differing phase, which usually corresponds to sine-cosine-like pairs. The choice of L is important, and involves two considerations: 1) if there is a periodicity to the signal, it’s good to choose an L that is a multiple of the period, and 2) L should be a little less than N/2, to balance the error and the ability to resolve lower frequencies.
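The two considerations for L can be combined into a one-line rule of thumb. The helper below is my own (choose_L is not an Rssa function), and the example assumes a monthly series of 476 points, which is my understanding of the length of the gas series used later:

```r
# Largest multiple of `period` that does not exceed N / 2.
# Returns 0 when the series is shorter than two periods.
choose_L <- function(N, period) {
  floor((N / 2) / period) * period
}

L_monthly <- choose_L(476, 12)  # assumed length of a monthly series
print(L_monthly)
```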
The two flies in SSA’s ointment are: 1) issues relating to complex trends, and 2) the inability to differentiate two components that are close in frequency. For the first problem, one proposed solution is to choose a smaller L that is a multiple of any period, and use that to denoise the signal, with a normal SSA operating on the denoised signal. For the second problem, several iterative methods have been proposed, though the R package does not implement them.
The R package Rssa implements basic SSA. Rssa is very nice and has quite a few visualization methods, and to be honest I prefer the feel I get from it over the EMD/hht packages. However, while it allows for manually working around issue 1 from the previous paragraph, it doesn’t address issue 2, which puts more of the burden on the user to find groupings, and even then this often can’t overcome the problem.
SSA seems to have quite a few patents surrounding it as well, though it appears to have deeper historical roots than EMD, so it might be a bit less encumbered overall than EMD.
Having talked about each method, let’s walk through the decomposition of a time series, to see how they compare. Let’s use the gas sales data from the forecast package:
data (gas, package="forecast")
And we’ll use EMD first:
library (EMD)
library (hht)
ee <- EEMD (c(gas), time (gas), 250, 100, 6, "trials")
eec <- EEMDCompile ("trials", 100, 6)
I’m letting several important parameters default, and I’ll discuss some of them in the next section. We’ve run EEMD with explicit parameter choices of: noise amplitude of 250, ensemble size of 100, up to 6 IMFs, and store each run in the directory trials. (EEMD is a bit awkward in that it stores these runs externally, but with a huge dataset or ensemble it’s probably necessary.) This yields a warning message, I believe because some members of the ensemble have the requested 6 IMFs, but some only have 5, and I assume that it is leaving them out. I have encountered such issues when doing my own EEMD before hht came out: not all members of each ensemble have the same number of IMFs, as the white noise drives them in more complex or simpler directions.
Let’s do the same thing with SSA:
library (Rssa)
gas.ssa <- ssa (gas, L=228)
gas.rec <- reconstruct (gas.ssa, list (T=1:2, U=5, M96=6:7, M12=3:4, M6=9:10, M4=14:15, M2.4=20:21))
We’ve chosen a lag of 228, which is the multiple of 12 (monthly data) just below half of the time series’ length. For the reconstruction, I’ve chosen the pair 1 and 2 (1:2) as the trend, pair 3:4 appears to be the yearly cycle, and so on, naming each one with names that make sense to me: “T” for “Trend”, “U” for “Unknown”, “M6” for what appears to be a 6-month cycle, etc. I’ll come back to some diagnostic plots for SSA that gave me the idea to use these pairs, but first let’s compare results. The trends appear similar:
plot (eec$tt, eec$averaged.residue[,1], type="l")
lines (gas.rec$T, col="red")
though the SSA solution isn’t flat in the pre-1965 era, and shows some high-frequency mixing in the 1990’s. The yearly cycles also appear similar:
plot (eec$tt, eec$averaged.imfs[,2], type="l")
lines (gas.rec$M12, col="red")
with the EEMD solution showing more variability from year to year, which might be more realistic or it might simply be an artifact. We could compare other components, though there is not necessarily a one-to-one correspondence because I chose groupings in the SSA reconstruction. One last comparison is a roughly eight-year cycle that both methods found, where again the EEMD result is more irregular:
plot (eec$tt, eec$averaged.imfs[,4], type="l")
lines (gas.rec$M96, col="red")
SSA requires more user analysis to implement, and also seems as if it would benefit more from domain knowledge. If I knew better how to trade off the various diagnostic outputs and knew a bit more about the natural gas trade, I believe I could have obtained better results with SSA. As it stands, I applied both methods to a domain I do not know much about and EEMD seems to have defaulted better. Rssa is also handicapped in comparison to EEMD via hht because basic SSA has similar problems to basic EMD, though hopefully Rssa will implement extensions to the algorithm, such as those suggested in Golyandina & Shlemov, 2013, placing them on a more even footing.
Note from the R code for the graphs that SSA preserves the ts attributes of the original data, while EMD does not, which is one of several very convenient features.
OK, since I love graphs, let’s do one last comparison of denoising, where we skip my pair choices. The EEMD solution uses 6 components plus the residual (trend), for a total of 7. The rough equivalent for SSA would then be 7 eigenvector pairs, so let’s just pick the first 14 eigenvectors and mix them all together and see what we get:
r <- reconstruct (gas.ssa, list (T=1:14))
plot (gas, lwd=3, col="gray")
lines (r$T, lwd=2)
Which matches well, except for the flat trend around 1965, and looks very smooth. The EEMD solution is (leaving out the first IMF to allow for a little smoothing):
plot (gas, lwd=3, col="gray")
lines (c(eec$tt), eec$averaged.residue + apply (eec$averaged.imfs[,2:6], 1, sum), lwd=2)
which is also reasonable, but it’s definitely not as smooth and has some different things happening around 1986. Is this more realistic or quirky? Unfortunately, I can’t tell you. Is this a fair comparison? I believe so, since EEMD was attempting to break the signal down into 7 components, plus noise, and SSA ordered the eigenvectors and I picked the first N. Is it informative? I’m not sure.
Let’s consider the dials and knobs that we get for each method. With SSA, we have the lag, L, the eigenvector groupings, and some choices of methods for things like how the SVD is calculated. With EEMD, we have the maximum number of IMFs we want, a choice of five different stopping rules for sifting, a choice of five different methods for handling curve fitting at the ends of the time series, four choices for curve fitting, the size of the ensemble, and the size of the noise.
So EEMD has many more up-front knobs, though the defaults are good and the only knobs we need to be very concerned with are the boundary handling and the size of the noise. The default boundary for emd is “periodic”, which is probably not a good choice for any non-stationary time series, but fortunately the default for EEMD is “wave”, which seems to be quite clever at time series endpoints. The noise needs to be sufficiently large to actually push the ensemble members around without being so large as to swamp the actual data.
On the other hand, SSA has a whole series of diagnostic graphs that need to be interpreted (sort of like Box-Jenkins for ARIMA on steroids) in order to figure out what eigenvectors should be grouped together. For example, a first graph is a scree plot of the singular values:
plot (gas.ssa)
From which we can see the trend at point 1 (and maybe 2), and the obvious 3:4 and 6:7 pairings. We can then look at the eigenvectors, the factor vectors, reconstructed time series for each projection, and phase plots for pairings. (Phase plots that have a regular shape — circle, triangle, square, etc — indicate that the pair is working like sin-cosine pairs with a frequency related to the number of sides. This is preferred.) Here’s an example of the reconstructions from the first 20 eigenvectors:
plot (gas.ssa, "series", groups=1:20)
You can see clearly that 6:7 are of similar frequency and are phase-shifted, as are 18:19. You can also see that 11 has a mixing of modes where a higher frequency is riding on a lower carrier, and 12 has a higher frequency at the beginning and a lower frequency at the end. An extended algorithm could minimize these kinds of issues.
There are several other SSA diagnostic graphs I could make (the graph at the top of the article is a piece of a paired phase plot), but let’s leave it at that. Rssa also has functions for attempting to estimate the frequencies of components, for “clusterifying” raw components into groups, and so on. Note also that Rssa supports lattice graphics and preserves time series attributes (tsp info), which makes for a more pleasant experience.
I prefer the Rssa package. It has more and better graphics, it preserves time series attributes, and it just feels more extensive. It suffers in comparison to EMD/hht because it does not (yet) implement extended methods (à la Golyandina & Shlemov, 2013), and because SSA requires more user insight.
Neither method appears to be “magical” in real-world applications. With sufficient domain knowledge, they could each be excellent exploratory tools since they are data-driven rather than having fixed bases. I hope that I’ve brought something new to your attention and that you find it useful!
The small complication is that the NOAA forecasts cover three-month periods rather than single months: JFM (Jan-Feb-Mar), FMA (Feb-Mar-Apr), MAM (Mar-Apr-May), etc. So, in this posting, we’ll briefly describe how to turn a series of these overlapping three-month forecasts into a series of monthly approximations.
The first assumption we’ll make is that each three-month forecast is the average of three monthly forecasts. (My guess would be that NOAA is actually forecasting three-month periods.) If we had some random three-month average temperatures, we wouldn’t be able to figure out the numbers behind the averages, but in this case we have overlapping three-month forecasts, which corresponds to a three-month running average. Since most of the months’ values are thus reflected in three averages, we have enough information to pull the underlying values out in a principled manner.
Let’s work forwards from monthly values to a running average, to see how we would reverse the process. Say we had seven months of values and wanted to create a three-month running average. If we want each resulting value to reflect three months — no averages of only one or two months — we would multiply our monthly (column) vector with this matrix:
0.333 0.333 0.333 0.000 0.000 0.000 0.000
0.000 0.333 0.333 0.333 0.000 0.000 0.000
0.000 0.000 0.333 0.333 0.333 0.000 0.000
0.000 0.000 0.000 0.333 0.333 0.333 0.000
0.000 0.000 0.000 0.000 0.333 0.333 0.333
Which you can see will result in a five-value column vector where each value is the (rolling) average of three months. In order to reverse this process, we simply need to find the inverse of this matrix and multiply the running average vector. Of course, since the matrix is not square, we’ll have to use the pseudo-inverse.
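The seven-month case can be tried out directly in base R. In this sketch the monthly values are made up, and pinv is a small hand-rolled Moore-Penrose pseudo-inverse built on svd() (a stand-in I wrote for illustration, not a function from any package):

```r
# 5 x 7 three-month running-average matrix.
n   <- 7
avg <- matrix(0, nrow = n - 2, ncol = n)
for (i in 1:(n - 2)) avg[i, i:(i + 2)] <- 1 / 3

# Moore-Penrose pseudo-inverse via the SVD: V diag(1/d) U'.
pinv <- function(A, tol = 1e-10) {
  s    <- svd(A)
  keep <- s$d > tol * s$d[1]
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

monthly <- c(30, 35, 45, 55, 65, 75, 80)  # hypothetical monthly values
running <- avg %*% monthly                # five running averages

est <- pinv(avg) %*% running              # recovered monthly estimates

# Averaging the estimates reproduces the running averages.
print(round(cbind(running, avg %*% est), 3))
print(round(c(est), 2))
```

With seven unknowns and only five averages the system is underdetermined, so the pseudo-inverse returns the minimum-norm set of monthly values that reproduces the averages. It need not match the original monthly values exactly.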
Let’s run an example! Say that I had 13 three-month forecasts: SON (Sep-Oct-Nov), OND, through SON. These forecasts would actually cover a 15-month period from September of this year through November of next year. From NOAA, I retrieved the following forecast for DCA:
SON 2013 59.44
OND 49.92
NDJ 42.97
DJF 40.05
JFM 2014 42.42
FMA 48.90
MAM 56.79
AMJ 65.45
MJJ 72.26
JJA 75.99
JAS 74.56
ASO 68.17
SON 59.44
which in R would be:
> temps3 <- c(59.44, 49.92, 42.97, 40.05, 42.42, 48.90, 56.79, 65.45, 72.26, 75.99, 74.56, 68.17, 59.44)
In order to create and invert our averaging matrix, I’ll use the Matrix and matrixcalc libraries:
> library (Matrix)
> library (matrixcalc)
> avg <- as.matrix (band (matrix (1, nrow=15, ncol=15), -1, 1)[-c(1, 15),]) / 3
> round (avg, 2)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
 [1,] 0.33 0.33 0.33 0.00 0.00 0.00 0.00 0.00 0.00  0.00  0.00  0.00  0.00  0.00  0.00
 [2,] 0.00 0.33 0.33 0.33 0.00 0.00 0.00 0.00 0.00  0.00  0.00  0.00  0.00  0.00  0.00
 [3,] 0.00 0.00 0.33 0.33 0.33 0.00 0.00 0.00 0.00  0.00  0.00  0.00  0.00  0.00  0.00
 [4,] 0.00 0.00 0.00 0.33 0.33 0.33 0.00 0.00 0.00  0.00  0.00  0.00  0.00  0.00  0.00
 [5,] 0.00 0.00 0.00 0.00 0.33 0.33 0.33 0.00 0.00  0.00  0.00  0.00  0.00  0.00  0.00
 [6,] 0.00 0.00 0.00 0.00 0.00 0.33 0.33 0.33 0.00  0.00  0.00  0.00  0.00  0.00  0.00
 [7,] 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.33 0.33  0.00  0.00  0.00  0.00  0.00  0.00
 [8,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.33  0.33  0.00  0.00  0.00  0.00  0.00
 [9,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33  0.33  0.33  0.00  0.00  0.00  0.00
[10,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.33  0.33  0.33  0.00  0.00  0.00
[11,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.00  0.33  0.33  0.33  0.00  0.00
[12,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.00  0.00  0.33  0.33  0.33  0.00
[13,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  0.00  0.00  0.00  0.33  0.33  0.33
> iavg <- svd.inverse (avg)
Now we’re ready to multiply our three-month rolling average by this inverse:
> temps1 <- c(t(temps3) %*% t(iavg))
> temps1
 [1] 67.39 59.02 51.91 38.83 38.17 43.15 45.94 57.61 66.82 71.92 78.04 78.01 67.63 58.87 51.82
> mean (temps1[1:3])
[1] 59.44
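The check above only looks at the first three months. A fuller check, sketched here under a couple of assumptions, re-averages all 15 monthly estimates and compares them with all 13 NOAA values. To keep it self-contained, this version builds the averaging matrix with a plain loop and swaps in `ginv` from the MASS package (bundled with standard R installations) in place of `svd.inverse`; both are meant to compute the pseudo-inverse, but the swap is my substitution, not the method used above. The name `max_gap` is likewise just an illustrative choice.

```r
library(MASS)  # provides ginv(), a Moore-Penrose generalized inverse

# NOAA three-month forecasts for DCA, SON 2013 through SON 2014.
temps3 <- c(59.44, 49.92, 42.97, 40.05, 42.42, 48.90, 56.79,
            65.45, 72.26, 75.99, 74.56, 68.17, 59.44)

# 13 x 15 averaging matrix: row i averages months i, i + 1 and i + 2.
avg <- matrix(0, nrow = 13, ncol = 15)
for (i in 1:13) avg[i, i:(i + 2)] <- 1 / 3

# 15 monthly estimates from the 13 overlapping averages.
temps1 <- c(ginv(avg) %*% temps3)

# Largest absolute gap between the re-averaged estimates and NOAA's values;
# it should be zero up to floating-point error.
max_gap <- max(abs(c(avg %*% temps1) - temps3))
max_gap < 1e-8

# Compare the spread of the monthly estimates with that of the averages.
range(temps1)
range(temps3)
```

If every re-averaged value matches, the monthly series is consistent with the full set of forecasts, not just the first one.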
Yep, looks like we reversed the process. A graph of the NOAA three-month averages (in pink) and the results of this process (in blue) is shown at the top of the posting. They’re pretty similar, of course, though the highs go higher and the lows go lower, which is exactly what we’d expect, since an average will pull extremes towards the mean.
Hope you find this helpful!