I did a quick Google search on “Stata for R users” (both as separate words and as a quoted phrase) and there really isn’t much out there. At best, there are a couple of equivalence guides that show you how to do certain tasks in both programs. (Plus a whole lot of “R for (ex-) Stata users” articles.) I’m writing this post, as a long-term R user who recently bought Stata, because I believe that Stata is a good complement to R, and many R users should consider adding it to their toolbox.
I’m going to write this in two parts. Part One will describe why an R user might be interested in Stata, with various Stata examples. Part Two will give specific tips and warnings to R users who do decide to use Stata.
Why use Stata?
Stata has several areas where it is arguably more capable or more integrated than R. For example, Stata is quite strong with panel data, and survey data. Stata has a large array of regression techniques, ranging from OLS to 2SLS, 3SLS, and SUR. Stata supports many time series techniques, and has some unique date features. It also has a well-integrated multiple-imputation (MI) feature, flexible graphics, and a solid Structural Equation Models (SEM) capability that includes a visual editor. Let’s look at seven Stata strengths:
1. Survey data
You use the svyset command to describe the survey’s design, and then you can use many of Stata’s standard commands to process the survey data, accounting for the design. This includes the main descriptive statistics, eight types of linear regression (including non-linear and constrained regression), SEMs (Structural Equation Models), survival-data regression, binary-response regression, discrete-response regression, Poisson regression, instrumental-variables regression, and a few other options.
For example, to (OLS) regress y on x1, x2, the interaction of x1 and x2, and x3, you’d type:
regress y x1##x2 x3
After describing the design with the svyset command, you could then do the same regression for your survey data with:
svy: regress y x1##x2 x3
Which is very nice. Note that the Stata equivalent of R’s * in a formula is ##, and that depending on whether x1 and x2 are factors or continuous, the syntax would be slightly different from what I have here. Similarly, Stata’s equivalent of R’s : is #.
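Here is a minimal sketch of that workflow on simulated data, so it can be pasted into a fresh Stata session. Everything about the data is made up for illustration: the variable names (strat_id, psu_id, samp_wt), the design (4 strata with 5 clusters each), and the weights. The c. prefix marks x1 and x2 as continuous; for a categorical variable you would use i. instead.

```stata
* Simulated data; the design variables and weights are hypothetical
clear
set seed 12345
set obs 200
* 4 strata of 50 observations, with 5 clusters (PSUs) nested in each
generate strat_id = ceil(_n / 50)
generate psu_id   = ceil(_n / 10)
* Made-up sampling weights
generate samp_wt  = 1 + runiform()
generate x1 = rnormal()
generate x2 = rnormal()
generate x3 = rnormal()
generate y  = 1 + x1 + x2 + 0.5*x1*x2 + x3 + rnormal()

* Describe the design once
svyset psu_id [pweight = samp_wt], strata(strat_id)

* Ordinary OLS, ignoring the design
regress y c.x1##c.x2 x3

* The same models, now accounting for the design
svy: mean y
svy: regress y c.x1##c.x2 x3
```

The point estimates from the two regress calls will differ because of the weights, and the standard errors will differ because of the clustering and stratification.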
2. Multiple Imputation
Similar to survey data, you do the imputation step, then use an mi estimate: prefix in front of most of the same commands as with survey data, above. SEMs aren’t an option, but instead you get many of the panel data (xt) functions (see #4, below), and you can combine mi estimate: and svy: to work with imputed survey data. Imputation methods aren’t as varied as in the various R MI packages, but include several types of regression and chained equations.
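As a rough sketch of the impute-then-estimate pattern, the following simulates a small dataset, deletes some values of x1 at random, imputes them by linear regression, and fits a model across the imputed datasets. The data, the 20% missingness rate, and the choice of five imputations are all assumptions for the sake of the example.

```stata
* Simulated data with values missing at random; all names are hypothetical
clear
set seed 12345
set obs 200
generate x2 = rnormal()
generate x3 = rnormal()
generate x1 = 0.5*x2 + rnormal()
generate y  = 1 + x1 + x2 + x3 + rnormal()
* Knock out roughly 20% of x1
replace x1 = . if runiform() < 0.2

* Declare the data as mi data and say which variable needs imputing
mi set mlong
mi register imputed x1

* Imputation step: linear regression imputation, 5 imputed datasets
mi impute regress x1 x2 x3 y, add(5) rseed(12345)

* Estimation step: fit the model in each dataset and pool the results
mi estimate: regress y x1 x2 x3
```

For survey data, the design would be declared on the mi data first (Stata provides mi svyset for this), and the estimation line would then start with mi estimate: svy: in place of mi estimate:.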
3. Time series data
Stata features the usual types of models including ARIMA/ARFIMA, ARCH/GARCH, UCM (decomposition), Dynamic Factor, State Space, and VEC/VAR/SVAR models. Also multiple types of filters (BK, BW, CF, HP), multiple types of smoothing (exponential, double exponential, HW, MA, and non-linear), Impulse Response Functions, and Forecast Error Variance Decompositions. But it’s the little touches that are cool. Among other things, lags and diffs are first-class citizens in Stata’s regression formulas. For example, you can say:
regress y L1.x x D1.z
which has the obvious meaning of regressing y on the first lag of x, x itself, and the first difference of z. When you’re using first lags or differences, you can leave off the 1. You can also use S. for seasonal differences and F. for leads, and write things like L(1/3).x, which is equivalent to L1.x L2.x L3.x.
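To see these operators in action, here is a small sketch on a simulated series. The time index t and the data-generating process are invented for illustration. The operators only work once the data have been declared as a time series with tsset.

```stata
* Simulated time series; the variables are hypothetical
clear
set seed 12345
set obs 60
generate t = _n
* Declare the time variable so L., D., F., and S. know what "previous" means
tsset t

generate x = rnormal()
* z is a random walk: the running sum of white noise
generate z = sum(rnormal())
* The operators also work inside generate; the first observation is missing
generate y = 0.5*L.x + x + 0.3*D.z + rnormal()

* These two lines fit the same model
regress y L1.x x D1.z
regress y L.x x D.z

* A range of lags plus a lead, in one compact specification
regress y L(1/3).x F.x
```

Observations lost to lags, leads, and differences are dropped from each regression automatically, so the sample size shrinks as the lag range grows.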
Further, you can create and apply business calendars, where you indicate what dates are not business days and Stata will then ignore those days when doing date arithmetic, lags, leads, etc. The non-business days can be regular or irregular.
4. Panel Data (aka Longitudinal Data)
Unlike survey data, where you use (many of) the regular commands with the svy: prefix, panel data is implemented as separate commands whose names start with “xt”: xtgee, etc. You set your panel description with xtset and then use the various xt commands.
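A minimal sketch of the xtset-then-xt pattern, again on simulated data: 50 hypothetical units observed for 6 years each, with a unit-level effect u. The variable names and the model are assumptions for illustration. xtreg is one of the xt commands, used here alongside the xtgee mentioned above.

```stata
* Simulated balanced panel; all names are hypothetical
clear
set seed 12345
set obs 50
generate id = _n
* Turn each unit into 6 rows, one per year
expand 6
bysort id: generate year = 2000 + _n

* Declare the panel structure: unit identifier, then time variable
xtset id year

generate x = rnormal()
* Unit-level effect: draw per row, then copy the first draw within each unit
generate u = rnormal()
bysort id (year): replace u = u[1]
generate y = 1 + 2*x + u + rnormal()

* Fixed-effects and random-effects linear models
xtreg y x, fe
xtreg y x, re

* Population-averaged model via GEE with an exchangeable correlation
xtgee y x, family(gaussian) link(identity) corr(exchangeable)
```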
5. Other major sections of Stata
Listing the subjects of the standalone PDF manuals that I haven’t already touched on: Graphics, Multilevel Mixed Effects, Multivariate Statistics, Power and Sample Size, Survival Analysis, Treatment Effects, Programming (in Stata), and Mata (a LAPACK-backed matrix and programming language).
6. Internet connectivity and help
When you type help reg you’ll get the help page on regression, which has the option to jump to the appropriate PDF documentation. But you can also type help timeseries regression, which will get you a list with multiple commands that are built into Stata, along with links to online commands that match. With a single click, you can download and install these commands. The main, CRAN-like repository is SSC, and there’s also an ssc command to look at the repository, read help files, and choose to download and install. I get the feeling that these options are not always as robust as R packages, but they are much more convenient to find and examine than CRAN. Stata and installed packages are also updated automatically online.
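For a feel of what this looks like at the command line, here is a short sketch. It needs an internet connection, and estout is just one well-known SSC package picked as an example; substitute whatever package you are interested in.

```stata
* Built-in help for an official command
help regress

* Keyword search across official help and user-written packages on the net
findit time series regression

* Browse SSC: the most popular packages, then one package's description
ssc hot
ssc describe estout

* Download and install a package from SSC
ssc install estout

* Check for updates to official Stata and to installed user-written packages
update query
adoupdate
```

Run without options, update query and adoupdate only report what is available; they do not change anything until you ask them to update.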
7. Miscellaneous
The Stata community is friendly and has a heavy emphasis on reproducible statistics. You can drive Stata entirely from the command line or entirely from the GUI, or combine the two freely. Whenever you use a GUI option, you see the command printed in the results log, so you can learn it, reuse it, modify it, or save it for future use (in a do-file) or for reproducibility.
StataCorp LP is full of friendly and helpful people and is pleasant to interact with. Stata has three price points: IC (standard edition), SE (enhanced), and MP (multiprocessor), and for students you can choose a low price for a one-year license or a higher (but still nice) price for a perpetual license. For the last several years, Stata has been on a two-year release cycle with the next major version coming out in June/July, and Stata 13 just came out in June of 2013.