In Part 1 of this Stata for R users series, I mentioned many of the strengths of Stata that might be attractive to R users. I forgot several important strengths, which I’ll add and expand in Part 3. The more I work with it, the more I’m impressed with Stata.
In this Part 2, as promised, I’ve listed several thoughts and tips for R users who are new to Stata.
Notes for R Users
Most commands operate on the Dataset, which is like an R data.frame, with numbered rows called “observations” and named columns called “variables”. You can view and edit this dataset in the spreadsheet-like Data Editor or Data Browser views. You can import and export data in a wide variety of formats, though the native format is Stata’s .dta file format, which stores a lot more information than what you see in the Data Browser.
This is perhaps the hardest thing to get your head around as an R user. You do most of your statistical analysis in this central dataset, adding variables, dropping variables, ordering variables, renaming variables, sorting observations, etc. You can save and load different datasets during a single session — for example training data then test data — and you may end up butchering your dataset to accomplish one goal, then reverting back once you’re done.
As you’ll find out below, there are really a lot of other variables hiding beneath the surface, but when you’re not doing serious programming, you’ll live in the dataset, and there are a whole lot of techniques for slicing and dicing it to accomplish your goals. You’ll probably want to master the reshape command, which is used in many more ways than its R counterpart.
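As a small illustration of reshape, here is a sketch that turns a made-up wide dataset (one income column per year) into long form. The variable names id, inc2010, and inc2011 are invented for this example.

```stata
* hypothetical wide data: one row per person, one income column per year
clear
input id inc2010 inc2011
1 100 110
2 200 220
end

* go from wide to long: one row per person-year, with the year taken
* from the numeric suffix of the inc* variables
reshape long inc, i(id) j(year)
list

* and back again to the original wide layout
reshape wide inc, i(id) j(year)
list
```

The i() option names the identifier and j() names the new variable that holds the suffix, so the same command covers what would take a melt/cast pair in R.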
There’s an elaborate wild-card system for using and modifying variable names, which I won’t describe here. Just remember that if you decide you like to use the dash (e.g. regress y x1-x3) in variable lists, you’ll need to make sure that the variables are order’d as you think they should be.
Another thing to note is that almost all commands take an if or an in option, which allows you to select observations based on conditional tests of variables or based on row numbers, respectively. So in depends totally on how you’ve sort’d the dataset. The in range is specified like 1/5 meaning the first five (1..5), with negative numbers meaning “from the end”, so -5/l (that is a lowercase letter l, for “last”) would mean the last five.
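Here is a quick sketch of if and in using the auto dataset that ships with Stata. Note that the end of the range for the last five observations is written with the letter l (for “last”), not the digit 1.

```stata
sysuse auto, clear
sort price

* first five observations after sorting: the five cheapest cars
list make price in 1/5

* last five observations: the five most expensive cars
list make price in -5/l

* if selects on values rather than on position
list make price if foreign == 1 & price > 10000

* the two can be combined with any command, not just list
count in -5/l
```

Because in works on row position, re-sorting the data changes which observations it picks, while if is unaffected by sort order.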
Once you have your input dataset in shape, you add more variables with options from various commands (usually something like a generate(newVarName) option), by combining other variables in straightforward ways with gen, or by using gen’s extended version egen to do more complicated generation.
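The three routes might look like this on the auto dataset. The new variable names (price_k, mean_price, above_mean, and the rep stub) are just names I picked for the sketch.

```stata
sysuse auto, clear

* gen: a straightforward combination of existing variables
generate price_k = price / 1000

* egen: group-level statistics, here the mean price within each origin
egen mean_price = mean(price), by(foreign)

* gen again, building on the egen result (1 if true, 0 if false)
generate above_mean = price > mean_price

* a generate() option on another command: tabulate creates one
* indicator variable per level of rep78, named rep1, rep2, ...
tabulate rep78, generate(rep)
describe rep*
```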
Help is your friend, and Stata loves the internet
The Stata equivalent of R’s ? is help, and its equivalent of R’s ?? is search. In Stata 13, if help finds a match, it displays that single entry, but if it does not, it automatically does a search. The search command will search the local Stata help and also internet help, which includes Stata Journal articles and the Stata equivalent of CRAN, SSC.
This is incredibly powerful. For example, if you see a web page that mentions the command ivreg2 and you type help ivreg2, it will find the ivreg2 command on SSC and you can click to read the help page and then click to install the ivreg2 ado file (the Stata file that implements the command), making it immediately available for use. CRAN on steroids. (You can also type ssc hot to see a list of the top 10 most downloaded ado files from SSC.)
If you want to find out more about the regress command, type help regress. In the help page, there are hyperlinks, and at the top of the window you’ll see “Dialog”, “See Also”, and “Jump To” menus which are very helpful. Dialog will open the GUI dialog for regress. The first option in See Also will take you to regress’s entry in the PDF version of the help files, which has more details, and you can also go to “regress postestimation” which tells you about the commands you can use after doing a regression: doing a heteroskedasticity test, calculating AIC/BIC, doing diagnostic plots, etc.
Use help often, and use it with multiple words where appropriate, such as help logistic regression.
Stata doesn’t have a variable type factor, but it does have the equivalent concept. Say you have a numeric variable, size, which has the values 1, 2, or 3, and you want to use it as a factor. You could use it in an interaction with the continuous variable weight:
regress cost size#c.weight
The # is equivalent to R’s : operator. The # automatically prefixes both of its operands with i., which treats the operand as a set of virtual indicator variables, like a factor, so I explicitly added c. in front of weight to specify that weight is to be treated as continuous. (More on these formulas, below.)
To simplify your life, you can create a value label, and apply it to size:
label define size 1 "Small" 2 "Medium" 3 "Large"
label values size size
These commands are slightly confusing because value labels have their own name space, so size as a value label is distinct from size as a variable and you can thus have both with the same name with no conflict. (More on this when I talk about “hidden” variables, below.) I didn’t need to name them the same, and don’t assume that naming them the same created any kind of association between them: the label values command created the association. The reason that value labels are separately defined is that they can be used across multiple variables:
label define sml 1 "Small" 2 "Medium" 3 "Large"
label values size weight sml
where the variables size and weight both have 1, 2, and 3 that stand for small, medium, and large, respectively. The value label is displayed instead of the number in most output where there’s room. If you want to use a value label, say in an if, Stata doesn’t automatically convert from a string as R does; you have to specify the value label name you’re using:
regress cost size#c.height if weight=="Medium":sml
regress cost size#c.height if weight==2
These two commands are equivalent in this example.
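You can try the same idea on data that ships with Stata. In the auto dataset, the variable foreign is numeric (0 or 1) and carries a value label named origin, so the sketch below uses those names rather than the hypothetical size and weight variables above.

```stata
sysuse auto, clear

* the variable is numeric; the value label lives under its own name
describe foreign
label list origin

* the label is what you see in output
tabulate foreign

* to test against label text, name the value label after the string
count if foreign == "Foreign":origin

* the same test written with the underlying number
count if foreign == 1
```

Both count commands select the same observations, since "Foreign":origin is simply looked up in the origin label and replaced by its number.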
I’ve emphasized here the first half of the story: how you have numeric variables display with names. The other half of the story — how you treat these variables as categorical, which turns them into a virtual set of indicator variables — is discussed below, under Regression Formulas.
Subtle distinctions that matter
In R, you can type any code at the prompt and R will execute it and display the result as if you’d wrapped it all in a print() call. Stata instead distinguishes between commands and functions. Say you fire up Stata and do help date to get help on the date function. Then you decide to try it, so you type
date("2011-01-02", "YMD")
which results in an error that date is not a recognized command name. This is because it’s a function, not a command, and in order to try this out you need to use the display command:
display date("2011-01-02", "YMD")
If you’re used to putting a space between a function name and its opening parenthesis (it’s not a popular thing to do, but I do it), you would also get an odd result because
date("2011-01-02", "YMD")
is the function date with arguments, while
date ("2011-01-02", "YMD")
is the variable date followed by a grouping of some sort. If the variable date exists, you’ll see its value followed by what’s in parentheses, and if date doesn’t exist, you’ll get an error. (Stata uses parentheses to delimit multiple equations in regressions that accept multiple equations, and in other areas.)
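A short sketch of the command-versus-function split: functions only appear inside expressions, and an expression always needs a command (display, generate, an if, and so on) to host it. The variable name d is just a name I chose here.

```stata
* display evaluates an expression and prints the result;
* date() returns a number of days, so the raw result is an integer
display date("2011-01-02", "YMD")

* a display format turns that number into a readable date
display %td date("2011-01-02", "YMD")

* the same function used inside generate to build a variable
clear
set obs 1
generate d = date("2011-01-02", "YMD")
format d %td
list
```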
All that other data
For basic work where you input some data, invoke an estimation command, do some post-estimation checks, and graph a few things, the dataset’s about all you need. You might suspect that there must be other variables behind the scenes, to keep track of things like the coefficients of your regression, and you’d be correct. In fact, there are a whole slew of behind-the-scenes variables that you won’t see in the Data Browser but that you can access: labels, constraints, s-class variables, e-class variables, r-class variables, c-class variables, macros, scalars, Stata matrices, and Mata variables, to name a few.
You can see them by typing, for example, scalar dir, or ereturn list. The first of these that you’ll probably use are value labels, as described above.
The second kind of hidden variables that you might indirectly use are the e-class (estimation result class) variables that are returned from an estimation command, such as regress. Each estimation command overwrites these variables, so if you want to compare the results of two estimations, you need to save them:
regress cost size#c.height
estimates store cost1
regress cost size##c.height
estimates store cost2
The estimates store commands save the estimates in memory, and you could also use estimates save to save them to disk. The ## operator is the equivalent of R’s * formula operator. You could use estimates dir to see a list of estimates you’ve stored, along with the command that created each, the dependent variable, and the number of parameters.
See help matrix, etc., for more details about this world of “hidden” variables. While the return variables are not saved in a .dta file, others, such as labels, are.
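Here is the store-and-compare workflow on the auto dataset, as a sketch. The names m1 and m2 are arbitrary names for the stored estimates.

```stata
sysuse auto, clear

* first model: its results sit in e() until the next estimation
regress price weight
ereturn list
estimates store m1

* second model overwrites e(), so store it as well
regress price weight mpg
estimates store m2

* see what is stored, then compare the two side by side
estimates dir
estimates table m1 m2

* bring the first model back into e() for postestimation commands
estimates restore m1
display e(N)
```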
The other big class of “hidden” things are macros. Macros are a lot like shell script variables: they contain strings and when used they substitute their contents where they’re referenced. For example:
local d = c(current_date)
display "This was typed on `d'"
which accesses a c-class (essentially a system class) variable current_date, results in:
This was typed on 3 Aug 2013
A macro could obviously be used in a more substantive way in a command, or could even include commands. You can find out more with help macro.
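A slightly more substantive use, sketched on the auto dataset: a local macro holding a list of variable names that is substituted into a command and then looped over. The macro name controls is my own choice.

```stata
sysuse auto, clear

* the macro holds plain text: here, a list of variable names
local controls weight mpg

* the text is pasted in wherever the macro is referenced
regress price `controls'

* loop over the words stored in the macro
foreach v of local controls {
    summarize `v'
}
```

Local macros vanish when the do-file or program that defined them ends, so run these lines together rather than one at a time from a do-file editor.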
Regression Formulas
As I’ve mentioned, the Stata equivalent of : in a regression formula is #, and the Stata equivalent of * is ##, so if you had an R regression where size is a factor:
lm (cost ~ size:height)
in Stata this would be:
regress cost size#c.height
I briefly mentioned that the i. prefix treats the variable as a virtual set of indicator variables, one for each value of the variable. By default, the # and ## operators treat both of their arguments as if they were prefixed by i., so this regression is actually:
regress cost i.size#c.height
The c. prefix prevents height from being treated as categorical. In fact, it’s actually equivalent to
regress cost ib(first).size#c.height
which says that the base level of the virtual indicator variables is the smallest value of size.
In addition to the i. and c. prefixes, there is also an o. prefix which is used to omit a variable or indicator. You can explicitly specify the base level of a factor with ib2.weight to indicate weight as virtual indicator variables with 2 as the base level. It gets pretty elaborate, with options like ib(last). choosing the last value as the base, and ib(freq). choosing the most frequent value as the base.
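These prefixes are easy to experiment with on the auto dataset, where rep78 (repair record, 1 to 5) and foreign (0 or 1) can stand in for the hypothetical size variable. A sketch:

```stata
sysuse auto, clear

* i. expands rep78 into indicators, with the smallest level as the base
regress price i.rep78 weight

* the same model with level 3 as the base
regress price ib3.rep78 weight

* most frequent level as the base
regress price ib(freq).rep78 weight

* # alone gives only the interaction terms
regress price i.foreign#c.weight

* ## gives main effects plus the interaction, like * in an R formula
regress price i.foreign##c.weight
```

Changing the base level only changes which category the coefficients are measured against; the fitted model is the same.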
With time series, you can also use prefixes to specify differences (D.), leads (F.), lags (L.), and seasonal differences (S.). You can use ranges with any prefixes, including time series, so you could say L(0/2).height to specify lags 0 (i.e. no lag), 1, and 2 of height. See help fvvarlist and help tsvarlist for details.
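The time-series prefixes only work once Stata knows which variable is the time index, which is what tsset does. Here is a sketch on a tiny made-up series; the variables t, x, y, lag1, and dy are invented for the example.

```stata
* made-up data: ten periods, y is t squared, x cycles through 0..4
clear
set obs 10
generate t = _n
generate y = t^2
generate x = mod(7 * t, 5)

* declare the time variable so that L., F., D. and S. know the ordering
tsset t

* lag and first difference; the first observation has no predecessor,
* so both are missing there
generate lag1 = L.y
generate dy = D.y
list t y lag1 dy

* a range of lags in a regression: x, L.x and L2.x
regress y L(0/2).x
```

The regression uses eight observations, because the two-period lag is missing for the first two periods.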
Stata’s . is not your daddy’s NA
At first glance, Stata’s . is the same as R’s NA: a missing value. Don’t be fooled: Stata’s undefined value offers more than NA, but also requires more. It’s true that both undefined values propagate, so that a calculation with . results in a . result. But you may also find out that . actually has a (special) value if you try something like:
list cost weight height if weight > 100
It turns out that . is actually something like +INF, so the if includes undefined values of weight. In order to do what you think it’s doing, you should do:
list cost weight height if weight > 100 & weight < .
While . is the default missing value, Stata also offers .a through .z, so you can distinguish various types of missing, without having to add auxiliary variables. Stata also has several commands dedicated to missing values.
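To see the behaviour for yourself, here is a sketch on a five-row made-up dataset. The variable weight and its values are invented for the example.

```stata
* made-up data: weights 50, 100, 150, 200, and one missing value
clear
set obs 5
generate weight = 50 * _n
replace weight = . in 5

* missing counts as larger than any number, so this counts 3 rows
count if weight > 100

* excluding missing values explicitly gives the 2 rows you meant
count if weight > 100 & weight < .

* extended missing values are larger than ., so weight < . drops them too
replace weight = .a in 4
count if weight > 100 & weight < .

* missing() is true for . and for .a through .z
count if missing(weight)
```

Writing the test as weight < . rather than weight != . matters once extended missing values are in play, because .a is not equal to . but is still missing.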
Hope that helps
I hope these thoughts are helpful. Stata’s an amazing program. I’m not abandoning R any time soon, but I have found that for some tasks I have actually come to prefer Stata over R. R has the power and flexibility of being a full-blown programming language and the breadth of user base that makes it extraordinarily useful and the statistical go-to tool, but Stata has a certain elegance and focus as well.