Stata for R users pt 2

In Part 1 of this Stata for R users series, I mentioned many of the strengths of Stata that might be attractive to R users. I forgot several important strengths, which I’ll add and expand in Part 3. The more I work with it, the more I’m impressed with Stata.

In this Part 2, as promised, I’ve listed several thoughts and tips for R users who are new to Stata.


Notes for R Users

The Dataset


Most commands operate on the Dataset, which is like an R data.frame, with numbered rows called “observations” and named columns called “variables”. You can view and edit this dataset in the spreadsheet-like Data Editor or Data Browser views. You can import and export data in a wide variety of formats, though the native format is Stata’s .dta file format, which stores a lot more information than what you see in the Data Browser.

This is perhaps the hardest thing to get your head around as an R user. You do most of your statistical analysis in this central dataset, adding variables, dropping variables, ordering variables, renaming variables, sorting observations, etc. You can save and load different datasets during a single session — for example training data then test data — and you may end up butchering your dataset to accomplish one goal, then reverting back once you’re done.
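One way to handle the butcher-then-revert pattern is with preserve and restore. Here’s a minimal sketch. It assumes the auto example dataset that ships with Stata (loaded with sysuse auto), which is my choice for illustration and not something discussed above.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear

* Take a snapshot of the dataset before cutting it up
preserve

* Temporarily keep only the foreign cars and summarize them
keep if foreign == 1
summarize price

* Bring back the full dataset exactly as it was at preserve
restore
count
```

After restore, count reports the full number of observations again, so the keep was only temporary.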

As you’ll see further down, there are a lot of other variables hiding below the surface, but when you’re not doing serious programming, you’ll live in the dataset, and there are a whole lot of techniques for slicing and dicing it to accomplish your goals. You’ll probably want to master the reshape command, which is used in many more ways than its R counterpart.
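To give a feel for reshape, here’s a small sketch that goes from wide to long and back. The variable names (id, income2011, income2012) and the numbers are made up for illustration.

```stata
* Hypothetical wide data: one row per id, one income column per year
clear
input id income2011 income2012
1 100 110
2 200 220
end

* Wide to long: one row per id-year, with year taken from the suffix
reshape long income, i(id) j(year)
list

* Long back to wide
reshape wide income, i(id) j(year)
list
```

The stub (income) plus the j() variable is what tells Stata how to split or rebuild the column names.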

There’s an elaborate wild-card system for using and modifying variable names, which I won’t describe here. Just remember that if you decide you like to use the dash (e.g. regress y x1-x3) in variable lists, the dash means every variable stored between x1 and x3 in the dataset, so you’ll need to make sure that the variables are ordered as you think they are (the order command rearranges them).

Another thing to note is that almost all commands take an if or an in qualifier, which lets you select observations based on conditional tests of variables or based on row numbers, respectively. So in depends totally on how you’ve sorted the dataset. The in range is specified like 1/5, meaning the first five (1..5), with negative numbers meaning “from the end”, so -5/-1 would mean the last five (you can also write -5/l, with a lowercase L for “last”).
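Here’s a quick sketch of if and in together, again assuming the auto example dataset that ships with Stata. Note that the last five observations are written -5/l (lowercase L) or -5/-1.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear

* in works on row positions, so sort first
sort price

* The five cheapest cars
list make price in 1/5

* The five most expensive cars (l means the last observation)
list make price in -5/l

* if works on values, regardless of sort order
list make price if foreign == 1 & price > 10000
```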

Once you have your input dataset in shape, you add more variables with options from various commands (usually something like a generate(newVarName) option), by combining other variables in straightforward ways with gen, or by using egen, the extended version of gen, to do more complicated generation.
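As a rough illustration of the difference between the two, here’s a sketch using the auto example dataset that ships with Stata. The new variable names (price_k, mean_price) are my own.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear

* generate: a row-by-row expression of existing variables
generate price_k = price / 1000

* egen: functions that need to look across observations,
* here the mean price within each level of foreign
egen mean_price = mean(price), by(foreign)

list make foreign price_k mean_price in 1/5
```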

Help is your friend, and Stata loves the internet

The Stata equivalent of R’s ? (help) is help and its equivalent of ?? (help.search) is search. In Stata 13, if help finds a match, it displays that single entry, but if it does not, it automatically does a search. The search command will search the local Stata help and also internet help, which includes Stata Journal articles and the Stata equivalent of CRAN, SSC.

This is incredibly powerful. For example, if you see a web page that mentions the command ivreg2 and you type help ivreg2 it will find the ivreg2 command on SSC and you can click to read the help page and then click to install the ivreg2 ado file (the Stata file that implements the command), making it immediately available for use. CRAN on steroids. (You can also type ssc hot to see a list of the top 10 most downloaded ado files from SSC.)
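If you prefer typing to clicking, the same thing can be done from the command line. This is a sketch that assumes you have an internet connection and that ivreg2 is still hosted on SSC.

```stata
* Show the SSC description of the package before installing it
ssc describe ivreg2

* Download and install the ado file and its help file
ssc install ivreg2

* Confirm where the installed command lives
which ivreg2

* The ten most downloaded packages on SSC
ssc hot
```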

If you want to find out more about the regress command, type help regress. In the help page, there are hyperlinks, and at the top of the window you’ll see “Dialog”, “See Also”, and “Jump To” menus which are very helpful. Dialog will open the GUI dialog for regress. The first option in See Also will take you to the entry for regress in the PDF version of the help files, which has more details, and you can also go to “regress postestimation”, which tells you about the commands you can use after doing a regression: doing a heteroskedasticity test, calculating AIC/BIC, doing diagnostic plots, etc.

Use help often, and use it with multiple words where appropriate, such as help logistic regression.

Factors

Stata doesn’t have a variable type factor, but it does have the equivalent concept. Say you have a numeric variable, size, which has the values 1, 2, or 3, and you want to use it as a factor. You could use it in an interaction with the continuous variable weight:

regress cost size#c.weight

Where the # is equivalent to R’s :. The # automatically prefixes both of its operands with i., which treats the operand as a set of virtual indicator variables — like a factor — so I explicitly added c. in front of weight to specify that weight is to be treated as continuous. (More on these formulas, below.)

To simplify your life, you can create a value label, and apply it to size:

label define size 1 "Small" 2 "Medium" 3 "Large"
label values size size

These commands are slightly confusing because value labels have their own name space, so size as a value label is distinct from size as a variable and you can thus have both with the same name with no conflict. (More on this when I talk about “hidden” variables, below.) I didn’t need to name them the same, and don’t think that naming them the same created any kind of association between them: the label values command created the association. The reason that value labels are separately defined is that they can be used across multiple variables:

label define sml 1 "Small" 2 "Medium" 3 "Large"
label values size weight sml

where the variables size and weight both have 1, 2, and 3 that stand for small, medium, and large, respectively. The value label is displayed instead of the number in most output where there’s room. If you want to use a value label, say in an if, Stata doesn’t automatically convert from a string as R does. You have to specify the name of the value label you’re using:

regress cost size#c.height if weight=="Medium":sml
regress cost size#c.height if weight==2

are equivalent in this example.
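If you want to try the value-label lookup yourself, here’s a self-contained sketch. The tiny dataset (size, cost) is made up for illustration.

```stata
* Hypothetical data: size is coded 1, 2, 3
clear
input size cost
1 10
2 20
3 35
2 22
end

* Define the value label once, then attach it to the variable
label define sml 1 "Small" 2 "Medium" 3 "Large"
label values size sml

* list shows the labels, but the stored values are still numbers
list

* These two conditions select the same observations
count if size == "Medium":sml
count if size == 2
```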

I’ve emphasized here the first half of the story: how you have numeric variables display with names. The other half of the story — how you treat these variables as categorical, which turns them into a virtual set of indicator variables — is discussed below, under Regression Formulas.

Subtle distinctions that matter

In R, you can type any code at the prompt and R will execute it and display the result as if you’d wrapped it all in a print. Everything’s a variable and variables are treated mostly the same, and on the whole white space doesn’t matter. Stata is different: there are commands that can be typed at the prompt, and there are functions that do not stand alone, and parentheses have multiple uses so white space in front of a parenthesis matters.

Say you fire up Stata and do help date to get help on the date function. Then you decide to try it, so you type

date("2011-01-02", "YMD")

which results in an error that date is not a recognized command name. This is because it’s a function, not a command, and in order to try this out you need to use the display command

display date("2011-01-02", "YMD")

which works.
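What display shows here is a plain number, because Stata stores dates as days counted from 01jan1960. A small sketch of how to see it as a date, using the %td display format:

```stata
* The raw result: days since 01jan1960
display date("2011-01-02", "YMD")

* The same value shown with a daily date format
display %td date("2011-01-02", "YMD")

* mdy() builds the same date from month, day, year
display date("2011-01-02", "YMD") == mdy(1, 2, 2011)
```

The last line displays 1, which is how Stata shows a true comparison.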

If you’re used to putting a space between a function name and its opening parenthesis — it’s not a popular thing to do, but I do it — you would also get an odd result because

date(…

is the function date with arguments, while

date (…

is the variable date followed by a grouping of some sort. If the variable date exists, you’ll see its value followed by what’s in parentheses, and if date doesn’t exist, you’ll get an error. (Stata uses parentheses to delimit multiple equations in regressions that accept multiple equations, and in other areas.)

All that other data

For basic work where you input some data, invoke an estimation command, do some post-estimation checks, and graph a few things, the dataset’s about all you need. You might suspect that there must be other variables behind the scenes, to keep track of things like the coefficients of your regression, and you’d be correct. In fact, there are a whole slew of behind-the-scenes variables that you won’t see in the Data Browser but that you can access: labels, constraints, s-class variables, e-class variables, r-class variables, c-class variables, macros, scalars, Stata matrices, and Mata variables, to name a few.

You can see them by typing, for example, label dir, scalar dir, or ereturn list. The first of these that you’ll probably use are value labels, as described above.

The second kind of hidden variable that you might indirectly use is the e-class (estimation result class) variables that are returned from an estimation command, such as regress. Each estimation command overwrites these variables, so if you want to compare the results of two estimations, you need to save them:

regress cost size#c.height
estimates store cost1
regress cost size##c.height
estimates store cost2

The estimates store commands save the estimates in memory, and you could also use estimates save to save them to disk. The ## operator is the equivalent of R’s * formula operator. You could use estimates dir to see a list of estimates you’ve stored, along with the command that created each, the dependent variable, and the number of parameters.
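Here’s a runnable sketch of the store-then-compare workflow. It swaps in the auto example dataset that ships with Stata (price, foreign, weight) for the cost, size, and height variables used above, and the names m1 and m2 are my own.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear

* Interaction only
regress price i.foreign#c.weight
estimates store m1

* Main effects plus interaction
regress price i.foreign##c.weight
estimates store m2

* What is stored, and a side-by-side comparison
estimates dir
estimates table m1 m2, stats(N r2 aic bic)

* Make m1 the active results again and look at what it returned
estimates restore m1
ereturn list
```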

See help return, help label, help matrix, etc, for more details about this world of “hidden” variables. Labels are saved in a .dta file along with the data, but returned results, scalars, and matrices are not, so they last only for the session unless you save them some other way.

The other big class of “hidden” things are macros. Macros are a lot like shell script variables: they contain strings and when used they substitute their contents where they’re referenced. For example:

local d = c(current_date)
display "This was typed on `d'"

which accesses a c-class (essentially a system class) variable current_date, results in:

This was typed on  3 Aug 2013

A macro could obviously be used in a more substantive way in a command, or could even include commands. You can find out more with help macros.
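As one example of a more substantive use, a local macro can hold a list of variables that you reuse across commands. This sketch assumes the auto example dataset that ships with Stata, and the macro name controls is my own.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear

* Store a variable list as plain text in a local macro
local controls weight length

* The macro is expanded before the command runs,
* so this is the same as: regress price weight length
regress price `controls'

display "Controls used: `controls'"
```

Note that locals only live as long as the do-file or interactive session that defined them, so run these lines together.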

Regression formulas

As I’ve mentioned, the Stata equivalent of : in a regression formula is #, and the Stata equivalent of * is ##, so if you had an R regression where size is a factor:

lm (cost ~ size:height)

in Stata this would be:

regress cost size#c.height

I briefly mentioned that the i. prefix treats the variable as a virtual set of indicator variables, one for each value of the variable. By default, the # and ## operators treat both of their arguments as if they were prefixed by i., so this regression is actually:

regress cost i.size#c.height

where the c. prefix prevents height from being treated as categorical. In fact, it’s actually equivalent to

regress cost ib(first).size#c.height

which says that the base level of the virtual indicator variables is the smallest value of size.

In addition to the i. and c. prefixes, there is also an o. prefix which is used to omit a variable or indicator. You can explicitly specify the base level of a factor with ib2.weight to indicate weight as virtual indicator variables with 2 as the base level. It gets pretty elaborate, with options like ib(last). choosing the last value as the base, and ib(freq). choosing the most frequent value as the base.
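To see the effect of the base level, you can run the same regression with different choices and watch which category drops out of the coefficient table. This sketch assumes the auto example dataset that ships with Stata, where rep78 is a repair record coded 1 through 5.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear

* Default: the smallest value of rep78 is the base level
regress price i.rep78

* Explicitly use level 3 as the base
regress price ib3.rep78

* Use the largest value as the base
regress price ib(last).rep78

* Use the most frequent value as the base
regress price ib(freq).rep78
```

The fit is the same each time. Only the reference category, and so the interpretation of each coefficient, changes.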

With time series, you can also use prefixes to specify differences (D.), leads (F.), lags (L.), and seasonal differences (S.). You can use ranges with any prefixes, including time series, so you could say L(0/2).height to specify lags 0 (i.e. no lag), 1, and 2 of height.
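The time series prefixes only work once you’ve told Stata which variable is the time index, with tsset. Here’s a minimal sketch on made-up data, where t, y, lag_y, and d_y are names I picked for illustration.

```stata
* Hypothetical series: y is t squared for t = 1..10
clear
set obs 10
generate t = _n
generate y = t^2

* Declare t as the time variable
tsset t

* A lag and a first difference, stored as new variables
generate lag_y = L.y
generate d_y = D.y
list t y lag_y d_y in 1/4

* A range of lags used directly in a varlist
summarize L(0/2).y
```

The first observation has no previous value, so lag_y and d_y are missing there.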

See help varlist, help fvvarlist, and help numlist.

Stata’s . is not your daddy’s NA

At first glance, Stata’s . is the same as R’s NA: a missing value. Don’t be fooled: Stata’s undefined value offers more than NA, but also requires more. It’s true that both undefined values propagate, so that a calculation with . results in a . result. But you may also find out that . actually has a (special) value if you try something like:

list cost weight height if weight > 100

It turns out that . sorts above every number, something like +INF, so the if also picks up observations where weight is missing. To get what you probably intended, you should do:

list cost weight height if weight > 100 & weight < .

While . is the default missing value, Stata offers .a through .z, so you can distinguish various types of missing, without having to add auxiliary variables. Stata has several commands dedicated to missing values, including mvencode, mvdecode, and misstable.
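Here’s a tiny sketch that shows the trap on made-up data: four observations of a hypothetical weight variable, two of them missing (one plain . and one extended .a).

```stata
* Hypothetical data with two kinds of missing value
clear
input weight
90
120
.
.a
end

* Counts 3: the 120 plus both missing values
count if weight > 100

* Counts 1: only the 120
count if weight > 100 & weight < .

* Same result, written with the missing() function
count if weight > 100 & !missing(weight)
```

The comparison weight < . works for the extended values too, because .a through .z all sort above the plain missing value.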

Hope that helps

I hope these thoughts are helpful. Stata’s an amazing program. I’m not abandoning R any time soon, but I have found that for some tasks I have actually come to prefer Stata over R. R has the power and flexibility of being a full-blown programming language and the breadth of user base that makes it extraordinarily useful and the statistical go-to tool, but Stata has a certain elegance and focus as well.
