
Equivalent to Stata egen

8 messages · Peter Kraglund Jacobsen, David Winsemius, eric +2 more

#
What are the R equivalents to the Stata command egen?

egen temp = anycount(t0vas t30vas t60vas t120vas t240vas t360vas),
values(0,1,2,3,4,5,6,7,8,9,10)
egen temp2 = rowtotal(t0vas t30vas t60vas t120vas t240vas t360vas)
#
Peter Kraglund Jacobsen <peter <at> kraglundjacobsen.dk> writes:
And people call R documentation cryptic! As far as I can tell the corresponding
function would be ave, but that is only a guess since there really is not much
help regarding egen's purpose from the voluminous Stata documentation.
#
http://www.stata.com/help.cgi?egen -- it creates new variables dealing
with some special relatively non-standard tasks that don't boil down
to a one-line arithmetic expression. For that reason, there will be
no equivalent to -egen- in general, as it has so many functions that
are so different. -rowtotal- is of course just a shorthand for sum(),
except for treatment of missing values (ifelse(is.na(x), 0, x)). But
-anycount- is a moderately complicated double cycle over variables and
list of values (40 lines of underlying Stata code, including parsing
and labeling the resulting variables)... which will probably become a
triple R cycle including the cycle over observations, although the
latter can probably be avoided.
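
A minimal sketch of both ideas in R, assuming a data frame `vas` with made-up values and columns named after three of the Stata variables (the data and the names `vas`, `m`, and `hits` are illustrative, not from the thread). `rowSums()` with `na.rm = TRUE` treats missing values as zero, and the count of matching values needs no explicit loop over observations:

```r
# Hypothetical data: three of the six VAS columns, with some missing values
vas <- data.frame(
  t0vas  = c(0, 12, NA, 5),
  t30vas = c(3, NA, NA, 50),
  t60vas = c(10, 7, 20, NA)
)

# rowtotal analogue: row sums with missing values treated as 0
temp2 <- rowSums(vas, na.rm = TRUE)

# anycount analogue: number of columns per row whose value is in 0..10.
# %in% returns FALSE for NA here, so missing values are never counted.
m    <- as.matrix(vas)
hits <- matrix(m %in% 0:10, nrow = nrow(m))
temp <- rowSums(hits)

print(cbind(vas, temp, temp2))
```

The loop over variables and values is hidden inside `%in%`, and `rowSums()` handles the observations, so no explicit cycle is written out.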

Yes, R documentation looks extremely terse to me as a regular Stata
user. I am used to seeing the concepts explained well, even in the
help files, and certainly more so in the shelved books. As every
option and every part of the syntax gets at least three to five
sentences, and the most common uses are exemplified, I can usually
figure out how to run a particular task relatively quickly. (The data
management tricks, which is what Peter was asking about above, are
probably an exception: you either know them, or you don't. In this
example, I don't know the corresponding R tricks, although I could
probably brute-force the solution if I needed to.) The fraction of
R commands I have personally come across that are comparably well
documented is about a quarter. For the others, it is either
guesswork + CRANning + googling around or "Forget it, I'll just go back
to Stata to do it" after a few futile attempts. Maybe I just don't
know where to look for the good stuff, but it is certainly outside R
as a package plus its documentation.
On 4/15/09, David Winsemius <dwinsemius at comcast.net> wrote:

  
    
#
Terse is OK by me as long as I get told what goes in (allowable data  
types, argument names and effects) and what comes out. What seemed to  
be lacking in that Stata doc for egen was a description of the purpose
or behavior, and I could find no description of the values produced.
Perhaps it is because Stata has an approach that everything is a  
rectangular array? Is everything assumed to create a new column of  
data as in SAS?

At any rate it looked to this casual non-user, reading that document,  
that egen creates a new variable aligned with its argument variables  
by applying various functions within groupings. That is pretty much  
what ave does. "ave" is not restricted to mean as a functional  
argument. As I said it was a guess.
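
To make the guess concrete, here is a small sketch of `ave()` on made-up data (the data frame `d` and its columns are illustrative, not from the thread). The result has one value per input row, and `FUN` can be any function of a vector, not only `mean`:

```r
# Hypothetical data: a score measured within two groups
d <- data.frame(
  grp   = c("a", "a", "b", "b", "b"),
  score = c(1, 3, 10, 20, 30)
)

# Default FUN is mean; the group mean is repeated on every row of the group
d$grp_mean <- ave(d$score, d$grp)

# Any other function works, but FUN must be passed by name because
# unnamed arguments after x are taken as grouping variables
d$grp_max <- ave(d$score, d$grp, FUN = max)
d$grp_n   <- ave(d$score, d$grp, FUN = length)

print(d)
```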

The texts I used to get up to speed in R are several downloaded from  
the Contributed documents (including anything written by Venables),  
V&R MASS v 2, Harrell's RMS, Sarkar's Lattice, Chambers&Hastie SMiS  
and reading a lot of Q&A on this list.
#
Now that we know what egen is, the answers are one-liners in R:

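# Make up some data ( 1000 rows, 3 columns )
vasdat <- matrix ( sample ( 0:100, 3000, replace = TRUE ), ncol = 3 )

# Use apply for each ( MARGIN = 1 means rows, 2 means columns )
# anycount: how many values in each row fall in 0..10, as in the Stata call
anycountresult <- apply ( vasdat, MARGIN = 1, FUN = function ( x ) sum ( x %in% 0:10 ) )
# rowtotal: row sum; na.rm = TRUE treats missing values as 0, as rowtotal does
rowtotalresult <- apply ( vasdat, MARGIN = 1, FUN = sum, na.rm = TRUE )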

# Combine results with original data 
egentyperesults <- cbind ( vasdat, anycountresult , rowtotalresult)

# Display first ten rows of the data
head ( egentyperesults  , 10 )





----- Original message -----
From: "David Winsemius" <dwinsemius at comcast.net>
To: "Stas Kolenikov" <skolenik at gmail.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Date: Thu, 16 Apr 2009 13:39:44 -0400
Subject: Re: [R] Equivalent to Stata egen

#
See, we just have different expectations of what is to be seen in the
help system, and are used to different formats. Yes, Stata thinks of
data as a rectangular array (although it stores it in memory, unlike
SAS). The inputs to -egen-, as well as the values produced, depend on
the particular function -fcn- and are described in subsections on
those individual functions. That is mentioned at the top of the page.
There is a pretty much standard syntax of most Stata commands (command
name followed by variables it is applied to or expression to be
computed followed by if conditions on observations followed by comma
options ), and -egen- more or less satisfies that syntax. A Stata user
equipped with the basic concepts of the assignment command -generate-
(which -egen- is said to extend) and variable lists (-varlist- here
and there in the help file) would be able to make sense of this all.

I would rather translate R's ave() to Stata's -by- expression. Not all
of the -egen- functionality can be implemented via ave().

Looks like terseness is a prerequisite to doing anything in R though.
If I am telling you I am a newbie, the book abbreviations, although
standard to everybody on this list, may not mean much to me. I could
figure out "Regression Modeling Strategies" (although I was not
thinking about it as a book on R -- I probably did not read it far
enough :) ), and V&R is Venables & Ripley. Right?
On 4/16/09, David Winsemius <dwinsemius at comcast.net> wrote:

  
    
#
On Apr 16, 2009, at 3:58 PM, Stas Kolenikov wrote:

            
R has a by function which is a convenience wrapper for tapply. It will
not necessarily produce an object with the same number of rows as the
input, and producing one is what I thought egen was doing.
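
The difference in shape shows up in a short sketch (made-up data; the names `d`, `grp`, and `score` are illustrative). `tapply()` and `by()` return one result per group, whereas `ave()` returns one value per input row:

```r
# Hypothetical data: five rows in two groups
d <- data.frame(
  grp   = c("a", "a", "b", "b", "b"),
  score = c(1, 3, 10, 20, 30)
)

# One summary value per group: length 2
per_group <- tapply(d$score, d$grp, sum)

# by() applies a function to each group's rows of the data frame
# and also returns one result per group
by_group <- by(d, d$grp, function(sub) sum(sub$score))

# One value per original row: length 5, aligned with the data
per_row <- ave(d$score, d$grp, FUN = sum)

print(per_group)
print(by_group)
print(per_row)
```

Only `per_row` can be attached to `d` as a new column without further reshaping.
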
Yes, and Chambers and Hastie wrote "Statistical Models in S".

The VR bundle is the way to get the MASS package (and IIRC three  
others).

The documentation and contributed pages are here:
http://cran.r-project.org/manuals.html
http://cran.r-project.org/other-docs.html

Harrell probably does not think of RMS as an R book either.
#
It is a sure thing that different people have different expectations of
the help system. Personally, I think Stata's on-line help system is
too brief, though the manual may be a different story. Perhaps it is
all about habit: how used you are to a system and how much you know
about it.

2009/4/17 Stas Kolenikov <skolenik at gmail.com>: