Skip to content

Casting lists to data.frames, analog to SAS

7 messages · Matthew Pettis, David Winsemius, Hadley Wickham +1 more

#
<Resubmitting: Told I had a problem with a character set -- now
submitting in plain text>

I have a specific question and a general question.

Specific Question: I want to do an analysis on a data frame by 2 or
more class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing).  Coming from SAS, I'm used to being able to
take a data set and have the output of the analysis in a dataset for
further manipulation.  I have a data set with vote totals, with one
column being the office name being voted on, and the other being the
party of the candidate.  My votes are in the column "vc.n".  I did the
analysis I want with:

work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)

the str() output of work looks like:
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
 - attr(*, "dimnames")=List of 2
  ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT
GOVERNOR" "SECRETARY OF STATE" "STATE AUDITOR" ...
  ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
 - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES
= sd62[, c("office.nm",      "party.abbr")], FUN = sum)




work is now a list.  I'd really like to have work be a data frame with
3 columns: The rows of the first two columns show the office and party
levels being considered, and the third being the sum of the votes for
that level combination.  How do I cast this list/output into a data
frame?  using 'as.data.frame' doesn't work.

General Question: I assume the answer to the specific question is
dependent on my understanding list objects and accessing their
attributes.  Can anyone point me to a good, throrough treatment of
these R topics?  Specifically how to read and interpret the output of
the str(), and attributes() function, how to extract the values of the
'by' output object into a data frame, etc.?

Thanks,
Matt
#
On Jan 14, 2009, at 3:56 PM, Matthew Pettis wrote:

            
When I use as.data.frame.table on a similarly constructed object I get:

 > as.data.frame.table(with(warpbreaks, by(breaks, list(wool,  
tension), sum)))
   Var1 Var2 Freq
1    A    L  401
2    B    L  254
3    A    M  216
4    B    M  259
5    A    H  221
6    B    H  169
#
On Jan 14, 2009, at 4:14 PM, David Winsemius wrote:

            
A more self-documenting call might be:

 > as.data.frame.table(with(warpbreaks, by(breaks, list(Wool=wool,  
Tension=tension), sum)), responseName="Sum" )
   Wool Tension Sum
1    A       L 401
2    B       L 254
3    A       M 216
4    B       M 259
5    A       H 221
6    B       H 169
#
It's not quite what you're asking for, but you might want to have a
look at the plyr package, http://had.co.nz/plyr, as an alternative.
Plyr provides a consistent set of methods for manipulating data frames
and lists to solve the same general problem that you struggled with in
your email - breaking up a complex structure, operating on each piece,
and then joining all the results back together.

Hadley
#
on 01/14/2009 02:51 PM Matthew Pettis wrote:
Matt,

Welcome to R.

The help pages for each function, while they can be intentionally terse,
are a good first place to look. Many will include links/references to
related sources.

"An Introduction to R" is a good general place to start. A more thorough
treatment is in the "R Language Definition" manual. There are also a
plethora of contributed documents:

  http://cran.r-project.org/other-docs.html

and books on R and using R within specific domains:

  http://www.r-project.org/doc/bib/R-books.html


There are (at least) three ways to generate summary statistics based
upon multi-level groupings. These include by(), tapply() and aggregate().

The key difference between the three is the class/structure of the
results object and the print (output) method. In the specific case of
aggregate(), it must also return a scalar. Thus for example, unlike with
by() and tapply(), you cannot use summary(), which returns multiple values.

Thus the choice for which approach to take, to an extent, is founded on
what you may subsequently do with the data.

As an example, using the same set of data (warpbreaks):
'data.frame':	54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...


# Use by()
list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
wool: A
tension: L
[1] 401
------------------------------------------------------
wool: B
tension: L
[1] 254
------------------------------------------------------
wool: A
tension: M
[1] 216
------------------------------------------------------
wool: B
tension: M
[1] 259
------------------------------------------------------
wool: A
tension: H
[1] 221
------------------------------------------------------
wool: B
tension: H
[1] 169



Note, because the result of using by() is at its core, a matrix/table,
you can also do the following, explicitly using the print method for a
table:
list(wool = warpbreaks$wool,
                   tension = warpbreaks$tension), sum))
    tension
wool   L   M   H
   A 401 216 221
   B 254 259 169


which gives you printed output in the same format as tapply() below,
without altering the structure of the result itself.


# tapply() directly gives you a tabular output
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
         sum)
    tension
wool   L   M   H
   A 401 216 221
   B 254 259 169



Note that the structure of the result from by() and the result from
tapply() are quite similar:
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
 by [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
  ..$ wool   : chr [1:2] "A" "B"
  ..$ tension: chr [1:3] "L" "M" "H"
 - attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES
= list(wool = warpbreaks$wool,      tension = warpbreaks$tension), FUN =
sum)
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
 num [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
  ..$ wool   : chr [1:2] "A" "B"
  ..$ tension: chr [1:3] "L" "M" "H"


Both are at their core, a 2 x 3 matrix.

The key difference is in the 'class' of the result, which affects
subsequent operations, such as the print method used.



# aggregate() gives you a data frame, with the summary statistic as the
# 'x' column
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
           sum)
  wool tension   x
1    A       L 401
2    B       L 254
3    A       M 216
4    B       M 259
5    A       H 221
6    B       H 169
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
'data.frame':	6 obs. of  3 variables:
 $ wool   : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
 $ x      : num  401 254 216 259 221 169


Thus, bottom line, given your intended application, I would suggest
using aggregate() rather than by().

HTH,

Marc Schwartz
#
Thank you very much -- this was very helpful for differentiating among
the aggregating methods!

Matt

On Wed, Jan 14, 2009 at 3:42 PM, Marc Schwartz
<marc_schwartz at comcast.net> wrote: