Casting lists to data.frames, analog to SAS

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090114/78bb73b1/attachment-0001.pl>
<Resubmitting: Told I had a problem with a character set -- now
submitting in plain text>

I have a specific question and a general question.

Specific Question: I want to do an analysis on a data frame by 2 or
more class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing).  Coming from SAS, I'm used to being able to
take a data set and have the output of the analysis in a dataset for
further manipulation.  I have a data set with vote totals, with one
column being the office name being voted on, and the other being the
party of the candidate.  My votes are in the column "vc.n".  I did the
analysis I want with:

work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)

the str() output of work looks like:
str(work)
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
 - attr(*, "dimnames")=List of 2
  ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT
GOVERNOR" "SECRETARY OF STATE" "STATE AUDITOR" ...
  ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
 - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES
= sd62[, c("office.nm",      "party.abbr")], FUN = sum)

work is now a list.  I'd really like to have work be a data frame with
3 columns: The rows of the first two columns show the office and party
levels being considered, and the third being the sum of the votes for
that level combination.  How do I cast this list/output into a data
frame?  using 'as.data.frame' doesn't work.

General Question: I assume the answer to the specific question is
dependent on my understanding list objects and accessing their
attributes.  Can anyone point me to a good, throrough treatment of
these R topics?  Specifically how to read and interpret the output of
the str(), and attributes() function, how to extract the values of the
'by' output object into a data frame, etc.?

Thanks,
Matt

<Resubmitting: Told I had a problem with a character set -- now
submitting in plain text>

I have a specific question and a general question.

Specific Question: I want to do an analysis on a data frame by 2 or
more class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing).  Coming from SAS, I'm used to being able to
take a data set and have the output of the analysis in a dataset for
further manipulation.  I have a data set with vote totals, with one
column being the office name being voted on, and the other being the
party of the candidate.  My votes are in the column "vc.n".  I did the
analysis I want with:

work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)

the str() output of work looks like:

str(work)
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
- attr(*, "dimnames")=List of 2
 ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT
GOVERNOR" "SECRETARY OF STATE" "STATE AUDITOR" ...
 ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
- attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES
= sd62[, c("office.nm",      "party.abbr")], FUN = sum)
When I use as.data.frame.table on a similarly constructed object I get:

 > as.data.frame.table(with(warpbreaks, by(breaks, list(wool,  
tension), sum)))
   Var1 Var2 Freq
1    A    L  401
2    B    L  254
3    A    M  216
4    B    M  259
5    A    H  221
6    B    H  169

David Winsemius
>
>
>
>
> work is now a list.  I'd really like to have work be a data frame with
> 3 columns: The rows of the first two columns show the office and party
> levels being considered, and the third being the sum of the votes for
> that level combination.  How do I cast this list/output into a data
> frame?  using 'as.data.frame' doesn't work.
>
> General Question: I assume the answer to the specific question is
> dependent on my understanding list objects and accessing their
> attributes.  Can anyone point me to a good, throrough treatment of
> these R topics?  Specifically how to read and interpret the output of
> the str(), and attributes() function, how to extract the values of the
> 'by' output object into a data frame, etc.?
>
> Thanks,
> Matt
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

On Jan 14, 2009, at 3:56 PM, Matthew Pettis wrote:

<Resubmitting: Told I had a problem with a character set -- now
submitting in plain text>

I have a specific question and a general question.

Specific Question: I want to do an analysis on a data frame by 2 or
more class variables (i.e., use 2 or more columns in a dataframe to  
do
statistical classing).  Coming from SAS, I'm used to being able to
take a data set and have the output of the analysis in a dataset for
further manipulation.  I have a data set with vote totals, with one
column being the office name being voted on, and the other being the
party of the candidate.  My votes are in the column "vc.n".  I did  
the
analysis I want with:

work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)

the str() output of work looks like:

str(work)
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
- attr(*, "dimnames")=List of 2
..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT
GOVERNOR" "SECRETARY OF STATE" "STATE AUDITOR" ...
..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
- attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES
= sd62[, c("office.nm",      "party.abbr")], FUN = sum)
When I use as.data.frame.table on a similarly constructed object I  
get:

as.data.frame.table(with(warpbreaks, by(breaks, list(wool,  
tension), sum)))
 Var1 Var2 Freq
1    A    L  401
2    B    L  254
3    A    M  216
4    B    M  259
5    A    H  221
6    B    H  169
A more self-documenting call might be:

 > as.data.frame.table(with(warpbreaks, by(breaks, list(Wool=wool,  
Tension=tension), sum)), responseName="Sum" )
   Wool Tension Sum
1    A       L 401
2    B       L 254
3    A       M 216
4    B       M 259
5    A       H 221
6    B       H 169

-- 
David Winsemius

work is now a list.  I'd really like to have work be a data frame  
with
3 columns: The rows of the first two columns show the office and  
party
levels being considered, and the third being the sum of the votes for
that level combination.  How do I cast this list/output into a data
frame?  using 'as.data.frame' doesn't work.

General Question: I assume the answer to the specific question is
dependent on my understanding list objects and accessing their
attributes.  Can anyone point me to a good, throrough treatment of
these R topics?  Specifically how to read and interpret the output of
the str(), and attributes() function, how to extract the values of  
the
'by' output object into a data frame, etc.?

Thanks,
Matt

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
General Question: I assume the answer to the specific question is dependent
on my understanding list objects and accessing their attributes.  Can anyone
point me to a good, throrough treatment of these R topics?  Specifically how
to read and interpret the output of the str(), and attributes() function,
how to extract the values of the 'by' output object into a data frame, etc.?
It's not quite what you're asking for, but you might want to have a
look at the plyr package, http://had.co.nz/plyr, as an alternative.
Plyr provides a consistent set of methods for manipulating data frames
and lists to solve the same general problem that you struggled with in
your email - breaking up a complex structure, operating on each piece,
and then joining all the results back together.

Hadley
http://had.co.nz/
I have a specific question and a general question.

Specific Question: I want to do an analysis on a data frame by 2 or more
class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing).  Coming from SAS, I'm used to being able to take a
data set and have the output of the analysis in a dataset for further
manipulation.  I have a data set with vote totals, with one column being the
office name being voted on, and the other being the party of the candidate.
My votes are in the column "vc.n".  I did the analysis I want with:

work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)

the str() output of work looks like:

str(work)
 'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
 - attr(*, "dimnames")=List of 2
  ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR"
"SECRETARY OF STATE" "STATE AUDITOR" ...
  ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
 - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES =
sd62[, c("office.nm",      "party.abbr")], FUN = sum)

work is now a list.  I'd really like to have work be a data frame with 3
columns: The rows of the first two columns show the office and party levels
being considered, and the third being the sum of the votes for that level
combination.  How do I cast this list/output into a data frame?  using
'as.data.frame' doesn't work.

General Question: I assume the answer to the specific question is dependent
on my understanding list objects and accessing their attributes.  Can anyone
point me to a good, throrough treatment of these R topics?  Specifically how
to read and interpret the output of the str(), and attributes() function,
how to extract the values of the 'by' output object into a data frame, etc.?

Thanks,
Matt
Matt,

Welcome to R.

The help pages for each function, while they can be intentionally terse,
are a good first place to look. Many will include links/references to
related sources.

"An Introduction to R" is a good general place to start. A more thorough
treatment is in the "R Language Definition" manual. There are also a
plethora of contributed documents:

  http://cran.r-project.org/other-docs.html

and books on R and using R within specific domains:

  http://www.r-project.org/doc/bib/R-books.html

There are (at least) three ways to generate summary statistics based
upon multi-level groupings. These include by(), tapply() and aggregate().

The key difference between the three is the class/structure of the
results object and the print (output) method. In the specific case of
aggregate(), it must also return a scalar. Thus for example, unlike with
by() and tapply(), you cannot use summary(), which returns multiple values.

Thus the choice for which approach to take, to an extent, is founded on
what you may subsequently do with the data.

As an example, using the same set of data (warpbreaks):
str(warpbreaks)
'data.frame':	54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...

# Use by()
by(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
wool: A
tension: L
[1] 401
------------------------------------------------------
wool: B
tension: L
[1] 254
------------------------------------------------------
wool: A
tension: M
[1] 216
------------------------------------------------------
wool: B
tension: M
[1] 259
------------------------------------------------------
wool: A
tension: H
[1] 221
------------------------------------------------------
wool: B
tension: H
[1] 169

Note, because the result of using by() is at its core, a matrix/table,
you can also do the following, explicitly using the print method for a
table:
print.table(by(warpbreaks[, 1],
list(wool = warpbreaks$wool,
                   tension = warpbreaks$tension), sum))
    tension
wool   L   M   H
   A 401 216 221
   B 254 259 169

which gives you printed output in the same format as tapply() below,
without altering the structure of the result itself.

# tapply() directly gives you a tabular output
tapply(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
         sum)
    tension
wool   L   M   H
   A 401 216 221
   B 254 259 169

Note that the structure of the result from by() and the result from
tapply() are quite similar:
str(by(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
 by [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
  ..$ wool   : chr [1:2] "A" "B"
  ..$ tension: chr [1:3] "L" "M" "H"
 - attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES
= list(wool = warpbreaks$wool,      tension = warpbreaks$tension), FUN =
sum)
str(tapply(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
 num [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
  ..$ wool   : chr [1:2] "A" "B"
  ..$ tension: chr [1:3] "L" "M" "H"

Both are at their core, a 2 x 3 matrix.

The key difference is in the 'class' of the result, which affects
subsequent operations, such as the print method used.

# aggregate() gives you a data frame, with the summary statistic as the
# 'x' column
aggregate(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
           sum)
  wool tension   x
1    A       L 401
2    B       L 254
3    A       M 216
4    B       M 259
5    A       H 221
6    B       H 169
str(aggregate(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
'data.frame':	6 obs. of  3 variables:
 $ wool   : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
 $ x      : num  401 254 216 259 221 169

Thus, bottom line, given your intended application, I would suggest
using aggregate() rather than by().

HTH,

Marc Schwartz
Thank you very much -- this was very helpful for differentiating among
the aggregating methods!

Matt

On Wed, Jan 14, 2009 at 3:42 PM, Marc Schwartz
on 01/14/2009 02:51 PM Matthew Pettis wrote:
I have a specific question and a general question.

Specific Question: I want to do an analysis on a data frame by 2 or more
class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing).  Coming from SAS, I'm used to being able to take a
data set and have the output of the analysis in a dataset for further
manipulation.  I have a data set with vote totals, with one column being the
office name being voted on, and the other being the party of the candidate.
My votes are in the column "vc.n".  I did the analysis I want with:

work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)

the str() output of work looks like:

str(work)
 'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
 - attr(*, "dimnames")=List of 2
  ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR"
"SECRETARY OF STATE" "STATE AUDITOR" ...
  ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
 - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES =
sd62[, c("office.nm",      "party.abbr")], FUN = sum)

work is now a list.  I'd really like to have work be a data frame with 3
columns: The rows of the first two columns show the office and party levels
being considered, and the third being the sum of the votes for that level
combination.  How do I cast this list/output into a data frame?  using
'as.data.frame' doesn't work.

General Question: I assume the answer to the specific question is dependent
on my understanding list objects and accessing their attributes.  Can anyone
point me to a good, throrough treatment of these R topics?  Specifically how
to read and interpret the output of the str(), and attributes() function,
how to extract the values of the 'by' output object into a data frame, etc.?

Thanks,
Matt
Matt,

Welcome to R.

The help pages for each function, while they can be intentionally terse,
are a good first place to look. Many will include links/references to
related sources.

"An Introduction to R" is a good general place to start. A more thorough
treatment is in the "R Language Definition" manual. There are also a
plethora of contributed documents:

 http://cran.r-project.org/other-docs.html

and books on R and using R within specific domains:

 http://www.r-project.org/doc/bib/R-books.html

There are (at least) three ways to generate summary statistics based
upon multi-level groupings. These include by(), tapply() and aggregate().

The key difference between the three is the class/structure of the
results object and the print (output) method. In the specific case of
aggregate(), it must also return a scalar. Thus for example, unlike with
by() and tapply(), you cannot use summary(), which returns multiple values.

Thus the choice for which approach to take, to an extent, is founded on
what you may subsequently do with the data.

As an example, using the same set of data (warpbreaks):

str(warpbreaks)
'data.frame':   54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...

# Use by()

by(warpbreaks[, 1],
    list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
wool: A
tension: L
[1] 401
------------------------------------------------------
wool: B
tension: L
[1] 254
------------------------------------------------------
wool: A
tension: M
[1] 216
------------------------------------------------------
wool: B
tension: M
[1] 259
------------------------------------------------------
wool: A
tension: H
[1] 221
------------------------------------------------------
wool: B
tension: H
[1] 169

Note, because the result of using by() is at its core, a matrix/table,
you can also do the following, explicitly using the print method for a
table:

print.table(by(warpbreaks[, 1],
             list(wool = warpbreaks$wool,
                  tension = warpbreaks$tension), sum))
   tension
wool   L   M   H
  A 401 216 221
  B 254 259 169

which gives you printed output in the same format as tapply() below,
without altering the structure of the result itself.

# tapply() directly gives you a tabular output

tapply(warpbreaks[, 1],
        list(wool = warpbreaks$wool, tension = warpbreaks$tension),
        sum)
   tension
wool   L   M   H
  A 401 216 221
  B 254 259 169

Note that the structure of the result from by() and the result from
tapply() are quite similar:

str(by(warpbreaks[, 1],
     list(wool = warpbreaks$wool, tension = warpbreaks$tension),
     sum))
 by [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
 ..$ wool   : chr [1:2] "A" "B"
 ..$ tension: chr [1:3] "L" "M" "H"
 - attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES
= list(wool = warpbreaks$wool,      tension = warpbreaks$tension), FUN =
sum)

str(tapply(warpbreaks[, 1],
     list(wool = warpbreaks$wool, tension = warpbreaks$tension),
     sum))
 num [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
 ..$ wool   : chr [1:2] "A" "B"
 ..$ tension: chr [1:3] "L" "M" "H"

Both are at their core, a 2 x 3 matrix.

The key difference is in the 'class' of the result, which affects
subsequent operations, such as the print method used.

# aggregate() gives you a data frame, with the summary statistic as the
# 'x' column

aggregate(warpbreaks[, 1],
          list(wool = warpbreaks$wool, tension = warpbreaks$tension),
          sum)
 wool tension   x
1    A       L 401
2    B       L 254
3    A       M 216
4    B       M 259
5    A       H 221
6    B       H 169

str(aggregate(warpbreaks[, 1],
     list(wool = warpbreaks$wool, tension = warpbreaks$tension),
     sum))
'data.frame':   6 obs. of  3 variables:
 $ wool   : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
 $ x      : num  401 254 216 259 221 169

Thus, bottom line, given your intended application, I would suggest
using aggregate() rather than by().

HTH,

Marc Schwartz

One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors.
-- Plato

It is from the wellspring of our despair and the places that we are
broken that we come to repair the world.
-- Murray Waas