An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090114/78bb73b1/attachment-0001.pl>
Casting lists to data.frames, analog to SAS
7 messages · Matthew Pettis, David Winsemius, Hadley Wickham +1 more
<Resubmitting: Told I had a problem with a character set -- now
submitting in plain text>
I have a specific question and a general question.
Specific Question: I want to do an analysis on a data frame by 2 or
more class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing). Coming from SAS, I'm used to being able to
take a data set and have the output of the analysis in a dataset for
further manipulation. I have a data set with vote totals, with one
column being the office name being voted on, and the other being the
party of the candidate. My votes are in the column "vc.n". I did the
analysis I want with:
work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
the str() output of work looks like:
str(work)
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
- attr(*, "dimnames")=List of 2
..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT
GOVERNOR" "SECRETARY OF STATE" "STATE AUDITOR" ...
..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
- attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES
= sd62[, c("office.nm", "party.abbr")], FUN = sum)
work is now a list. I'd really like to have work be a data frame with
3 columns: The rows of the first two columns show the office and party
levels being considered, and the third being the sum of the votes for
that level combination. How do I cast this list/output into a data
frame? using 'as.data.frame' doesn't work.
General Question: I assume the answer to the specific question is
dependent on my understanding list objects and accessing their
attributes. Can anyone point me to a good, throrough treatment of
these R topics? Specifically how to read and interpret the output of
the str(), and attributes() function, how to extract the values of the
'by' output object into a data frame, etc.?
Thanks,
Matt
On Jan 14, 2009, at 3:56 PM, Matthew Pettis wrote:
<Resubmitting: Told I had a problem with a character set -- now
submitting in plain text>
I have a specific question and a general question.
Specific Question: I want to do an analysis on a data frame by 2 or
more class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing). Coming from SAS, I'm used to being able to
take a data set and have the output of the analysis in a dataset for
further manipulation. I have a data set with vote totals, with one
column being the office name being voted on, and the other being the
party of the candidate. My votes are in the column "vc.n". I did the
analysis I want with:
work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
the str() output of work looks like:
str(work)
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
- attr(*, "dimnames")=List of 2
..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT
GOVERNOR" "SECRETARY OF STATE" "STATE AUDITOR" ...
..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
- attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES
= sd62[, c("office.nm", "party.abbr")], FUN = sum)
When I use as.data.frame.table on a similarly constructed object I get: > as.data.frame.table(with(warpbreaks, by(breaks, list(wool, tension), sum))) Var1 Var2 Freq 1 A L 401 2 B L 254 3 A M 216 4 B M 259 5 A H 221 6 B H 169
David Winsemius > > > > > work is now a list. I'd really like to have work be a data frame with > 3 columns: The rows of the first two columns show the office and party > levels being considered, and the third being the sum of the votes for > that level combination. How do I cast this list/output into a data > frame? using 'as.data.frame' doesn't work. > > General Question: I assume the answer to the specific question is > dependent on my understanding list objects and accessing their > attributes. Can anyone point me to a good, throrough treatment of > these R topics? Specifically how to read and interpret the output of > the str(), and attributes() function, how to extract the values of the > 'by' output object into a data frame, etc.? > > Thanks, > Matt > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
On Jan 14, 2009, at 4:14 PM, David Winsemius wrote:
On Jan 14, 2009, at 3:56 PM, Matthew Pettis wrote:
<Resubmitting: Told I had a problem with a character set -- now
submitting in plain text>
I have a specific question and a general question.
Specific Question: I want to do an analysis on a data frame by 2 or
more class variables (i.e., use 2 or more columns in a dataframe to
do
statistical classing). Coming from SAS, I'm used to being able to
take a data set and have the output of the analysis in a dataset for
further manipulation. I have a data set with vote totals, with one
column being the office name being voted on, and the other being the
party of the candidate. My votes are in the column "vc.n". I did
the
analysis I want with:
work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
the str() output of work looks like:
str(work)
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
- attr(*, "dimnames")=List of 2
..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT
GOVERNOR" "SECRETARY OF STATE" "STATE AUDITOR" ...
..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
- attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES
= sd62[, c("office.nm", "party.abbr")], FUN = sum)
When I use as.data.frame.table on a similarly constructed object I get:
as.data.frame.table(with(warpbreaks, by(breaks, list(wool,
tension), sum))) Var1 Var2 Freq 1 A L 401 2 B L 254 3 A M 216 4 B M 259 5 A H 221 6 B H 169
A more self-documenting call might be: > as.data.frame.table(with(warpbreaks, by(breaks, list(Wool=wool, Tension=tension), sum)), responseName="Sum" ) Wool Tension Sum 1 A L 401 2 B L 254 3 A M 216 4 B M 259 5 A H 221 6 B H 169
-- David Winsemius
work is now a list. I'd really like to have work be a data frame with 3 columns: The rows of the first two columns show the office and party levels being considered, and the third being the sum of the votes for that level combination. How do I cast this list/output into a data frame? using 'as.data.frame' doesn't work. General Question: I assume the answer to the specific question is dependent on my understanding list objects and accessing their attributes. Can anyone point me to a good, throrough treatment of these R topics? Specifically how to read and interpret the output of the str(), and attributes() function, how to extract the values of the 'by' output object into a data frame, etc.? Thanks, Matt
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
General Question: I assume the answer to the specific question is dependent on my understanding list objects and accessing their attributes. Can anyone point me to a good, throrough treatment of these R topics? Specifically how to read and interpret the output of the str(), and attributes() function, how to extract the values of the 'by' output object into a data frame, etc.?
It's not quite what you're asking for, but you might want to have a look at the plyr package, http://had.co.nz/plyr, as an alternative. Plyr provides a consistent set of methods for manipulating data frames and lists to solve the same general problem that you struggled with in your email - breaking up a complex structure, operating on each piece, and then joining all the results back together. Hadley
on 01/14/2009 02:51 PM Matthew Pettis wrote:
I have a specific question and a general question.
Specific Question: I want to do an analysis on a data frame by 2 or more
class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing). Coming from SAS, I'm used to being able to take a
data set and have the output of the analysis in a dataset for further
manipulation. I have a data set with vote totals, with one column being the
office name being voted on, and the other being the party of the candidate.
My votes are in the column "vc.n". I did the analysis I want with:
work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
the str() output of work looks like:
str(work)
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
- attr(*, "dimnames")=List of 2
..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR"
"SECRETARY OF STATE" "STATE AUDITOR" ...
..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
- attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES =
sd62[, c("office.nm", "party.abbr")], FUN = sum)
work is now a list. I'd really like to have work be a data frame with 3
columns: The rows of the first two columns show the office and party levels
being considered, and the third being the sum of the votes for that level
combination. How do I cast this list/output into a data frame? using
'as.data.frame' doesn't work.
General Question: I assume the answer to the specific question is dependent
on my understanding list objects and accessing their attributes. Can anyone
point me to a good, throrough treatment of these R topics? Specifically how
to read and interpret the output of the str(), and attributes() function,
how to extract the values of the 'by' output object into a data frame, etc.?
Thanks,
Matt
Matt, Welcome to R. The help pages for each function, while they can be intentionally terse, are a good first place to look. Many will include links/references to related sources. "An Introduction to R" is a good general place to start. A more thorough treatment is in the "R Language Definition" manual. There are also a plethora of contributed documents: http://cran.r-project.org/other-docs.html and books on R and using R within specific domains: http://www.r-project.org/doc/bib/R-books.html There are (at least) three ways to generate summary statistics based upon multi-level groupings. These include by(), tapply() and aggregate(). The key difference between the three is the class/structure of the results object and the print (output) method. In the specific case of aggregate(), it must also return a scalar. Thus for example, unlike with by() and tapply(), you cannot use summary(), which returns multiple values. Thus the choice for which approach to take, to an extent, is founded on what you may subsequently do with the data. As an example, using the same set of data (warpbreaks):
str(warpbreaks)
'data.frame': 54 obs. of 3 variables: $ breaks : num 26 30 54 25 70 52 51 26 67 18 ... $ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ... $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ... # Use by()
by(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum) wool: A tension: L [1] 401 ------------------------------------------------------ wool: B tension: L [1] 254 ------------------------------------------------------ wool: A tension: M [1] 216 ------------------------------------------------------ wool: B tension: M [1] 259 ------------------------------------------------------ wool: A tension: H [1] 221 ------------------------------------------------------ wool: B tension: H [1] 169 Note, because the result of using by() is at its core, a matrix/table, you can also do the following, explicitly using the print method for a table:
print.table(by(warpbreaks[, 1],
list(wool = warpbreaks$wool,
tension = warpbreaks$tension), sum))
tension
wool L M H
A 401 216 221
B 254 259 169
which gives you printed output in the same format as tapply() below,
without altering the structure of the result itself.
# tapply() directly gives you a tabular output
tapply(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum)
tension
wool L M H
A 401 216 221
B 254 259 169
Note that the structure of the result from by() and the result from
tapply() are quite similar:
str(by(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum))
by [1:2, 1:3] 401 254 216 259 221 169
- attr(*, "dimnames")=List of 2
..$ wool : chr [1:2] "A" "B"
..$ tension: chr [1:3] "L" "M" "H"
- attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES
= list(wool = warpbreaks$wool, tension = warpbreaks$tension), FUN =
sum)
str(tapply(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum))
num [1:2, 1:3] 401 254 216 259 221 169
- attr(*, "dimnames")=List of 2
..$ wool : chr [1:2] "A" "B"
..$ tension: chr [1:3] "L" "M" "H"
Both are at their core, a 2 x 3 matrix.
The key difference is in the 'class' of the result, which affects
subsequent operations, such as the print method used.
# aggregate() gives you a data frame, with the summary statistic as the
# 'x' column
aggregate(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum)
wool tension x
1 A L 401
2 B L 254
3 A M 216
4 B M 259
5 A H 221
6 B H 169
str(aggregate(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum))
'data.frame': 6 obs. of 3 variables:
$ wool : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
$ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
$ x : num 401 254 216 259 221 169
Thus, bottom line, given your intended application, I would suggest
using aggregate() rather than by().
HTH,
Marc Schwartz
Thank you very much -- this was very helpful for differentiating among the aggregating methods! Matt On Wed, Jan 14, 2009 at 3:42 PM, Marc Schwartz
<marc_schwartz at comcast.net> wrote:
on 01/14/2009 02:51 PM Matthew Pettis wrote:
I have a specific question and a general question.
Specific Question: I want to do an analysis on a data frame by 2 or more
class variables (i.e., use 2 or more columns in a dataframe to do
statistical classing). Coming from SAS, I'm used to being able to take a
data set and have the output of the analysis in a dataset for further
manipulation. I have a data set with vote totals, with one column being the
office name being voted on, and the other being the party of the candidate.
My votes are in the column "vc.n". I did the analysis I want with:
work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
the str() output of work looks like:
str(work)
'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
- attr(*, "dimnames")=List of 2
..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR"
"SECRETARY OF STATE" "STATE AUDITOR" ...
..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
- attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES =
sd62[, c("office.nm", "party.abbr")], FUN = sum)
work is now a list. I'd really like to have work be a data frame with 3
columns: The rows of the first two columns show the office and party levels
being considered, and the third being the sum of the votes for that level
combination. How do I cast this list/output into a data frame? using
'as.data.frame' doesn't work.
General Question: I assume the answer to the specific question is dependent
on my understanding list objects and accessing their attributes. Can anyone
point me to a good, throrough treatment of these R topics? Specifically how
to read and interpret the output of the str(), and attributes() function,
how to extract the values of the 'by' output object into a data frame, etc.?
Thanks,
Matt
Matt, Welcome to R. The help pages for each function, while they can be intentionally terse, are a good first place to look. Many will include links/references to related sources. "An Introduction to R" is a good general place to start. A more thorough treatment is in the "R Language Definition" manual. There are also a plethora of contributed documents: http://cran.r-project.org/other-docs.html and books on R and using R within specific domains: http://www.r-project.org/doc/bib/R-books.html There are (at least) three ways to generate summary statistics based upon multi-level groupings. These include by(), tapply() and aggregate(). The key difference between the three is the class/structure of the results object and the print (output) method. In the specific case of aggregate(), it must also return a scalar. Thus for example, unlike with by() and tapply(), you cannot use summary(), which returns multiple values. Thus the choice for which approach to take, to an extent, is founded on what you may subsequently do with the data. As an example, using the same set of data (warpbreaks):
str(warpbreaks)
'data.frame': 54 obs. of 3 variables: $ breaks : num 26 30 54 25 70 52 51 26 67 18 ... $ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ... $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ... # Use by()
by(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum) wool: A tension: L [1] 401 ------------------------------------------------------ wool: B tension: L [1] 254 ------------------------------------------------------ wool: A tension: M [1] 216 ------------------------------------------------------ wool: B tension: M [1] 259 ------------------------------------------------------ wool: A tension: H [1] 221 ------------------------------------------------------ wool: B tension: H [1] 169 Note, because the result of using by() is at its core, a matrix/table, you can also do the following, explicitly using the print method for a table:
print.table(by(warpbreaks[, 1],
list(wool = warpbreaks$wool,
tension = warpbreaks$tension), sum))
tension
wool L M H
A 401 216 221
B 254 259 169
which gives you printed output in the same format as tapply() below,
without altering the structure of the result itself.
# tapply() directly gives you a tabular output
tapply(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum)
tension
wool L M H
A 401 216 221
B 254 259 169
Note that the structure of the result from by() and the result from
tapply() are quite similar:
str(by(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum))
by [1:2, 1:3] 401 254 216 259 221 169
- attr(*, "dimnames")=List of 2
..$ wool : chr [1:2] "A" "B"
..$ tension: chr [1:3] "L" "M" "H"
- attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES
= list(wool = warpbreaks$wool, tension = warpbreaks$tension), FUN =
sum)
str(tapply(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum))
num [1:2, 1:3] 401 254 216 259 221 169
- attr(*, "dimnames")=List of 2
..$ wool : chr [1:2] "A" "B"
..$ tension: chr [1:3] "L" "M" "H"
Both are at their core, a 2 x 3 matrix.
The key difference is in the 'class' of the result, which affects
subsequent operations, such as the print method used.
# aggregate() gives you a data frame, with the summary statistic as the
# 'x' column
aggregate(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum)
wool tension x
1 A L 401
2 B L 254
3 A M 216
4 B M 259
5 A H 221
6 B H 169
str(aggregate(warpbreaks[, 1],
list(wool = warpbreaks$wool, tension = warpbreaks$tension),
sum))
'data.frame': 6 obs. of 3 variables:
$ wool : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
$ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
$ x : num 401 254 216 259 221 169
Thus, bottom line, given your intended application, I would suggest
using aggregate() rather than by().
HTH,
Marc Schwartz
One of the penalties for refusing to participate in politics is that you end up being governed by your inferiors. -- Plato It is from the wellspring of our despair and the places that we are broken that we come to repair the world. -- Murray Waas