An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110207/a6cb3e5f/attachment.pl>
Convert the output of by() to a data frame
4 messages · Al Roark, David Winsemius, Dennis Murphy +1 more
On Feb 7, 2011, at 11:29 PM, Al Roark wrote:
I'd like to summarize several variables in a data frame, for
multiple groups, and store the results in a data.frame. To do so,
I'm using by(). For example:
df<-
data
.frame
(a
=1:10,b=11:20,c=21:30,grp1=c("x","y"),grp2=c("x","y"),grp3=c("x","y"))
dfsum<-by(df[c("a","b","c")], df[c("grp1","grp2","grp3")], range)
The result has a class of "by" and a mode of "list". I'm new to R
and can't find any documentation on this class, and don't see
methods for it associated with the as.data.frame. How should I go
about coercing this to a data frame? Is there a comprehensive
source that I'm might be missing, which can tell me such things?
I looked it up in Spector's "Data Manipulation":
> do.call(rbind, dfsum)
[,1] [,2]
[1,] 1 29
[2,] 2 30
Your test case is a bit pathological, since there are really only two
"combinations" within your three groups of factors, at least only two
combinations with values. Quite frankly, I have never understood how
by() works and do most of my work with tapply and
Hmisc::formula.describe and aggreagate. Many people have switched over
the plyr package.
Cheers [[alternative HTML version deleted]]
David Winsemius, MD West Hartford, CT
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110207/a5eee774/attachment.pl>
There's a much shorter way. You don't need that ugly h() with all those $
and potential for bugs !
Using the original f :
dt[,lapply(.SD,f),by=key(dt)]
grp1 grp2 grp3 a b d
x x x 1.000000 81.000000 161.000000
x x x 10.000000 90.000000 170.000000
x x x 5.500000 85.500000 165.500000
x x x 3.027650 3.027650 3.027650
x x x 1.816590 28.239721 54.662851
x x y 11.000000 91.000000 171.000000
x x y 20.000000 100.000000 180.000000
x x y 15.500000 95.500000 175.500000
x x y 3.027650 3.027650 3.027650
x x y 5.119482 31.542612 57.965742
[ snip ]
To get the names included, one (long) way is :
dt[,data.table(sapply(.SD,f),keep.rownames=TRUE),by=key(dt)]
grp1 grp2 grp3 rn a b d
x x x min 1.000000 81.000000 161.000000
x x x max 10.000000 90.000000 170.000000
x x x mean 5.500000 85.500000 165.500000
x x x sd 3.027650 3.027650 3.027650
x x x cv 1.816590 28.239721 54.662851
x x y min 11.000000 91.000000 171.000000
x x y max 20.000000 100.000000 180.000000
x x y mean 15.500000 95.500000 175.500000
x x y sd 3.027650 3.027650 3.027650
x x y cv 5.119482 31.542612 57.965742
[ snip ]
However, for speed on large datasets you can drop the names in f :
f <- function(x) c(min(x), max(x), mean(x), sd(x), mean(x)/sd(x))
and put the names in afterwards.
ans = dt[,lapply(.SD,f),by=key(dt)]
ans$labels = c("min","max","mean","sd","cv")
ans
grp1 grp2 grp3 a b d labels
x x x 1.000000 81.000000 161.000000 min
x x x 10.000000 90.000000 170.000000 max
x x x 5.500000 85.500000 165.500000 mean
x x x 3.027650 3.027650 3.027650 sd
x x x 1.816590 28.239721 54.662851 cv
x x y 11.000000 91.000000 171.000000 min
x x y 20.000000 100.000000 180.000000 max
x x y 15.500000 95.500000 175.500000 mean
x x y 3.027650 3.027650 3.027650 sd
x x y 5.119482 31.542612 57.965742 cv
[ snip ]
You don't want all those small pieces of memory for the names to be created
over and over again every time f runs. That's only important for large
datasets, though.
Matthew
View this message in context: http://r.789695.n4.nabble.com/Convert-the-output-of-by-to-a-data-frame-tp3268428p3278326.html Sent from the R help mailing list archive at Nabble.com.