Skip to content

Convert the output of by() to a data frame

4 messages · Al Roark, David Winsemius, Dennis Murphy +1 more

#
On Feb 7, 2011, at 11:29 PM, Al Roark wrote:

            
I looked it up in Spector's "Data Manipulation":

 > do.call(rbind, dfsum)
      [,1] [,2]
[1,]    1   29
[2,]    2   30

Your test case is a bit pathological, since there are really only two  
"combinations" within your three groups of factors, at least only two  
combinations with values. Quite frankly, I have never understood how  
by() works and do most of my work with tapply and  
Hmisc::formula.describe and aggreagate. Many people have switched over  
the plyr package.
David Winsemius, MD
West Hartford, CT
#
There's a much shorter way.  You don't need that ugly h() with all those $
and potential for bugs !

Using the original f :

dt[,lapply(.SD,f),by=key(dt)]

   grp1 grp2 grp3         a          b          d
      x    x    x  1.000000  81.000000 161.000000
      x    x    x 10.000000  90.000000 170.000000
      x    x    x  5.500000  85.500000 165.500000
      x    x    x  3.027650   3.027650   3.027650
      x    x    x  1.816590  28.239721  54.662851
      x    x    y 11.000000  91.000000 171.000000
      x    x    y 20.000000 100.000000 180.000000
      x    x    y 15.500000  95.500000 175.500000
      x    x    y  3.027650   3.027650   3.027650
      x    x    y  5.119482  31.542612  57.965742
[ snip ]

To get the names included, one (long) way is :

dt[,data.table(sapply(.SD,f),keep.rownames=TRUE),by=key(dt)]

   grp1 grp2 grp3   rn         a          b          d
      x    x    x  min  1.000000  81.000000 161.000000
      x    x    x  max 10.000000  90.000000 170.000000
      x    x    x mean  5.500000  85.500000 165.500000
      x    x    x   sd  3.027650   3.027650   3.027650
      x    x    x   cv  1.816590  28.239721  54.662851
      x    x    y  min 11.000000  91.000000 171.000000
      x    x    y  max 20.000000 100.000000 180.000000
      x    x    y mean 15.500000  95.500000 175.500000
      x    x    y   sd  3.027650   3.027650   3.027650
      x    x    y   cv  5.119482  31.542612  57.965742
[ snip ]

However, for speed on large datasets you can drop the names in f :

f <- function(x) c(min(x), max(x), mean(x), sd(x), mean(x)/sd(x))

and put the names in afterwards.

ans = dt[,lapply(.SD,f),by=key(dt)]
ans$labels = c("min","max","mean","sd","cv")
ans
   grp1 grp2 grp3         a          b          d labels
      x    x    x  1.000000  81.000000 161.000000    min
      x    x    x 10.000000  90.000000 170.000000    max
      x    x    x  5.500000  85.500000 165.500000   mean
      x    x    x  3.027650   3.027650   3.027650     sd
      x    x    x  1.816590  28.239721  54.662851     cv
      x    x    y 11.000000  91.000000 171.000000    min
      x    x    y 20.000000 100.000000 180.000000    max
      x    x    y 15.500000  95.500000 175.500000   mean
      x    x    y  3.027650   3.027650   3.027650     sd
      x    x    y  5.119482  31.542612  57.965742     cv
[ snip ]

You don't want all those small pieces of memory for the names to be created
over and over again every time f runs. That's only important for large
datasets, though.

Matthew