summarizing a data frame i.e. count -> group by
And the plyr version of this would be (using DF as the data frame name)
## transform method, mapping length(runtime) to all observations
## similar to David's results:
library('plyr')
ddply(DF, .(time, partitioning_mode), transform, n = length(runtime))
# or equivalently, the newer and somewhat faster
ddply(DF, .(time, partitioning_mode), mutate, n = length(runtime))
# If you just want the counts, then use
ddply(DF, .(time, partitioning_mode), summarise, n = length(runtime))
##---------
# Just for fun, here's the equivalent SQL call using sqldf():
library('sqldf')
sqldf('select time partitioning_mode count(*) from DF group by time
partitioning_mode')
# which you can distribute over multiple lines for readability, e.g.
sqldf('select time, partitioning_mode, count(*) as n
from DF
group by time, partitioning_mode')
# Result:
time partitioning_mode n
1 1 replication 4
2 1 sharding 11
##---------
# To do the same type of summary in data.table (to follow up on Jim
Holtman's post), here's one way:
library(data.table)
dt <- data.table(DF, key = 'time, partitioning_mode')
dt[, list(n = length(runtime)), by = key(dt)]
time partitioning_mode n
[1,] 1 replication 4
[2,] 1 sharding 11
###------
HTH,
Dennis
On Sun, Oct 23, 2011 at 10:29 AM, Giovanni Azua <bravegag at gmail.com> wrote:
Hello, This is one problem at the time :) I have a data frame df that looks like this: ?time partitioning_mode workload runtime 1 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? 607 2 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?85 3 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?52 4 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?79 5 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?77 6 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?67 7 ? ? 1 ? ? ? ? ?sharding ? ?query ? ? ?98 8 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2932 9 ? ? 1 ? ? ? ? ?sharding ?refresh ? ?2870 10 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2877 11 ? ?1 ? ? ? ? ?sharding ?refresh ? ?2868 12 ? ?1 ? ? ? replication ? ?query ? ?2891 13 ? ?1 ? ? ? replication ? ?query ? ?2907 14 ? ?1 ? ? ? replication ? ?query ? ?2922 15 ? ?1 ? ? ? replication ? ?query ? ?2937 and if I could use SQL ... omg! I really wish I could! I would do exactly this: insert into throughput ?select time, partitioning_mode, count(*) ?from data.frame ?group by time, partitioning_mode My attempted R versions are wrong and produce very cryptic error message:
throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count)
Error in `[.default`(df2, u_id, , drop = FALSE) : ?incorrect number of dimensions
throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count)
Error in `[.default`(df2, u_id, , drop = FALSE) : ?incorrect number of dimensions
throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count)
I cant comprehend what comes out from this one ... :( and I thought C++ template errors were the most cryptic ;P Many many thanks in advance, Best regards, Giovanni
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.