
summarizing a data frame i.e. count -> group by

5 messages · Giovanni Azua, David Winsemius, Tyler Rinker +2 more

#
Hello,

This is one problem at a time :)

I have a data frame df that looks like this:

  time partitioning_mode workload runtime
1     1          sharding    query     607
2     1          sharding    query      85
3     1          sharding    query      52
4     1          sharding    query      79
5     1          sharding    query      77
6     1          sharding    query      67
7     1          sharding    query      98
8     1          sharding  refresh    2932
9     1          sharding  refresh    2870
10    1          sharding  refresh    2877
11    1          sharding  refresh    2868
12    1       replication    query    2891
13    1       replication    query    2907
14    1       replication    query    2922
15    1       replication    query    2937

and if I could use SQL ... omg! I really wish I could! I would do exactly this:

insert into throughput
  select time, partitioning_mode, count(*)
  from data.frame 
  group by time, partitioning_mode

My attempted R versions are wrong and produce a very cryptic error message:
Error in `[.default`(df2, u_id, , drop = FALSE) : 
  incorrect number of dimensions
I can't comprehend what comes out of this one ... :(

and I thought C++ template errors were the most cryptic ;P

Many many thanks in advance,
Best regards,
Giovanni
#
On Oct 23, 2011, at 1:29 PM, Giovanni Azua wrote:

> df <- read.table(textConnection(" time partitioning_mode workload runtime
+ 1     1          sharding    query     607
+ 2     1          sharding    query      85
+ 3     1          sharding    query      52
+ 4     1          sharding    query      79
+ 5     1          sharding    query      77
+ 6     1          sharding    query      67
+ 7     1          sharding    query      98
+ 8     1          sharding  refresh    2932
+ 9     1          sharding  refresh    2870
+ 10    1          sharding  refresh    2877
+ 11    1          sharding  refresh    2868
+ 12    1       replication    query    2891
+ 13    1       replication    query    2907
+ 14    1       replication    query    2922
+ 15    1       replication    query    2937"))
>
> df$throughput <- ave(df$time, list(df$time, df$partitioning_mode), FUN = length)
> df
    time partitioning_mode workload runtime throughput
1     1          sharding    query     607         11
2     1          sharding    query      85         11
3     1          sharding    query      52         11
4     1          sharding    query      79         11
5     1          sharding    query      77         11
6     1          sharding    query      67         11
7     1          sharding    query      98         11
8     1          sharding  refresh    2932         11
9     1          sharding  refresh    2870         11
10    1          sharding  refresh    2877         11
11    1          sharding  refresh    2868         11
12    1       replication    query    2891          4
13    1       replication    query    2907          4
14    1       replication    query    2922          4
15    1       replication    query    2937          4
You can, of course, use package sqldf, which would undoubtedly be good practice for me, but this seemed like a typical situation for using 'ave'. You do need to use the FUN= construction in 'ave' because that argument appears after the triple dots in the argument list.
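A minimal illustration of that point about FUN= (a sketch with toy vectors, not data from the thread):

```r
# args(ave) is function(x, ..., FUN = mean): every unnamed argument after x
# is swallowed by the dots and treated as another grouping variable, so the
# summary function must be passed by name as FUN=
x <- c(1, 1, 2, 2, 2)
g <- c("a", "a", "b", "b", "b")
ave(x, g, FUN = length)  # per-group counts replicated to each row: 2 2 3 3 3
```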
David Winsemius, MD
West Hartford, CT
#
This could be done with aggregate, but I'm unfamiliar with it, so I'll give what I think you want from your message using the 'reshape' package, which you'll have to download. If your problem is large, the data.table package would be much faster.

You haven't really said what you'd like to get as output, so I'm going by what your code suggests you want. There is no count in R; the function is called length (you may want sum, but it does not appear that way). Also, giving the list a bit of what you'd expect as output is often helpful.

Here is the code (one of these three options is what you want, I think):
 
library(reshape)
throughput1 <- cast(df, time ~ partitioning_mode, value = "runtime", length)
throughput2 <- cast(df, partitioning_mode ~ time, value = "runtime", length)
throughput3 <- cast(df, partitioning_mode + workload ~ time, value = "runtime", length)
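For completeness, the aggregate() route mentioned above would look something like this (a sketch in base R, rebuilding a small version of the thread's data frame by hand; the formula interface groups by the right-hand side, and length stands in for SQL's COUNT(*)):

```r
# toy reconstruction of the data frame from the thread
df <- data.frame(
  time = 1,
  partitioning_mode = rep(c("sharding", "replication"), c(11, 4)),
  runtime = c(607, 85, 52, 79, 77, 67, 98, 2932, 2870, 2877, 2868,
              2891, 2907, 2922, 2937)
)
# one row per (time, partitioning_mode) group; length() counts the rows
throughput <- aggregate(runtime ~ time + partitioning_mode, data = df, FUN = length)
names(throughput)[3] <- "n"
throughput
#   time partitioning_mode  n
# 1    1       replication  4
# 2    1          sharding 11
```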

----------------------------------------
#
Another package to consider, especially if your dataframe is large, is
'data.table':
> df <- read.table(textConnection(" time partitioning_mode workload runtime
+ 1     1          sharding    query     607
+ 2     1          sharding    query      85
+ 3     1          sharding    query      52
+ 4     1          sharding    query      79
+ 5     1          sharding    query      77
+ 6     1          sharding    query      67
+ 7     1          sharding    query      98
+ 8     1          sharding  refresh    2932
+ 9     1          sharding  refresh    2870
+ 10    1          sharding  refresh    2877
+ 11    1          sharding  refresh    2868
+ 12    1       replication    query    2891
+ 13    1       replication    query    2907
+ 14    1       replication    query    2922
+ 15    1       replication    query    2937"), as.is = TRUE, header = TRUE)
> require(data.table)
Loading required package: data.table
data.table 1.7.1  For help type: help("data.table")
> dt <- data.table(df)
> dt[
+     , list(workload = workload
+         , runtime = runtime
+         , thruput = length(runtime)
+         )
+     , by = list(time, partitioning_mode)
+     ]
      time partitioning_mode workload runtime thruput
 [1,]    1          sharding    query     607      11
 [2,]    1          sharding    query      85      11
 [3,]    1          sharding    query      52      11
 [4,]    1          sharding    query      79      11
 [5,]    1          sharding    query      77      11
 [6,]    1          sharding    query      67      11
 [7,]    1          sharding    query      98      11
 [8,]    1          sharding  refresh    2932      11
 [9,]    1          sharding  refresh    2870      11
[10,]    1          sharding  refresh    2877      11
[11,]    1          sharding  refresh    2868      11
[12,]    1       replication    query    2891       4
[13,]    1       replication    query    2907       4
[14,]    1       replication    query    2922       4
[15,]    1       replication    query    2937       4
On Sun, Oct 23, 2011 at 1:29 PM, Giovanni Azua <bravegag at gmail.com> wrote:
#
And the plyr version of this would be (using DF as the data frame name)

## transform method, mapping length(runtime) to all observations
## similar to David's results:
library('plyr')
ddply(DF, .(time, partitioning_mode), transform, n = length(runtime))
# or equivalently, the newer and somewhat faster
ddply(DF, .(time, partitioning_mode), mutate, n = length(runtime))

# If you just want the counts, then use

ddply(DF, .(time, partitioning_mode), summarise, n = length(runtime))
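Recent versions of plyr also ship a count() helper that produces the same tally in one call — a sketch, assuming DF as above; the count column it returns is named freq:

```r
library(plyr)
# count() tabulates the combinations of the named columns
count(DF, vars = c("time", "partitioning_mode"))
```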

##---------
# Just for fun, here's the equivalent SQL call using sqldf():

library('sqldf')
sqldf('select time, partitioning_mode, count(*) from DF group by time, partitioning_mode')

# which you can distribute over multiple lines for readability, e.g.

sqldf('select time, partitioning_mode, count(*) as n
      from DF
      group by time, partitioning_mode')

# Result:
  time partitioning_mode  n
1    1       replication  4
2    1          sharding 11

##---------
# To do the same type of summary in data.table (to follow up on Jim
Holtman's post), here's one way:

library(data.table)
dt <- data.table(DF, key = 'time, partitioning_mode')
dt[, list(n = length(runtime)), by = key(dt)]
     time partitioning_mode  n
[1,]    1       replication  4
[2,]    1          sharding 11
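As a side note, newer data.table versions expose the group size directly through the special symbol .N, which avoids having to pick a column for length() — a sketch, assuming DF as above:

```r
library(data.table)
dt <- data.table(DF)
# .N is the number of rows in each by-group
dt[, list(n = .N), by = list(time, partitioning_mode)]
```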


###------
HTH,
Dennis