binning runtimes
Hi:
On Mon, Oct 24, 2011 at 2:01 AM, Giovanni Azua <bravegag at gmail.com> wrote:
Hello, Suppose I have the dataset shown below. The number of observations is too large to get a nice geom_point plot with a smoother on top. What I would like to do is to bin the data first. The data is indexed by Time (minutes from 1 to 120, i.e. two hours of system benchmarking). Option 1) Group the data by Time, i.e. minute 1, minute 2, etc., and within each group create bins of N consecutive observations and average them into one observation; the bins become the new data points to use for the geom_point plot. How can I do this? With a shingle? How would I do that?
If necessary, create a variable for minute; if Time already represents minutes, you shouldn't need to do anything. To average Runtime by one or more factors, there are many ways to do it: aggregate() in base R, ddply() in plyr, summaryBy() in the doBy package, or the data.table package. For example, with aggregate() [R-2.11.0 or later], you could do (assuming Time is in minutes; otherwise substitute the minute variable instead)

aggregate(Runtime ~ Time + Partitioning, data = dfs, FUN = mean)
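The aggregate() call above averages everything within a minute. For the "bins of N consecutive observations" part of Option 1, one possible sketch in base R is shown here. It assumes the rows are already in the desired order within each minute; the toy data frame `toy`, the bin size `N`, and the column `Bin` are made up for illustration and are not part of the original data.

```r
# Toy data standing in for dfs (values are invented for illustration)
toy <- data.frame(
  Time    = rep(1:2, each = 6),
  Runtime = c(10, 20, 30, 40, 50, 60, 1, 2, 3, 4, 5, 6)
)

N <- 3  # hypothetical bin size: consecutive observations per bin

# Within each minute, number the rows 1, 2, ... and map every N of them
# to one bin index (1, 1, 1, 2, 2, 2, ...)
toy$Bin <- ave(seq_len(nrow(toy)), toy$Time,
               FUN = function(i) (seq_along(i) - 1) %/% N + 1)

# Average Runtime within each (Time, Bin) cell
binned <- aggregate(Runtime ~ Time + Bin, data = toy, FUN = mean)
binned <- binned[order(binned$Time, binned$Bin), ]
binned
```

If the number of rows in a minute is not a multiple of N, the last bin of that minute simply averages fewer observations. On the real data, Partitioning could be added as a second grouping variable in both ave() and the aggregate() formula.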
Option 2) Another option is to again group by Time, i.e. minute 1, minute 2, etc., and within each group draw a random observation to be the representative for the corresponding bin. I could not clearly see how to use Random.
# Example:
# sampfun() samples one row of a data frame at random
sampfun <- function(d) d[sample(seq_len(nrow(d)), 1), ]
library('plyr')
ddply(dfs, .(Time, Partitioning), sampfun)
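If plyr is not installed, roughly the same result can be obtained in base R with split() and lapply(). This is only a sketch on invented data: the data frame `toy` and the seed are assumptions for illustration, and the row names of the result will differ from what ddply() returns.

```r
# Toy data standing in for dfs (values are invented for illustration)
toy <- data.frame(
  Time         = rep(1:3, each = 4),
  Partitioning = "Sharding",
  Runtime      = 101:112
)

# Same helper as in the example: pick one row of a data frame at random
sampfun <- function(d) d[sample(seq_len(nrow(d)), 1), ]

set.seed(1)  # only to make the random draw repeatable

# Split into one data frame per (Time, Partitioning) group, sample one row
# from each group, and bind the sampled rows back together
groups <- split(toy, list(toy$Time, toy$Partitioning), drop = TRUE)
picked <- do.call(rbind, lapply(groups, sampfun))
picked
```

The result has one row per group, so on the real data it would contain one randomly chosen Runtime per minute and partitioning scheme.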
HTH,
Dennis
dfs <- subset(df, Partitioning == "Sharding")
head(dfs)
  Time Partitioning Workload Runtime
1    1     Sharding    Query    3301
2    1     Sharding    Query    3268
3    1     Sharding    Query    2878
4    1     Sharding    Query    2819
5    1     Sharding    Query    3310
6    1     Sharding    Query    3428
str(dfs)
'data.frame':   102384 obs. of  4 variables:
 $ Time        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning: Factor w/ 2 levels "Replication",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Workload    : Factor w/ 2 levels "Query","Refresh": 1 1 1 1 1 1 1 1 1 1 ...
 $ Runtime     : int  3301 3268 2878 2819 3310 3428 2837 2954 2902 2936 ...
Many thanks in advance, Best regards, Giovanni