I want to calculate a statistic on a number of subgroups of a dataframe,
then put the results into a dataframe. (What SAS PROC MEANS does, I
think, though it's been years since I used it.)
This is possible using by(), but it seems cumbersome and fragile. Is
there a more straightforward way than this?
Here's a simple example showing my current strategy:
> dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))
> dataset
gp1 gp2 value
1 1 1 0.9493232
2 1 1 -0.0474712
3 1 2 -0.6808021
4 1 2 1.9894999
5 2 3 2.0154786
6 2 3 0.4333056
7 2 4 -0.4746228
8 2 4 0.6017522
>
> handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
>
> bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>
> result <- do.call('rbind', bylist)
> result
gp1 gp2 statistic
1 1 1 0.45092598
11 1 2 0.65434890
12 2 3 1.22439210
13 2 4 0.06356469
tapply() is inappropriate because I don't have all possible combinations
of gp1 and gp2 values, only some of them:
> tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
1 2 3 4
1 0.450926 0.6543489 NA NA
2 NA NA 1.224392 0.06356469
In the real case, I only have a very sparse subset of all the
combinations, and tapply() and by() both die for lack of memory.
Any suggestions on how to do what I want, without using SAS?
Duncan Murdoch
by() processing on a dataframe
8 messages · Hadley Wickham, Peter Dalgaard, Gabor Grothendieck +2 more
I'm not entirely sure what you want, but maybe this does the trick?
data.frame.by <- function(data, variables, fun, ...) {
if (length(variables) == 0 ) {
df <- data.frame(results = 0)
df$results <- list(fun(data$value, ...))
return(df)
}
sorted <- sort.df(data, variables)[,c(variables), drop=FALSE]
duplicates <- duplicated(sorted[,variables, drop=FALSE])
index <- cumsum(!duplicates)
results <- by(data, index, fun, ...)
cols <- sorted[!duplicates,variables, drop=FALSE]
cols$results <- array(results)
cols
}
sort.df <- function(data, vars) {
data[do.call("order", data[,vars, drop=FALSE]), ,drop=FALSE]
}
dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))
data.frame.by(dataset, c("gp1", "gp2"), function(data) mean(data$value))
data.frame.by(dataset, "gp1", function(data) tapply(data$value, data$gp2, mean))
data.frame.by(dataset, "gp1", function(data) lm(gp2 ~ value, data)) #
doesn't print, but everything is there ok
(note that the results column will be a list if necessary - this may
be a serious abuse of data frames, but I'm not sure and no one replied
when I queried the list)
Hadley
Duncan Murdoch <murdoch at stats.uwo.ca> writes:
I want to calculate a statistic on a number of subgroups of a dataframe, then put the results into a dataframe. (What SAS PROC MEANS does, I think, though it's been years since I used it.) This is possible using by(), but it seems cumbersome and fragile. Is there a more straightforward way than this? Here's a simple example showing my current strategy:
> dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))
> dataset
gp1 gp2 value 1 1 1 0.9493232 2 1 1 -0.0474712 3 1 2 -0.6808021 4 1 2 1.9894999 5 2 3 2.0154786 6 2 3 0.4333056 7 2 4 -0.4746228 8 2 4 0.6017522
> > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
>
> bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>
> result <- do.call('rbind', bylist)
> result
gp1 gp2 statistic 1 1 1 0.45092598 11 1 2 0.65434890 12 2 3 1.22439210 13 2 4 0.06356469 tapply() is inappropriate because I don't have all possible combinations of gp1 and gp2 values, only some of them:
> tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
1 2 3 4 1 0.450926 0.6543489 NA NA 2 NA NA 1.224392 0.06356469 In the real case, I only have a very sparse subset of all the combinations, and tapply() and by() both die for lack of memory. Any suggestions on how to do what I want, without using SAS?
Have you tried aggregate()? Alternatively, you migth split on interaction(...., drop=TRUE)
O__ ---- Peter Dalgaard ?ster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
Check out summaryBy in the doBy package at: http://genetics.agrsci.dk/~sorenh/misc e.g. summaryBy(value ~ gp1 + gp2, data = dataset)
On 9/30/05, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
I want to calculate a statistic on a number of subgroups of a dataframe, then put the results into a dataframe. (What SAS PROC MEANS does, I think, though it's been years since I used it.) This is possible using by(), but it seems cumbersome and fragile. Is there a more straightforward way than this? Here's a simple example showing my current strategy:
> dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))
> dataset
gp1 gp2 value 1 1 1 0.9493232 2 1 1 -0.0474712 3 1 2 -0.6808021 4 1 2 1.9894999 5 2 3 2.0154786 6 2 3 0.4333056 7 2 4 -0.4746228 8 2 4 0.6017522
> > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
>
> bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>
> result <- do.call('rbind', bylist)
> result
gp1 gp2 statistic 1 1 1 0.45092598 11 1 2 0.65434890 12 2 3 1.22439210 13 2 4 0.06356469 tapply() is inappropriate because I don't have all possible combinations of gp1 and gp2 values, only some of them:
> tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
1 2 3 4 1 0.450926 0.6543489 NA NA 2 NA NA 1.224392 0.06356469 In the real case, I only have a very sparse subset of all the combinations, and tapply() and by() both die for lack of memory. Any suggestions on how to do what I want, without using SAS? Duncan Murdoch
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On Fri, 2005-09-30 at 13:22 -0400, Duncan Murdoch wrote:
I want to calculate a statistic on a number of subgroups of a dataframe, then put the results into a dataframe. (What SAS PROC MEANS does, I think, though it's been years since I used it.) This is possible using by(), but it seems cumbersome and fragile. Is there a more straightforward way than this? Here's a simple example showing my current strategy:
> dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))
> dataset
gp1 gp2 value 1 1 1 0.9493232 2 1 1 -0.0474712 3 1 2 -0.6808021 4 1 2 1.9894999 5 2 3 2.0154786 6 2 3 0.4333056 7 2 4 -0.4746228 8 2 4 0.6017522
> > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
>
> bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>
> result <- do.call('rbind', bylist)
> result
gp1 gp2 statistic 1 1 1 0.45092598 11 1 2 0.65434890 12 2 3 1.22439210 13 2 4 0.06356469 tapply() is inappropriate because I don't have all possible combinations of gp1 and gp2 values, only some of them:
> tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
1 2 3 4 1 0.450926 0.6543489 NA NA 2 NA NA 1.224392 0.06356469 In the real case, I only have a very sparse subset of all the combinations, and tapply() and by() both die for lack of memory. Any suggestions on how to do what I want, without using SAS? Duncan Murdoch
Duncan, Does this do what you want?
set.seed(1)
df <- data.frame(gp1 = rep(1:2, c(4,4)),
gp2 = rep(1:4, c(2,2,2,2)),
value = rnorm(8))
df
gp1 gp2 value 1 1 1 -0.6264538 2 1 1 0.1836433 3 1 2 -0.8356286 4 1 2 1.5952808 5 2 3 0.3295078 6 2 3 -0.8204684 7 2 4 0.4874291 8 2 4 0.7383247
means <- aggregate(df$value, list(gp1 = df$gp1, gp2 = df$gp2), mean)
means
gp1 gp2 x 1 1 1 -0.2214052 2 1 2 0.3798261 3 2 3 -0.2454803 4 2 4 0.6128769
merge(df, means, by = c("gp1", "gp2"))
gp1 gp2 value x 1 1 1 -0.6264538 -0.2214052 2 1 1 0.1836433 -0.2214052 3 1 2 -0.8356286 0.3798261 4 1 2 1.5952808 0.3798261 5 2 3 0.3295078 -0.2454803 6 2 3 -0.8204684 -0.2454803 7 2 4 0.4874291 0.6128769 8 2 4 0.7383247 0.6128769 HTH, Marc Schwartz
On 9/30/2005 1:41 PM, Peter Dalgaard wrote:
Duncan Murdoch <murdoch at stats.uwo.ca> writes:
I want to calculate a statistic on a number of subgroups of a dataframe, then put the results into a dataframe. (What SAS PROC MEANS does, I think, though it's been years since I used it.) This is possible using by(), but it seems cumbersome and fragile. Is there a more straightforward way than this? Here's a simple example showing my current strategy:
> dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))
> dataset
gp1 gp2 value 1 1 1 0.9493232 2 1 1 -0.0474712 3 1 2 -0.6808021 4 1 2 1.9894999 5 2 3 2.0154786 6 2 3 0.4333056 7 2 4 -0.4746228 8 2 4 0.6017522
> > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
>
> bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>
> result <- do.call('rbind', bylist)
> result
gp1 gp2 statistic 1 1 1 0.45092598 11 1 2 0.65434890 12 2 3 1.22439210 13 2 4 0.06356469 tapply() is inappropriate because I don't have all possible combinations of gp1 and gp2 values, only some of them:
> tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
1 2 3 4 1 0.450926 0.6543489 NA NA 2 NA NA 1.224392 0.06356469 In the real case, I only have a very sparse subset of all the combinations, and tapply() and by() both die for lack of memory. Any suggestions on how to do what I want, without using SAS?
Have you tried aggregate()?
aggregate() has a few problems: - it applies the function to every column in the dataframe. In my case it only makes sense to apply it to some of them. (This may not be a killer, but it certainly makes things inefficient and tricky.) - I'd like to look at the whole subset to figure out the function (but I can probably work around this) - It uses too much memory. E.g. try > df <- data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000), w=rnorm(1000)) > aggregate(df, list(df$x,df$y,df$z), mean) Error: cannot allocate vector of size 3906250 Kb In addition: Warning messages: 1: Reached total allocation of 1007Mb: see help(memory.size) 2: Reached total allocation of 1007Mb: see help(memory.size) This should have returned the same dataframe (there are 1000 subsets), but it tried to construct a billion of them.
On 9/30/2005 1:48 PM, Don MacQueen wrote:
> Look at the summarize() function in the Hmisc package. It seems to want a matrix, not a data.frame. The real situation has mixed types (character, factors, numeric) so it can't be a matrix. > (and I this is an r-help question, not an r-devel question, I would think) Yes, that's where I should have posted. Sorry. However, this is starting to look like a development problem... Peter again:
Alternatively, you migth split on interaction(...., drop=TRUE)
Looking at the code, it appears that will construct the full product interaction, then subset to the non-empty cases... Yes, it does that. Looks like I'll have to write my own. Duncan
On 9/30/2005 1:41 PM, hadley wickham wrote:
I'm not entirely sure what you want, but maybe this does the trick?
data.frame.by <- function(data, variables, fun, ...) {
if (length(variables) == 0 ) {
df <- data.frame(results = 0)
df$results <- list(fun(data$value, ...))
return(df)
}
sorted <- sort.df(data, variables)[,c(variables), drop=FALSE]
duplicates <- duplicated(sorted[,variables, drop=FALSE])
index <- cumsum(!duplicates)
results <- by(data, index, fun, ...)
cols <- sorted[!duplicates,variables, drop=FALSE]
cols$results <- array(results)
cols
}
sort.df <- function(data, vars) {
data[do.call("order", data[,vars, drop=FALSE]), ,drop=FALSE]
}
dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))
data.frame.by(dataset, c("gp1", "gp2"), function(data) mean(data$value))
data.frame.by(dataset, "gp1", function(data) tapply(data$value, data$gp2, mean))
data.frame.by(dataset, "gp1", function(data) lm(gp2 ~ value, data)) #
doesn't print, but everything is there ok
(note that the results column will be a list if necessary - this may
be a serious abuse of data frames, but I'm not sure and no one replied
when I queried the list)
I think this should work. Thanks! Duncan Murdoch
And here is one more approach using the reshape package: library(reshape) dataset.d <- melt(dataset, id = 1:2) cast(dataset.d, gp1 + gp2 ~ variable, mean)
On 9/30/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Check out summaryBy in the doBy package at: http://genetics.agrsci.dk/~sorenh/misc e.g. summaryBy(value ~ gp1 + gp2, data = dataset) On 9/30/05, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
I want to calculate a statistic on a number of subgroups of a dataframe, then put the results into a dataframe. (What SAS PROC MEANS does, I think, though it's been years since I used it.) This is possible using by(), but it seems cumbersome and fragile. Is there a more straightforward way than this? Here's a simple example showing my current strategy:
> dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))
> dataset
gp1 gp2 value 1 1 1 0.9493232 2 1 1 -0.0474712 3 1 2 -0.6808021 4 1 2 1.9894999 5 2 3 2.0154786 6 2 3 0.4333056 7 2 4 -0.4746228 8 2 4 0.6017522
> > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
>
> bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>
> result <- do.call('rbind', bylist)
> result
gp1 gp2 statistic 1 1 1 0.45092598 11 1 2 0.65434890 12 2 3 1.22439210 13 2 4 0.06356469 tapply() is inappropriate because I don't have all possible combinations of gp1 and gp2 values, only some of them:
> tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
1 2 3 4 1 0.450926 0.6543489 NA NA 2 NA NA 1.224392 0.06356469 In the real case, I only have a very sparse subset of all the combinations, and tapply() and by() both die for lack of memory. Any suggestions on how to do what I want, without using SAS? Duncan Murdoch
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel