Dear R-Help:
I want to (a) subset a data.frame by several columns, (b) fit a model
to each subset, and (c) store a vector of results from the fit in the
columns of a data.frame. In the past, I've used "for" loops to do this.
Is there a way to use "by"?
Consider the following example:
> byFits <- by(by.df, list(A=by.df$A, B=by.df$B),
+ function(data.)coef(lm(y~x, data.)))
> byFits
A: A1
B: B1
 (Intercept)             x
3.333333e-01 -1.517960e-16
------------------------------------------------------------
A: A2
B: B1
NULL
------------------------------------------------------------
A: A1
B: B2
NULL
------------------------------------------------------------
A: A2
B: B2
 (Intercept)             x
6.666667e-01  3.282015e-16
#############################
Desired output:
data.frame(A=c("A1","A2"), B=c("B1", "B2"),
.Intercept.=c(1/3, 2/3), x=c(-1.5e-16, 3.3e-16))
What's the simplest way to do this?
Thanks,
Spencer Graves
converting "by" to a data.frame?
7 messages · Thomas Lumley, Don MacQueen, Sundar Dorai-Raj +1 more
On Thu, 5 Jun 2003, Spencer Graves wrote:
> What's the simplest way to do this?
do.call("rbind", byFits)
-thomas
Hi, Thomas, et al.:
Thanks for the reply. Unfortunately, "do.call" strips off the subset
identifiers, which I want to use for further modeling:
> do.call("rbind", byFits)
     (Intercept)              x
[1,]   0.3333333 -1.517960e-016
[2,]   0.6666667  3.282015e-016
The following does what I want using a "for" loop:
> by.df <- data.frame(A=rep(c("A1", "A2"), each=3),
+ B=rep(c("B1", "B2"), each=3), x=1:6, y=rep(0:1, length=6))
> by.lvls <- paste(as.character(by.df$A), as.character(by.df$B), sep=":")
> A.B <- unique(by.lvls)
> Fits <- data.frame(A.B = A.B, .Intercept.=rep(NA, length(A.B)),
+ x=rep(NA, length(A.B)))
> Fits$A <- substring(A.B, 1, regexpr(":", A.B)-1)
> Fits$B <- substring(A.B, regexpr(":", A.B)+1)
> for(i in 1:length(A.B))
+ Fits[i, 2:3] <- coef(lm(y~x, by.df[by.lvls==A.B[i],]))
> Fits
    A.B X.Intercept.             x  A  B
1 A1:B1    0.3333333 -1.517960e-16 A1 B1
2 A2:B2    0.6666667  3.282015e-16 A2 B2
>
I wondered if there was something easier.
Thanks again for your reply.
Spencer Graves
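
One way to stay with "by" and still keep the subset identifiers is to read them from the dimnames of the "by" object, which is a list-mode array with one cell per combination of A and B. The following is a minimal sketch under that assumption, using the toy by.df from the message above; the names grid, cells, keep, and result are illustrative only and do not come from the thread.

```r
by.df <- data.frame(A = rep(c("A1", "A2"), each = 3),
                    B = rep(c("B1", "B2"), each = 3),
                    x = 1:6, y = rep(0:1, length.out = 6))

byFits <- by(by.df, list(A = by.df$A, B = by.df$B),
             function(data.) coef(lm(y ~ x, data = data.)))

# One row of labels per cell, in the same (column-major) order as the cells
grid <- expand.grid(dimnames(byFits), stringsAsFactors = FALSE)

# Empty combinations (A2/B1 and A1/B2 here) are stored as NULL
cells <- unclass(byFits)
keep <- !vapply(cells, is.null, logical(1))

# Bind the labels of the non-empty cells to their coefficient vectors
result <- cbind(grid[keep, , drop = FALSE], do.call("rbind", cells[keep]))
print(result)
```

Because the labels come from the array itself, this needs no pasting and re-splitting of "A1:B1"-style keys.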
Since I don't have your by.df to test with I may not have it exactly
right, but something along these lines should work:
byFits <- lapply(split(by.df,paste(by.df$A,by.df$B)),
                 FUN=function(data.) {
                   tmp <- coef(lm(y~x,data.))
                   data.frame(A=unique(data.$A),
                              B=unique(data.$B),
                              intercept=tmp[1],
                              slope=tmp[2])
                 })
byFitsDF <- do.call('rbind',byFits)
That's assuming I've got all the closing parentheses in the right
places, since my email software (Eudora) doesn't do R syntax checking!
This approach can get rather slow if by.df is big, or when the
computations in FUN are extensive (or both).
If by.df$A has mode character (as opposed to being a factor), then
replacing A=unique(data.$A) with A=I(unique(data.$A)) might improve
performance. You want to avoid character to factor conversions when
using an approach like this.
-Don
At 2:54 PM -0700 6/5/03, Spencer Graves wrote:
> What's the simplest way to do this?
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
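
Don's suggestion can be tried directly on the toy by.df that Spencer posted in his follow-up. The sketch below assumes that data set. The stringsAsFactors = FALSE arguments are an added assumption (it is the default from R 4.0.0 onward) so that A and B stay character, which sidesteps the character-to-factor conversions Don warns about.

```r
by.df <- data.frame(A = rep(c("A1", "A2"), each = 3),
                    B = rep(c("B1", "B2"), each = 3),
                    x = 1:6, y = rep(0:1, length.out = 6),
                    stringsAsFactors = FALSE)

byFits <- lapply(split(by.df, paste(by.df$A, by.df$B)),
                 function(data.) {
                   tmp <- coef(lm(y ~ x, data = data.))
                   # One row per subset: identifiers first, then the coefficients
                   data.frame(A = unique(data.$A),
                              B = unique(data.$B),
                              intercept = tmp[[1]],
                              slope = tmp[[2]],
                              stringsAsFactors = FALSE)
                 })

byFitsDF <- do.call("rbind", byFits)
print(byFitsDF)
```

Only combinations that actually occur in by.df appear in the split, so there are no empty cells to filter out.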
Spencer,
Would "sapply" be better here?
R> by.df <- data.frame(A=rep(c("A1", "A2"), each=3),
R+ B=rep(c("B1", "B2"), each=3),
R+ x=1:6, y=rep(0:1, length=6))
R> t(sapply(split(by.df, do.call("paste", c(by.df[, 1:2], sep = ":"))),
R+ function(x) coef(lm(y ~ x, data = x))))
      (Intercept)             x
A1:B1   0.3333333 -1.517960e-16
A2:B2   0.6666667  3.282015e-16
R>
Sundar
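
With this approach the identifiers survive only as row names of the form "A1:B1". If separate A and B columns are needed, they can be split back out afterwards. This is a sketch only, assuming the labels themselves never contain ":"; the names groups, coefs, ids, and fits are illustrative.

```r
by.df <- data.frame(A = rep(c("A1", "A2"), each = 3),
                    B = rep(c("B1", "B2"), each = 3),
                    x = 1:6, y = rep(0:1, length.out = 6))

groups <- split(by.df, paste(by.df$A, by.df$B, sep = ":"))
coefs <- t(sapply(groups, function(d) coef(lm(y ~ x, data = d))))

# Split each "A1:B1" row name into its two parts, one row per group
ids <- do.call("rbind", strsplit(rownames(coefs), ":", fixed = TRUE))

# check.names = FALSE keeps the "(Intercept)" column name unchanged
fits <- data.frame(A = ids[, 1], B = ids[, 2], coefs,
                   check.names = FALSE, row.names = NULL)
print(fits)
```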
2 days later
Thanks to Thomas Lumley, Sundar Dorai-Raj, and Don MacQueen for their
suggestions. I need the INDICES as part of the output data.frame, which
MacQueen's solution provided. I generalized his method as follows:
by.to.data.frame <-
function(x, INDICES, FUN){
#   Split data.frame x on x[,INDICES]
#   and lapply FUN to each data.frame subset,
#   returning a data.frame
#
#   Internal functions
    get.Index <- function(x, INDICES){
        Ind <- as.character(x[,INDICES[1]])
        k <- length(INDICES)
        if(k > 1)
            Ind <- paste(Ind, get.Index(x, INDICES[-1]), sep=":")
        Ind
    }
    FUN2 <- function(data., INDICES, FUN){
        vec <- FUN(data.)
        Vec <- matrix(vec, nrow=1)
        dimnames(Vec) <- list(NULL, names(vec))
        cbind(data.[1,INDICES], Vec)
    }
#   Combine INDICES
    Ind <- get.Index(x, INDICES)
#   Apply ...:  Do the work.
    Split <- split(x, Ind)
    byFits <- lapply(Split, FUN2, INDICES, FUN)
#   Convert to a data.frame
    do.call('rbind',byFits)
}
Applying this to my toy problem produces the following:
> by.df <- data.frame(A=rep(c("A1", "A2"), each=3),
+ B=rep(c("B1", "B2"), each=3), x=1:6, y=rep(0:1, length=6))
>
> by.to.data.frame(by.df, c("A", "B"), function(data.)coef(lm(y~x, data.)))
       A  B (Intercept)             x
A1:B1 A1 B1   0.3333333 -1.517960e-16
A2:B2 A2 B2   0.6666667  3.282015e-16
Thanks for the assistance. I can now tackle the real problem that
generated this question.
Best Wishes,
Spencer Graves
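
A more compact variant of the same idea is sketched below. It is not Spencer's code: by_to_df is a hypothetical name, split() is handed the index columns directly with drop = TRUE so that empty combinations are skipped, and drop = FALSE in the row subset keeps the identifiers as a data frame even when INDICES names a single column.

```r
by_to_df <- function(x, INDICES, FUN) {
  # split() builds the interaction of the index columns itself
  pieces <- split(x, x[INDICES], drop = TRUE)
  rows <- lapply(pieces, function(data.) {
    vec <- FUN(data.)
    Vec <- matrix(vec, nrow = 1, dimnames = list(NULL, names(vec)))
    # drop = FALSE keeps a one-column identifier as a data frame
    cbind(data.[1, INDICES, drop = FALSE], Vec)
  })
  do.call("rbind", rows)
}

by.df <- data.frame(A = rep(c("A1", "A2"), each = 3),
                    B = rep(c("B1", "B2"), each = 3),
                    x = 1:6, y = rep(0:1, length.out = 6))

fitFun <- function(data.) coef(lm(y ~ x, data = data.))
fitAB <- by_to_df(by.df, c("A", "B"), fitFun)   # two index columns
fitA  <- by_to_df(by.df, "A", fitFun)           # a single index column
print(fitAB)
print(fitA)
```

The row names differ cosmetically from the original because split() joins the labels with "." by default rather than ":".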
1 day later
Hi, Don:
Thanks for your suggestion to use "do.call" in my "get.Index". I
discovered that your version actually produces cosmetically different
answers in R 1.6.2 and S-Plus 6.1 for Windows. Fortunately, in the
context, this difference was unimportant. Since yours is faster, it is
clearly superior.
To check my understanding, I generalized my toy example as follows:
> by.df <- data.frame(A=rep(c("A1", "A2"), each=3),
+ B=rep(c("B1", "B2"), each=3), C=rep(c("C1", "C2"), each=3),
+ x=1:6, y=rep(0:1, length=6))
With this, your "get.Index" produced the following in R 1.6.2:
> get.Index <- function(x, INDICES) do.call('paste',c(x[INDICES],sep=':'))
> get.Index(by.df, c("A", "B", "C"))
[1] "A1:B1:C1" "A1:B1:C1" "A1:B1:C1" "A2:B2:C2" "A2:B2:C2" "A2:B2:C2"
In S-Plus 6.1 for Windows, I got the following:
> get.Index <- function(x, INDICES)
do.call("paste", c(x[INDICES], sep = ":"))
> get.Index(by.df, c("A", "B", "C"))
[1] "1:1:1" "1:1:1" "1:1:1" "2:2:2" "2:2:2" "2:2:2"
Fortunately, this difference is unimportant in this context, as
"by.to.data.frame" produces the same answer in both cases. Moreover,
your answer converts to a single call to "paste", which means that it
should be faster. For someone who understands "do.call", your version
is also easier to read.
Thanks again for your help.
Spencer Graves
######################################
Don MacQueen wrote:
> Glad to hear it was helpful.
>
> You can also use the do.call trick for the paste indices business.
>
> Try
> get.Index <- function(x, INDICES)
>     do.call('paste',c(x[INDICES],sep=':'))
>
> This works because a data frame is actually a list, albeit a special
> kind of list, and do.call() wants a list for its second arg.
>
> -Don
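
Don's point that a data frame is a list can be checked directly: do.call() passes each selected column to paste() as a separate argument. The sketch below also converts each column to character first. That step is an added precaution, not part of Don's one-liner, in case the columns are factors and an implementation pastes their integer codes, as Spencer's S-Plus output suggests; it has not been tested on S-Plus. The name get.Index.chr is hypothetical.

```r
get.Index.chr <- function(x, INDICES) {
  # x[INDICES] is a list of columns; force each one to character labels
  cols <- lapply(x[INDICES], as.character)
  # Equivalent to paste(col1, col2, ..., sep = ":")
  do.call("paste", c(cols, sep = ":"))
}

by.df <- data.frame(A = rep(c("A1", "A2"), each = 3),
                    B = rep(c("B1", "B2"), each = 3),
                    C = rep(c("C1", "C2"), each = 3),
                    x = 1:6, y = rep(0:1, length.out = 6),
                    stringsAsFactors = TRUE)   # factors, as in R before 4.0.0

idx <- get.Index.chr(by.df, c("A", "B", "C"))
print(idx)
```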