
converting "by" to a data.frame?

7 messages · Thomas Lumley, Don MacQueen, Sundar Dorai-Raj +1 more

#
Dear R-Help:

	  I want to (a) subset a data.frame by several columns, (b) fit a model 
to each subset, and (c) store a vector of results from the fit in the 
columns of a data.frame.  In the past, I've used "for" loops to do this. 
  Is there a way to use "by"?

	  Consider the following example:

 > byFits <- by(by.df, list(A=by.df$A, B=by.df$B),
+  function(data.)coef(lm(y~x, data.)))
 > byFits
A: A1
B: B1
   (Intercept)             x
  3.333333e-01 -1.517960e-16
------------------------------------------------------------
A: A2
B: B1
NULL
------------------------------------------------------------
A: A1
B: B2
NULL
------------------------------------------------------------
A: A2
B: B2
  (Intercept)            x
6.666667e-01 3.282015e-16
 >
 >
#############################
Desired output:

data.frame(A=c("A1","A2"), B=c("B1", "B2"),
	.Intercept.=c(1/3, 2/3), x=c(-1.5e-16, 3.3e-16))

What's the simplest way to do this?
Thanks,
Spencer Graves
#
On Thu, 5 Jun 2003, Spencer Graves wrote:

            
do.call("rbind", byFits)


	-thomas
#
Hi, Thomas, et al.:

Thanks for the reply.  Unfortunately, "do.call" strips off the subset 
identifiers, which I want to use for further modeling:

 > do.call("rbind", byFits)
      (Intercept)              x
[1,]   0.3333333 -1.517960e-016
[2,]   0.6666667  3.282015e-016

The following does what I want using a "for" loop:

 > by.df <- data.frame(A=rep(c("A1", "A2"), each=3),
+  B=rep(c("B1", "B2"), each=3), x=1:6, y=rep(0:1, length=6))
 > by.lvls <- paste(as.character(by.df$A), as.character(by.df$B), sep=":")
 > A.B <- unique(by.lvls)
 > Fits <- data.frame(A.B = A.B, .Intercept.=rep(NA, length(A.B)),
+  x=rep(NA, length(A.B)))
 > Fits$A <- substring(A.B, 1, regexpr(":", A.B)-1)
 > Fits$B <- substring(A.B, regexpr(":", A.B)+1)
 > for(i in 1:length(A.B))
+  Fits[i, 2:3] <- coef(lm(y~x, by.df[by.lvls==A.B[i],]))
 > Fits
     A.B X.Intercept.             x  A  B
1 A1:B1    0.3333333 -1.517960e-16 A1 B1
2 A2:B2    0.6666667  3.282015e-16 A2 B2
 >

	  I wondered if there was something easier.

Thanks again for your reply.
Spencer Graves
#
Since I don't have your by.df to test with I may not have it exactly 
right, but something along these lines should work:

byFits <- lapply(split(by.df,paste(by.df$A,by.df$B)),
                  FUN=function(data.) {
                     tmp <- coef(lm(y~x,data.))
                     data.frame(A=unique(data.$A),
                                B=unique(data.$B),
                                intercept=tmp[1],
                                slope=tmp[2])
                    })

byFitsDF <- do.call('rbind',byFits)

That's assuming I've got all the closing parentheses in the right 
places, since my email software (Eudora) doesn't do R syntax checking!

This approach can get rather slow if by.df is big, or when the 
computations in FUN are extensive (or both).

If by.df$A has mode character (as opposed to being a factor), then 
replacing A=unique(data.$A) with A=I(unique(data.$A)) might improve 
performance. You want to avoid character to factor conversions when 
using an approach like this.

-Don
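
For reference, a minimal sketch of the approach above run against the toy by.df that appears elsewhere in this thread. The data frame is rebuilt here only so the block runs on its own; the column names intercept and slope follow the code above, and nothing here has been checked against a larger data set.

```r
# Toy data from the thread, rebuilt so this block is self-contained.
by.df <- data.frame(A = rep(c("A1", "A2"), each = 3),
                    B = rep(c("B1", "B2"), each = 3),
                    x = 1:6, y = rep(0:1, length.out = 6))

# One single-row data.frame per (A, B) subset. Splitting on the pasted key
# means only combinations that actually occur produce a subset.
byFits <- lapply(split(by.df, paste(by.df$A, by.df$B)),
                 function(data.) {
                   tmp <- coef(lm(y ~ x, data.))
                   data.frame(A = unique(data.$A),
                              B = unique(data.$B),
                              intercept = tmp[1],
                              slope = tmp[2])
                 })

# Stack the per-subset rows into one data.frame.
byFitsDF <- do.call("rbind", byFits)
```

The result keeps A and B as ordinary columns next to the coefficients, which is what a plain do.call("rbind", ...) on the "by" object loses.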
At 2:54 PM -0700 6/5/03, Spencer Graves wrote:

  
    
#
Spencer,
   Would "sapply" be better here?

R> by.df <- data.frame(A=rep(c("A1", "A2"), each=3),
R+                     B=rep(c("B1", "B2"), each=3),
R+                     x=1:6, y=rep(0:1, length=6))
R> t(sapply(split(by.df, do.call("paste", c(by.df[, 1:2], sep = ":"))),
R+          function(x) coef(lm(y ~ x, data = x))))
       (Intercept)             x
A1:B1   0.3333333 -1.517960e-16
A2:B2   0.6666667  3.282015e-16
R>

Sundar
2 days later
#
Thanks to Thomas Lumley, Sundar Dorai-Raj, and Don MacQueen for their 
suggestions.  I need the INDICES as part of the output data.frame, which 
MacQueen's solution provided.  I generalized his method as follows:

by.to.data.frame <-
function(x, INDICES, FUN){
# Split data.frame x on x[,INDICES]
# and lapply FUN to each data.frame subset,
# returning a data.frame
#
#  Internal functions
    get.Index <- function(x, INDICES){
	Ind <- as.character(x[,INDICES[1]])
	k <- length(INDICES)
	if(k > 1)
		Ind <- paste(Ind, get.Index(x, INDICES[-1]), sep=":")
	Ind
     }
     FUN2 <- function(data., INDICES, FUN){
	vec <- FUN(data.)
	Vec <- matrix(vec, nrow=1)
	dimnames(Vec) <- list(NULL, names(vec))
	cbind(data.[1,INDICES], Vec)
     }
#   Combine INDICES
     Ind <- get.Index(x, INDICES)
#   Apply ...:  Do the work.
     Split <- split(x, Ind)
     byFits <- lapply(Split, FUN2, INDICES, FUN)
#   Convert to a data.frame
     do.call('rbind',byFits) 	
}

Applying this to my toy problem produces the following:

 > by.df <- data.frame(A=rep(c("A1", "A2"), each=3),
+  B=rep(c("B1", "B2"), each=3), x=1:6, y=rep(0:1, length=6))
 >
 > by.to.data.frame(by.df, c("A", "B"), function(data.)coef(lm(y~x, data.)))
        A  B (Intercept)             x
A1:B1 A1 B1   0.3333333 -1.517960e-16
A2:B2 A2 B2   0.6666667  3.282015e-16

Thanks for the assistance.  I can now tackle the real problem that 
generated this question.

Best Wishes,
Spencer Graves
########################################
1 day later
#
Hi, Don:

	  Thanks for your suggestion to use "do.call" in my "get.Index". I 
discovered that your version actually produces cosmetically different 
answers in R 1.6.2 and S-Plus 6.1 for Windows.  Fortunately, in the 
context, this difference was unimportant.  Since yours is faster, it is 
clearly superior.

	  To check my understanding, I generalized my toy example as follows:

 > by.df <- data.frame(A=rep(c("A1", "A2"), each=3),
+  B=rep(c("B1", "B2"), each=3), C=rep(c("C1", "C2"), each=3),
+  x=1:6, y=rep(0:1, length=6))

	  With this, your "get.Index" produced the following in R 1.6.2:

 > get.Index <- function(x, INDICES) do.call('paste',c(x[INDICES],sep=':'))
 > get.Index(by.df, c("A", "B", "C"))
[1] "A1:B1:C1" "A1:B1:C1" "A1:B1:C1" "A2:B2:C2" "A2:B2:C2" "A2:B2:C2"

In S-Plus 6.1 for Windows, I got the following:

 > get.Index <- function(x, INDICES)
do.call("paste", c(x[INDICES], sep = ":"))
 > get.Index(by.df, c("A", "B", "C"))
[1] "1:1:1" "1:1:1" "1:1:1" "2:2:2" "2:2:2" "2:2:2"

	  Fortunately, this difference is unimportant in this context, as 
"by.to.data.frame" produces the same answer in both cases.  Moreover, 
your answer converts to a single call to "paste", which means that it 
should be faster.  For someone who understands "do.call", your version 
is also easier to read.

Thanks again for your help.
Spencer Graves

######################################
Don MacQueen wrote:
> Glad to hear it was helpful.
 >
 > You can also use the do.call trick for the paste indices business.
 >
 > Try
 >    get.Index <- function(x, INDICES)
 >        do.call('paste',c(x[INDICES],sep=':'))
 >
 > This works because a data frame is actually a list, albeit a special
 > kind of list, and do.call() wants a list for its second arg.
 >
 > -Don
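
A small sketch of why the do.call trick works, assuming character (not factor) index columns: c() applied to a data frame plus a named sep argument yields a plain list, and do.call() then passes each element of that list to paste() as a separate argument. The names args, idx and direct are illustrative only.

```r
# Index columns only; stringsAsFactors = FALSE keeps them as character.
by.df <- data.frame(A = rep(c("A1", "A2"), each = 3),
                    B = rep(c("B1", "B2"), each = 3),
                    stringsAsFactors = FALSE)

# A data frame is a list of columns, so c() returns a list: A, B, sep.
args <- c(by.df[c("A", "B")], sep = ":")

# Equivalent to calling paste(A = ..., B = ..., sep = ":") by hand.
idx <- do.call("paste", args)
direct <- paste(by.df$A, by.df$B, sep = ":")
```

Because the list is built from whatever columns are selected, the same call handles two, three, or more index columns without recursion.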