Code is too slow: mean-centering variables in a data frame by subgroup
Dimitri,
You might try applying ave() to each column. E.g., use
f2 <- function(frame) {
   for(i in 2:ncol(frame)) {
      frame[,i] <- ave(frame[,i], frame[,1],
         FUN=function(x)x/mean(x,na.rm=TRUE))
   }
   frame
}
Note that this returns a data.frame and retains the
grouping column (the first) while your original
code returns a matrix without the grouping column.
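As a quick sanity check (a sketch on a small made-up data frame, not the poster's data; the name toy is only illustrative), each group's mean of a rescaled column should come out as 1:

```r
# f2() as given above, repeated so this block runs on its own
f2 <- function(frame) {
  for (i in 2:ncol(frame)) {
    frame[, i] <- ave(frame[, i], frame[, 1],
                      FUN = function(x) x / mean(x, na.rm = TRUE))
  }
  frame
}

# Hypothetical toy data: grouping column first, then numeric columns with NAs
toy <- data.frame(group = rep(c("g1", "g2"), each = 3),
                  a = c(1, 2, 3, 10, NA, 30),
                  b = c(NA, 4, 6, 5, 5, 5))
res <- f2(toy)

# Per-group means of the rescaled columns; NAs stay NA in res
tapply(res$a, res$group, mean, na.rm = TRUE)
tapply(res$b, res$group, mean, na.rm = TRUE)
```

The result is still a data.frame with the grouping column in first position.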
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
Sent: Tuesday, March 30, 2010 10:52 AM
To: 'Dimitri Liakhovitski'; 'r-help'
Subject: Re: [R] Code is too slow: mean-centering variables in a data frame by subgroup
?scale
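scale() works on whole columns rather than by group, so one possible way to use it here is to apply it to each group's block of rows. The sketch below assumes the goal is still division by the group mean, hence center = FALSE with the group's column means passed as the scale argument. The toy data and the names vals, rows and block are illustrative only.

```r
# Hypothetical toy data: grouping column first, then numeric columns with NAs
toy <- data.frame(group = rep(c("g1", "g2"), each = 3),
                  a = c(1, 2, 3, 10, NA, 30),
                  b = c(NA, 4, 6, 5, 5, 5))

vals <- as.matrix(toy[-1])   # numeric columns only
for (rows in split(seq_len(nrow(toy)), toy$group)) {
  block <- vals[rows, , drop = FALSE]
  # center = FALSE: nothing is subtracted;
  # scale = <numeric vector>: each column is divided by its entry
  vals[rows, ] <- scale(block, center = FALSE,
                        scale = colMeans(block, na.rm = TRUE))
}
result <- data.frame(group = toy$group, vals)
result
```

Working on a numeric matrix handles all columns of a group in one call, which may matter with a couple of thousand columns.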
Bert Gunter
Genentech Nonclinical Biostatistics
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Dimitri Liakhovitski
Sent: Tuesday, March 30, 2010 8:05 AM
To: r-help
Subject: [R] Code is too slow: mean-centering variables in a data frame by subgroup
Dear R-ers,
I have a large data frame (several thousand rows and about 2.5
thousand columns). One variable ("group") is a grouping variable with
over 30 levels, and there are a lot of NAs.
For each variable, I need to divide each value by that variable's mean
within its subgroup. I have code that does this, but it's way too slow:
it takes about 1.5 hours.
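To make the requirement concrete, here is a small illustration on made-up numbers (the vectors x and g are hypothetical, not part of the real data): every value is divided by the mean of the non-missing values in its own group.

```r
x <- c(2, 4, NA, 10, 30)
g <- c("A", "A", "A", "B", "B")

# Mean of the non-missing values in each group: A = 3, B = 20
group_means <- tapply(x, g, mean, na.rm = TRUE)

# Look up each element's group mean by name and divide; NA stays NA
scaled <- x / group_means[g]
scaled
```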
Below is a data example and my code that is too slow. Is there a
different, faster way of doing the same thing?
Thanks a lot for your advice!
Dimitri
# Building an example frame - with groups and a lot of NAs:
set.seed(1234)
frame<-data.frame(group=rep(paste("group",1:10),10),
  a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),d=rnorm(1:100),
  e=rnorm(1:100),f=rnorm(1:100),g=rnorm(1:100))
frame<-frame[order(frame$group),]
names.used<-names(frame)[2:length(frame)]
set.seed(1234)
for(i in names.used){
  i.for.NA<-sample(1:100,60)
  frame[[i]][i.for.NA]<-NA
}
frame
### Code that does what's needed but is too slow:
Start<-Sys.time()
frame <- do.call(cbind, lapply(names.used, function(x){
  unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=TRUE)))
}))
Finish<-Sys.time()
print(Finish-Start) # Takes too long
--
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.