Dear colleagues,
I have the following code, which is meant to 'filter' a data set.
It works on the data frame 'whole', which has four numeric columns: a, b, d, and c.
Every row of the data frame is treated as a point in 3-D space:
variables a, b, and d are the point's coordinates, and c is its value.
The code looks at every point, builds a cube centered at that point,
selects the set of points inside this cube, calculates the mean and SD
of their values, and drops the points whose values differ from the mean
by more than 2 SD.
Here is the code.
=======
# initialization
cube.half.size <- 2   # half size of the cube to be built around every point
mult.sigma     <- 2   # drop every point whose value differs from the
                      # mean by more than mult.sigma * SD
to.drop <- data.frame()          # the list of points to drop
for (i in 1:length(whole$c)) {   # look at every point...
  pv <- subset(whole, abs(a - whole$a[i]) < cube.half.size &   # make the subset...
                      abs(b - whole$b[i]) < cube.half.size &
                      abs(d - whole$d[i]) < cube.half.size)
  if (length(pv$c) > 1) {        # if the subset holds more than the point itself,
    mean.c <- mean(pv$c)         # calculate mean and SD
    sd.c   <- sd(pv$c)
    # list the points to drop from the current subset
    td <- subset(pv, abs(c - mean.c) > sd.c * mult.sigma)
    if (length(td$c) > 0) {
      # check which of these points are already in the list to drop
      td.index <- which(row.names(td) %in% row.names(to.drop))
      # and replenish the list of points to drop
      to.drop <- rbind(to.drop, if (length(td.index) > 0) td[-td.index, ] else td)
      # print a progress message to show we're alive (these messages
      # will not appear regularly; that's OK)
      if (length(td.index) != length(td$c))
        print(c("i=", i, "Points to drop: ", length(to.drop$c)))
    }
  }
}
# make a new data set without dropped points
whole.flt.3 <- whole[-which(row.names(to.drop) %in% row.names(whole)), ]
=======
The problem is that the 'whole' data set is large (more than 100000
rows), and the script runs for several hours.
The running time grows even larger if I build a sphere instead of a
cube.
I would like to optimize the code to make it run faster.
Is it possible?
Would sorting help?
Thank you for your attention and any good feedback.
--
Best regards
Wladimir Eremeev mailto:wl at eimb.ru
==========================================================================
Research Scientist, PhD
Russian Academy of Sciences
Thread: "need any advises for code optimization" (4 messages: Vladimir Eremeev, Rich FitzJohn)
Hi,
One fruitful course for optimisation is to vectorise wherever
possible, and avoid for-loops.
Something like the code below might be a good place to start.
=============
## Generate a thousand rows of data
cube.half.size <- 2
mult.sigma <- 2
n <- 1000
whole <- data.frame(a=runif(n), b=runif(n), c=rnorm(n), d=runif(n))*10
cube.look <- function() {
  f <- function(x) {
    with(whole, {
      i <- (abs(a - a[x]) < cube.half.size &
            abs(b - b[x]) < cube.half.size &
            abs(d - d[x]) < cube.half.size)
      if ( any(i) ) {
        subdat <- c[i]
        which(i)[abs(subdat - mean(subdat)) > sd(subdat)*mult.sigma]
      } else NULL
    })
  }
  td <- lapply(seq(length=n), f)
  whole[-unique(unlist(td)), ]
}
## And wrap the original in a function for comparison:
cube.look.orig <- function() {
  to.drop <- data.frame()
  for (i in 1:length(whole$c)) {
    pv <- subset(whole, abs(a - whole$a[i]) < cube.half.size &
                        abs(b - whole$b[i]) < cube.half.size &
                        abs(d - whole$d[i]) < cube.half.size)
    if (length(pv$c) > 1) {
      mean.c <- mean(pv$c)
      sd.c <- sd(pv$c)
      td <- subset(pv, abs(c - mean.c) > sd.c * mult.sigma)
      if (length(td$c) > 0) {
        td.index <- which(row.names(td) %in% row.names(to.drop))
        to.drop <- rbind(to.drop, if (length(td.index) > 0) td[-td.index, ] else td)
        if (length(td.index) != length(td$c))
          print(c("i=", i, "Points to drop: ", length(to.drop$c)))
      }
    }
  }
  td.orig <<- to.drop
  ## This does not subset the way you want:
  ## whole[-which(row.names(to.drop) %in% row.names(whole)),]
  whole[-as.integer(row.names(to.drop)), ]
}
## Time how long it takes to run over the test data.frame (10 runs):
t.new <- t(replicate(10, system.time(cube.look())))
t.orig <- t(replicate(10, system.time(cube.look.orig())))
## On my alpha, the version using lapply() takes 4.9 seconds of CPU
## time (avg 10 runs), while the original version takes 23.3 seconds -
## so we're 4.8 times faster.
apply(t.orig, 2, mean)/apply(t.new, 2, mean)
## And the results are the same.
r.new <- cube.look()
r.orig <- cube.look.orig()
identical(r.new, r.orig)
==============
The above code could almost certainly be tweaked (replacing which()
with a stripped down version would probably save some time, since the
profile indicates we spend about 10% of our time there). Using with()
saved another 10% or so, compared with indexing a..d (e.g. whole$a)
every iteration. However, trying a completely different approach
would be more likely to yield better time savings. mapply() might be
one to try, though I have a feeling this is just a wrapper around
lapply(). I believe there is a section in the "Writing R Extensions"
manual that deals with profiling, and may touch on optimisation.
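For what it's worth, a minimal sketch of profiling with Rprof() from the
utils package (the function f() below is just a stand-in workload, not
the cube code, and the output file path is arbitrary):

```r
## Profile a stand-in workload; f() is a dummy placeholder for cube.look().
f <- function() for (i in 1:100) sort(runif(1e5))

out <- tempfile()            # profiler output file (arbitrary path)
Rprof(out, interval = 0.01)  # start sampling the call stack
f()                          # run the code under study
Rprof(NULL)                  # stop profiling
## Report time by function; the top of "by.self" shows which
## functions dominate the run time.
head(summaryRprof(out)$by.self)
```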
Cheers,
Rich
On Apr 4, 2005 6:50 PM, Wladimir Eremeev <wl at eimb.ru> wrote:
> [quoted original message trimmed]
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
--
Rich FitzJohn
rich.fitzjohn <at> gmail.com | http://homepages.paradise.net.nz/richa183
You are in a maze of twisty little functions, all alike
Dear Rich,

Thank you for the reply. I think the optimization you offered will
satisfy my needs. I don't completely understand the following:

RF> ## And wrap the original in a function for comparison:
RF> ## This does not subset the way you want:
RF> ## whole[-which(row.names(to.drop) %in% row.names(whole)),]
RF> whole[-as.integer(row.names(to.drop)),]

Why doesn't my subset work properly? My data frame 'whole' was created
from 3 other data frames by rbind, if that makes sense...
Moreover, your variant gives an error:
> as.integer(row.names(to.drop)[120:220])
  [1]   2761   3616   3629   5808   7204   7627   8192  10851  20275 273611   4492 256691   8797
 [14]  11756  46673 246981 250401 335591    773    774    786    993    995   1454   2715   6990
 [27]   7951   7962   8185   8662   9406 442100 478100 528100 208710 211710 215910  19846  28660
 [40]  28661  28691  28806  28878 450611 497411  81672  91572 119232 166191 166281 203981 204201
 [53] 255171 255212 255301 300651 331212 371761 397651 405241 415331   8779 195510 197910 203210
 [66] 205410 205510 211810 220610  19615  27165  28581  28640  28641  28642  28662  28714  48692
 [79] 449611 449911 497211  81702 195451 202491 202551 253931 255071 259102 266971 303341 331831
 [92] 353912 371931 374612 394461 397641 412671   9227 464100   1558   2161
> whole[-as.integer(row.names(to.drop)[120:220]),]
Error in "[.data.frame"(whole, -as.integer(row.names(to.drop)[120:220]), :
  subscript out of bounds
Row names don't coincide with row order numbers in my case.
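For instance, a toy sketch (invented frames, not my actual data) of how
rbind can leave row names that don't match row positions:

```r
## rbind() keeps the original row names instead of renumbering, so the
## combined frame can have names far larger than nrow().
a <- data.frame(x = 1:2, row.names = c("10", "20"))
b <- data.frame(x = 3:4, row.names = c("30", "40"))
w <- rbind(a, b)
row.names(w)                   # "10" "20" "30" "40", not "1".."4"
## Indexing with -as.integer(row.names(w)) would then address rows
## 10..40 of a 4-row frame: subscript out of bounds.
```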
--
Best regards
Wladimir Eremeev mailto:wl at eimb.ru
Hi again,

The arguments to %in% are in the wrong order in your version. Because of
that, the statement

  row.names(to.drop) %in% row.names(whole)

will be TRUE for the first nrow(to.drop) elements, and FALSE for the
remainder. To fix it, just switch the order around, or use the simpler
version:

  whole[!row.names(whole) %in% row.names(to.drop),]

The fact that your row names are different to the row indices in whole
will be what is causing the error when trying my variant.

Cheers,
Rich
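To illustrate the difference on a toy frame (names and values invented
for the example):

```r
## Toy data: 'whole' with named rows, 'to.drop' holding two of them.
whole   <- data.frame(v = 1:5, row.names = c("r1","r2","r3","r4","r5"))
to.drop <- whole[c("r2", "r4"), , drop = FALSE]

## Wrong order: one element per row of to.drop, all TRUE here, so
## which() gives 1:2 and the first two rows of 'whole' get dropped,
## not the rows named in to.drop.
row.names(to.drop) %in% row.names(whole)    # TRUE TRUE

## Right order: a logical vector over the rows of 'whole' itself.
flt <- whole[!row.names(whole) %in% row.names(to.drop), , drop = FALSE]
row.names(flt)                              # "r1" "r3" "r5"
```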
On Apr 4, 2005 10:21 PM, Wladimir Eremeev <wl at eimb.ru> wrote:
> [quoted message trimmed]