-----Original Message-----
From: William Dunlap
Sent: Thursday, December 04, 2008 9:59 AM
To: 'daren76 at hotmail.com'
Cc: 'R help'
Subject: Re: [R] How to optimize this codes ?
[R] How to optimize this codes ?
Daren Tan daren76 at hotmail.com
Thu Dec 4 17:02:49 CET 2008
How to optimize the for-loop to be reasonably fast for
sample.size=100000000 ?
You may want to change sample.size=1000 to have an idea
what I am achieving.
set.seed(143)
A <- matrix(sample(0:1, sample.size, TRUE), ncol=10,
dimnames=list(NULL, LETTERS[1:10]))
B <- list()
for(i in 1:10) {
B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x)
{ sum(apply(data.frame(A[,x]), 1, all)) })
}
Instead computing all(A[row,x]) each row of the matrix by looping
over the rows you could loop over the columns. That is generally
quicker if there are many more rows than columns. S+ and R have
functions called pmin and pmax that do this for min and max, but
no pany or pall functions. In your 0/1 case all is the same as min
so you can replace
apply(data.frame(A[,x]), 1, all)
with
sum(do.call("pmin", unname(A[,x,drop=FALSE])))
after first converting A to a data.frame before starting to compute B.
(The do.call/unname combo is a hack to pass all the columns
of a data.frame
as separate arguments to a function. If pmin took a list
that wouldn't
be necessary.)
When you do timings, looking at one size of problem often
isn't helpful,
as it doesn't tell you how the time depends on the size of
the input. The
relationship often is not linear. E.g., I put your method in
a function
called computeB0 and my modification of it into computeB1 and
wrote a makeA
that makes an A matrix with a given sample.size. Here are
the times, it looks
like 1000 is below the linear range for this example:
sample.size time:computeB0 time:computeB1
1000 5.36 0.98
10000 36.82 2.49
100000 381.34 18.97
and here are the functions used
makeA <-
function(sample.size){
set.seed(143);
A<-matrix(sample(0:1,size=sample.size,replace=TRUE),
ncol=10, dimnames=list(NULL, LETTERS[1:10]));
A
}
computeB0 <-
function(A){B<-list()
for(i in 1:10) {
B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x) {
sum(apply(data.frame(A[,x]), 1, all)) })
}
B
}
computeB1 <-
function(A){
A<-as.data.frame(A)
B<-list()
for(i in 1:10) {
B[[i]] <- apply(combn(LETTERS[1:10], i), 2,
FUN=function(x) { sum(do.call("pmin",
unname(A[,x,drop=FALSE]))) }
)
}
B
}
You could probably same more time by restructuring this as a
recursive function,
still operating on columns, that traversed the binary tree of
column inclusion/exclusion.
I just wanted to point out the advantage of looping over
columns instead of looping over
rows.