efficient code. how to reduce running time?
11 messages · miraceti, Charilaos Skiadas, Brian Ripley +1 more
Dear Mira,

I didn't work through your code in detail, but I did notice that you're doing something that's very inefficient in R -- building up objects element-by-element, e.g., by indexing beyond their current length. Instead, you can preallocate the objects and simply replace elements. For example, create the vector F in your perm.F() as F <- numeric(nperms) rather than as an empty vector. (BTW, I'd probably not name a variable "F" since this is usually a synonym for the logical value FALSE.)

There's a similar problem in your handling of maxF, which you build up column-by-column via cbind(). (BTW, is maxF really a matrix?)

You also needlessly recompute max(F), never appear to use MSSid, and end lines with unnecessary semicolons.

I hope that this helps,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
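[Editor's note: John's preallocation advice can be sketched as follows. This is an illustrative example, not from the thread; the function names grow() and prealloc() are made up.]

```r
# Sketch of the preallocation advice: growing a vector element-by-element
# forces repeated reallocation and copying, while a preallocated vector
# is filled in place. Both functions compute the same result.
grow <- function(n) {
  out <- c()                     # starts empty; each out[i] <- ... extends it
  for (i in 1:n) out[i] <- i^2
  out
}

prealloc <- function(n) {
  out <- numeric(n)              # full length allocated once up front
  for (i in 1:n) out[i] <- i^2
  out
}
```

On vectors of any real size, prealloc() avoids the repeated copying cost that grow() pays on every extension.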
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of miraceti
Sent: Sunday, January 21, 2007 12:38 PM
To: r-help at stat.math.ethz.ch
Subject: [R] efficient code. how to reduce running time?
Hi,
I am new to R, and even though I've gotten my code to run and do what it needs to, it is taking forever and I can't use it like this. I was wondering if you could help me find ways to make the code run faster.
Here is my code.
The data set is a bunch of 0s and 1s in a data.frame. What I am doing is this: I pick a column and make up a new column Y with values associated with the column I picked. Then I remove that column and see whether any other columns in the data have a significant association with the Y column I've generated.
If you see anything obvious that takes a long time, any suggestions would be appreciated.
thanks a lot.
Mira
--------------------------------------------------------------------------------
#sub function for finding index
rfind <- function(x) seq(along=x)[x*(1-x) > MAF*(1-MAF)]

#sub function for permutation test
perm.F = function(y, x, nperms, sites)
{
  maxF = c();
  for (i in 1:nperms)
  {
    F = numeric(S)              #create an empty vector to store the F-values
    newY = sample(y, length(y)) #permute the cancer types
    newX = cbind(x, newY);
    # anova for all sites
    for (i in sites)
    {
      a <- anova(lm(newY ~ factor(newX[,i])));
      F[i] <- a$`F value`[1];
    }
    MSSid <- which(F == max(F)); # index of MSS (Most Significant Site)
    maxF = cbind(maxF, max(F));
  }
  maxF;
}
# set the output file
sink("/tmp/R.out.3932.100")
# load the dataset
snp = read.table(file("/tmp/msoutput.3932.100"))
#print (snp);
# pi: desired proportion of variation due to QTN
pi = 0.05;
print (paste("pi:", pi));
MAF = 0.05;
print (paste("MAF:", MAF));
# S: number of segregating sites
S = length(snp[1,]);
# N: number of samples
N = length(snp[,1]);
Dips = sample(1:N,N)
DipA = Dips[1:50]
DipB = Dips[51:100]
disnp = snp[DipA,] + snp[DipB,]
snp = as.data.frame(disnp, row.names=NULL);
N = length(snp[,1]);
# get allele freq for all SNPs
allele_f <- mean(snp[,1:S])/2;
print (allele_f);
sites = rfind(allele_f);
print(sites);
# collapse sites that have perfect correlation
newsites <- sites;
for (i in 1:(length(sites)-1)) {
for (j in (i+1):length(sites))
{
test = (snp[sites[i]] == snp[sites[j]])
if ( all(test) || all(!test) )
{
print (paste("perfect correlation with", sites[i]));
print (paste("removing alleles", sites[j]));
newsites <- newsites[newsites!=sites[j]];
}
}
}
sites <- newsites;
print(sites);
# QTN: the site nearest right to S/4
sitesid = floor(length(sites)/4);
QTNid = sites[sitesid];
QTN = snp[,QTNid];
print (paste("QTN:", names(snp[QTNid])));
print (QTN);
# remove QTN from sites
sites <- sites [ sites != QTNid ];
print(sites);
print (paste("Number of usable SNPs:", length(sites)));
# p: allele frequency of QTN
p0 = allele_f[QTNid];
p = min(p0, 1-p0);
print (paste("QTN_freq:", p));
# z: random normal deviate
z = rnorm(N, mean = 0, sd = 1);
# foreach sample give quantitative phenotype
# each row is a sample
# phenotype value depends on QTN genotype, pi, p, and z
Y <- sqrt(10-(10*pi))*z + QTN*sqrt((10*pi)/(2*p*(1-p)));
snp = data.frame(cbind(snp, Y));
# anova for QTN
df = data.frame(Y=Y, QTN=factor(QTN));
QTN_a <- anova(lm(Y~QTN, data=df));
print (QTN_a);
SSB <- QTN_a$`Sum Sq`[1];
SSW <- QTN_a$`Sum Sq`[2];
QTN_PRE <- SSB / (SSB + SSW);
print (paste("var_QTN/var_tot:", QTN_PRE));
# anova for all sites
F=numeric(S) #create an empty vector to store the F-values
Pval=rep(1,S) #create an empty vector to store the Pval
PRE=numeric(S) #create an empty vector to store the PRE
for ( i in sites )
{
a <- anova(lm(Y~factor(snp[,i])));
print (a);
F[i] <- a$`F value`[1];
Pval[i] <- a$`Pr`[1];
SSB <- a$`Sum Sq`[1];
SSW <- a$`Sum Sq`[2];
PRE[i] <- SSB / (SSB + SSW);
}
print (paste("Max F:", max(F)));
MSSid <- which(F == max(F)); # index of MSS (Most Significant Site)
MSS = snp[,MSSid];
print (paste("MSS(Most Significant Site):", MSSid));
p0 = length(MSS[MSS==0])/N;
p = min(p0, 1-p0);
print (paste("assoc_freq:", p));
print (paste("assoc_var:", PRE[MSSid]));
# lets do a permutation test
Fdist <- perm.F(Y, snp[,1:S], 1000, sites);
print ("permutation test maxF dist");
print (Fdist);
pvalue <- mean(Fdist > F[MSSid]);
print (paste("assoc_prob:", pvalue));
# close the output file
sink()
--------------------------------------------------------------------------------
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
On Jan 21, 2007, at 5:55 PM, miraceti wrote:
Thank you all for looking at it. I'll fix the code to preallocate the objects. I wonder if there is a way to call anova on all the columns at the same time. Right now I am calling (Y~V1, data) from V1 to V50 through a loop. I tried (Y~., data), but it gave me different values from the results I get when I call them separately, so I can't help but call them 25,000 times...
Have you looked at lapply, sapply, apply and friends? Haris
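[Editor's note: as a sketch of Haris's suggestion (illustrative, not from the thread; the data are made up), sapply() can run one ANOVA per column and collect the F statistics. Note that Y ~ . behaves differently because it fits a single model with all columns as joint predictors, which is why its results don't match the one-column-at-a-time fits.]

```r
# Illustrative sketch: one ANOVA per column via sapply(), collecting
# the F statistic for each predictor separately.
set.seed(1)
dat <- data.frame(matrix(sample(0:2, 50 * 10, replace = TRUE), 50, 10))
Y <- rnorm(50)
Fvals <- sapply(dat, function(col) {
  a <- anova(lm(Y ~ factor(col)))   # separate one-way ANOVA per column
  a$`F value`[1]                    # F statistic for this predictor
})
```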
Dear Haris,

Using lapply() et al. may produce cleaner code, but it won't necessarily speed up a computation. For example:
X <- data.frame(matrix(rnorm(1000*1000), 1000, 1000))
y <- rnorm(1000)
mods <- as.list(1:1000)
system.time(for (i in 1:1000) mods[[i]] <- lm(y ~ X[,i]))
[1] 40.53 0.05 40.61 NA NA
system.time(mods <- lapply(as.list(X), function(x) lm(y ~ x)))
[1] 53.29 0.37 53.94 NA NA

In cases such as this, I don't even find the code using *apply() easier to read.

Regards,
 John
On Jan 21, 2007, at 8:11 PM, John Fox wrote:
Interesting, in my system the results are quite different:

> system.time(for (i in 1:1000) mods[[i]] <- lm(y ~ X[,i]))
[1] 192.035 12.601 797.094 0.000 0.000
> system.time(mods <- lapply(as.list(X), function(x) lm(y ~ x)))
[1] 59.913 9.918 289.030 0.000 0.000

Regular MacOSX install with ~760MB memory.
Haris
Dear Haris,

My timings were on a 3 GHz Pentium 4 system with 1 GB of memory running Win XP SP2 and R 2.4.1. I'm no expert on these matters, and I wouldn't have been surprised by qualitatively different results on different systems, but this difference is larger than I would have expected.

One thing that seems particularly striking in your results is the large difference between elapsed time and user CPU time, making me wonder what else was going on when you ran these examples.

Regards,
 John
On Mon, 22 Jan 2007, Charilaos Skiadas wrote:
But MacOS X is infamous for having rather specific speed problems with its malloc, and so gives different timing results from all other platforms. We are promised a solution in MacOS 10.5. Both of your machines seem very slow compared to mine:
> system.time(for (i in 1:1000) mods[[i]] <- lm(y ~ X[,i]))
   user  system elapsed
 11.011   0.250  11.311
> system.time(mods <- lapply(as.list(X), function(x) lm(y ~ x)))
   user  system elapsed
 13.463   0.260  13.812

and that on a 64-bit platform (AMD64 Linux FC5).
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
Dear Brian,
Thanks for the clarification.
As you can see from the specs (in a previous message), my system is quite old, which probably accounts for at least part of the difference. The ratios of the user times for my and your system aren't too different though:
> 53.29/40.53 # mine
[1] 1.314829
> 13.463/11.011 # yours
[1] 1.222686

Regards,
 John
On Jan 22, 2007, at 10:39 AM, John Fox wrote:
One thing that seems particularly striking in your results is the large difference between elapsed time and user CPU time, making me wonder what else was going on when you ran these examples.
Yes, indeed there were a lot of other things going on; this is the only machine I have and I use it continuously. I'll try to run another test tonight when the machine is not in use. It did seem a very striking difference, though.

But am I wrong in thinking that these measurements should be independent of what other applications are running at the same time, and should measure exactly the CPU time needed to finish this task, regardless of how often the process got to use the CPU? I was working under that assumption, which indeed makes the above comparison a very unfair one, because there was a lot more going on during the first system.time call. Still, the difference is quite large, which could simply have to do with the internals of the two commands, coupled with Prof. Ripley's comments about malloc in Mac OS X.
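[Editor's note: an illustrative sketch, not part of the thread. system.time() separates user and system CPU time, which count only cycles charged to the R process, from elapsed wall-clock time, which also includes time spent waiting. So user time is largely independent of what else is running, but elapsed time is not. Sys.sleep() makes the distinction visible, since it waits without consuming CPU:]

```r
# Sys.sleep() waits without burning CPU: elapsed (wall-clock) time grows
# by about 2 seconds, while user CPU time stays near zero.
t <- system.time(Sys.sleep(2))
```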
Haris