efficient code. how to reduce running time?
11 messages · miraceti, Charilaos Skiadas, Brian Ripley +1 more
Dear Mira,

I didn't work through your code in detail, but I did notice that you're doing something that's very inefficient in R -- building up objects element-by-element, e.g., by indexing beyond their current length. Instead, you can preallocate the objects and simply replace elements. For example, create the vector F in your perm.F() as F <- numeric(nperms) rather than as an empty vector. (BTW, I'd probably not name a variable "F" since this is usually a synonym for the logical value FALSE.)

There's a similar problem in your handling of maxF, which you build up column-by-column via cbind(). (BTW, is maxF really a matrix?)

You also needlessly recompute max(F), never appear to use MSSid, and end lines with unnecessary semicolons.

I hope that this helps,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
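[Editor's note: John's preallocation advice can be sketched as follows. This is an illustrative example, not from the thread; the function names grow() and prealloc() are made up.]

```r
# Sketch of the preallocation advice: growing a vector element-by-element
# forces repeated reallocation and copying, while a preallocated vector
# is filled in place. Both functions compute the same result.
grow <- function(n) {
  out <- c()                     # starts empty; each out[i] <- ... extends it
  for (i in 1:n) out[i] <- i^2
  out
}

prealloc <- function(n) {
  out <- numeric(n)              # full length allocated once up front
  for (i in 1:n) out[i] <- i^2
  out
}
```

On vectors of any real size, prealloc() avoids the repeated copying cost that grow() pays on every extension.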
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of miraceti
Sent: Sunday, January 21, 2007 12:38 PM
To: r-help at stat.math.ethz.ch
Subject: [R] efficient code. how to reduce running time?
Hi,
I am new to R, and even though I've gotten my code to run and do what it needs to, it is taking forever and I can't use it like this. I was wondering if you could help me find ways to make the code run faster.
Here is my code.
The data set is a bunch of 0s and 1s in a data.frame. What I am doing is this: I pick a column and make up a new column Y with values associated with the column I picked. Then I remove that column and see whether any other columns in the data have a significant association with the Y column I've generated.
If you see anything obvious that takes a long time, any suggestions would be appreciated.
thanks a lot.
Mira
--------------------------------------------------------------------------------
#sub function for finding index
rfind <- function(x) seq(along=x)[x*(1-x) > MAF*(1-MAF)]

#sub function for permutation test
perm.F = function(y, x, nperms, sites)
{
  maxF = c();
  for (i in 1:nperms)
  {
    F = numeric(S)              #create an empty vector to store the F-values
    newY = sample(y, length(y)) #permute the cancer types
    newX = cbind(x, newY);
    # anova for all sites
    for (i in sites)
    {
      a <- anova(lm(newY ~ factor(newX[,i])));
      F[i] <- a$`F value`[1];
    }
    MSSid <- which(F == max(F)); # index of MSS (Most Significant Site)
    maxF = cbind(maxF, max(F));
  }
  maxF;
}
# set the output file
sink("/tmp/R.out.3932.100")
# load the dataset
snp = read.table(file("/tmp/msoutput.3932.100"))
#print (snp);
# pi: desired proportion of variation due to QTN
pi = 0.05;
print (paste("pi:", pi));
MAF = 0.05;
print (paste("MAF:", MAF));
# S: number of segregating sites
S = length(snp[1,]);
# N: number of samples
N = length(snp[,1]);
Dips = sample(1:N,N)
DipA = Dips[1:50]
DipB = Dips[51:100]
disnp = snp[DipA,] + snp[DipB,]
snp = as.data.frame(disnp, row.names=NULL);
N = length(snp[,1]);
# get allele freq for all SNPs
allele_f <- mean(snp[,1:S])/2;
print (allele_f);
sites = rfind(allele_f);
print(sites);
# collapse sites that have perfect correlation
newsites <- sites;
for (i in 1:(length(sites)-1)) {
for (j in (i+1):length(sites))
{
test = (snp[sites[i]] == snp[sites[j]])
if ( all(test) || all(!test) )
{
print (paste("perfect correlation with", sites[i]));
print (paste("removing alleles", sites[j]));
newsites <- newsites[newsites!=sites[j]];
}
}
}
sites <- newsites;
print(sites);
# QTN: the site nearest right to S/4
sitesid = floor(length(sites)/4);
QTNid = sites[sitesid];
QTN = snp[,QTNid];
print (paste("QTN:", names(snp[QTNid])));
print (QTN);
# remove QTN from sites
sites <- sites [ sites != QTNid ];
print(sites);
print (paste("Number of usable SNPs:", length(sites)));
# p: allele frequency of QTN
p0 = allele_f[QTNid];
p = min(p0, 1-p0);
print (paste("QTN_freq:", p));
# z: random normal deviate
z = rnorm(N, mean = 0, sd = 1);
# foreach sample give quantitative phenotype
# each row is a sample
# phenotype value depends on QTN genotype, pi, p, and z
Y <- sqrt(10-(10*pi))*z + QTN*sqrt((10*pi)/(2*p*(1-p)));
snp = data.frame(cbind(snp, Y));
# anova for QTN
df = data.frame(Y=Y, QTN=factor(QTN));
QTN_a <- anova(lm(Y~QTN, data=df));
print (QTN_a);
SSB <- QTN_a$`Sum Sq`[1];
SSW <- QTN_a$`Sum Sq`[2];
QTN_PRE <- SSB / (SSB + SSW);
print (paste("var_QTN/var_tot:", QTN_PRE));
# anova for all sites
F=numeric(S) #create an empty vector to store the F-values
Pval=rep(1,S) #create an empty vector to store the Pval
PRE=numeric(S) #create an empty vector to store the PRE
for ( i in sites )
{
a <- anova(lm(Y~factor(snp[,i])));
print (a);
F[i] <- a$`F value`[1];
Pval[i] <- a$`Pr`[1];
SSB <- a$`Sum Sq`[1];
SSW <- a$`Sum Sq`[2];
PRE[i] <- SSB / (SSB + SSW);
}
print (paste("Max F:", max(F)));
MSSid <- which(F == max(F)); # index of MSS (Most Significant Site)
MSS = snp[,MSSid];
print (paste("MSS(Most Significant Site):", MSSid));
p0 = length(MSS[MSS==0])/N;
p = min(p0, 1-p0);
print (paste("assoc_freq:", p));
print (paste("assoc_var:", PRE[MSSid]));
# lets do a permutation test
Fdist <- perm.F(Y, snp[,1:S], 1000, sites);
print ("permutation test maxF dist");
print (Fdist);
pvalue <- mean(Fdist > F[MSSid]);
print (paste("assoc_prob:", pvalue));
# close the output file
sink()
--------------------------------------------------------------------------------
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
On Jan 21, 2007, at 5:55 PM, miraceti wrote:
Thank you all for looking at it. I'll fix the code to preallocate the objects. I wonder if there is a way to call anova on all the columns at the same time. Right now I am calling (Y~V1, data) from V1 to V50 through a loop. I tried (Y~., data), but it gave me different values from the results I get when I call them separately, so I can't help but call them 25,000 times...
Have you looked at lapply, sapply, apply and friends? Haris
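[Editor's note: as a sketch of Haris's suggestion (illustrative, not from the thread; the data are made up), sapply() can run one ANOVA per column and collect the F statistics. Note that Y ~ . behaves differently because it fits a single model with all columns as joint predictors, which is why its results don't match the one-column-at-a-time fits.]

```r
# Illustrative sketch: one ANOVA per column via sapply(), collecting
# the F statistic for each predictor separately.
set.seed(1)
dat <- data.frame(matrix(sample(0:2, 50 * 10, replace = TRUE), 50, 10))
Y <- rnorm(50)
Fvals <- sapply(dat, function(col) {
  a <- anova(lm(Y ~ factor(col)))   # separate one-way ANOVA per column
  a$`F value`[1]                    # F statistic for this predictor
})
```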
Dear Haris,

Using lapply() et al. may produce cleaner code, but it won't necessarily speed up a computation. For example:
X <- data.frame(matrix(rnorm(1000*1000), 1000, 1000))
y <- rnorm(1000)
mods <- as.list(1:1000)
system.time(for (i in 1:1000) mods[[i]] <- lm(y ~ X[,i]))
[1] 40.53 0.05 40.61 NA NA
system.time(mods <- lapply(as.list(X), function(x) lm(y ~ x)))
[1] 53.29 0.37 53.94 NA NA

In cases such as this, I don't even find the code using *apply() easier to read.

Regards,
 John
On Jan 21, 2007, at 8:11 PM, John Fox wrote:
Interesting, in my system the results are quite different:

> system.time(for (i in 1:1000) mods[[i]] <- lm(y ~ X[,i]))
[1] 192.035 12.601 797.094 0.000 0.000
> system.time(mods <- lapply(as.list(X), function(x) lm(y ~ x)))
[1] 59.913 9.918 289.030 0.000 0.000

Regular MacOSX install with ~760MB memory.
Haris
Dear Haris,

My timings were on a 3 GHz Pentium 4 system with 1 GB of memory running Win XP SP2 and R 2.4.1. I'm no expert on these matters, and I wouldn't have been surprised by qualitatively different results on different systems, but this difference is larger than I would have expected.

One thing that seems particularly striking in your results is the large difference between elapsed time and user CPU time, making me wonder what else was going on when you ran these examples.

Regards,
 John
On Mon, 22 Jan 2007, Charilaos Skiadas wrote:
But MacOS X is infamous for having rather specific speed problems with its malloc, and so gives different timing results from all other platforms. We are promised a solution in MacOS 10.5. Both of your machines seem very slow compared to mine:
> system.time(for (i in 1:1000) mods[[i]] <- lm(y ~ X[,i]))
   user  system elapsed
 11.011   0.250  11.311
> system.time(mods <- lapply(as.list(X), function(x) lm(y ~ x)))
   user  system elapsed
 13.463   0.260  13.812

and that on a 64-bit platform (AMD64 Linux FC5).
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
Dear Brian,
Thanks for the clarification.
As you can see from the specs (in a previous message), my system is quite old, which probably accounts for at least part of the difference. The ratios of the user times for my and your system aren't too different though:
> 53.29/40.53 # mine
[1] 1.314829
> 13.463/11.011 # yours
[1] 1.222686

Regards,
 John
On Jan 22, 2007, at 10:39 AM, John Fox wrote:
One thing that seems particularly striking in your results is the large difference between elapsed time and user CPU time, making me wonder what else was going on when you ran these examples.
Yes, indeed there were a lot of other things going on; this is the only machine I have and I use it continuously. I'll try to run another test tonight when the machine is not in use. It did seem a very striking difference, though.

But am I wrong in thinking that these measurements should be independent of what other applications are running at the same time, and should measure exactly the CPU time needed to finish this task, regardless of how often the process got to use the CPU? I was working under that assumption, which indeed makes the above comparison a very unfair one, because there was a lot more going on during the first system.time call. Still, the difference is quite large, which could simply have to do with the internals of the two commands, coupled with Prof. Ripley's comments about malloc in Mac OS X.
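[Editor's note: an illustrative sketch, not part of the thread. system.time() separates user and system CPU time, which count only cycles charged to the R process, from elapsed wall-clock time, which also includes time spent waiting. So user time is largely independent of what else is running, but elapsed time is not. Sys.sleep() makes the distinction visible, since it waits without consuming CPU:]

```r
# Sys.sleep() waits without burning CPU: elapsed (wall-clock) time grows
# by about 2 seconds, while user CPU time stays near zero.
t <- system.time(Sys.sleep(2))
```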
Haris