Jim,
No, this is _not_ the problem. If you go to my 1st mail: I have a monster
(at least it was when I purchased it) with 32GB (sic :-) of RAM and 4 dual-core
AMD64 285s (the fastest at that time and still pretty fast now :-).
The machine starts paging when I run 2 copies of R working on two things
like that :-). If you look at my last e-mail, I found a solution but
still have no clue why the heck x<-as.data.frame(y), where y is a list
of the same columns, takes forever - and that is the thing that killed me
before.
Thanks,
Latchezar
-----Original Message-----
From: jim holtman [mailto:jholtman at gmail.com]
Sent: Saturday, July 21, 2007 5:33 PM
To: Latchezar Dimitrov
Cc: Benilton Carvalho; r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?
One of the problems is that you are probably paging on your
system with an object that size (240000 x 1000). This is
about 1GB for a single object:
set.seed(123)
n <- 240000
system.time({
    genoT <- lapply(1:n, function(i) factor(sample(c("AA", "AB", "BB"),
        1000, prob = c(1000, 1, 1), replace = TRUE)))
})
   user  system elapsed
  95.00    0.61  104.71
names(genoT) <- paste("snp", 1:n, sep = "")
object.size(genoT)
I can create it on my 2GB machine as a list, but have
problems converting it to a dataframe because I don't have
enough memory.
So unless you have at least 4GB on your system, it might take
a long time. Look at your performance measurements on your
system and see if you have run out of physical memory and are paging.
On 7/21/07, Latchezar Dimitrov <ldimitro at wfubmc.edu> wrote:
Hi,
Thanks for the help. My 1st question is still unanswered though :-)
Please see below.
-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 3:30 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?
set.seed(123)
genoT <- lapply(1:240000, function(i) factor(sample(c("AA", "AB", "BB"),
    1000, prob = sample(c(1, 1000, 1000), 3), replace = TRUE)))
names(genoT) <- paste("snp", 1:240000, sep = "")
genoT <- as.data.frame(genoT)
Now this _is_ the problem. Everything before converting to a data frame
worked almost instantaneously; however, as.data.frame runs forever.
Obviously there is some scalability/memory-management issue. When I
tried my own method, but creating a new result dataframe (instead of
modifying the old one), it worked like a charm for the 1st 100 cols: ~0.3s.
I figured 300,000 cols should be ~1000s. Nope! It ran for about
50,000(!)s to finish only about 42,000 cols.
BTW, what ver. of R is yours?
Now here's what I "discovered" further.
#-- create a 1-col frame:
geno <- data.frame(c(geno.GASP[[1]], geno.JAG[[1]]),
                   row.names = c(rownames(geno.GASP), rownames(geno.JAG)))
#-- main code: I repeated it w/ j in 1:1000, 2001:3000, etc.,
#-- i.e., adding 1000 cols to geno each time
system.time(
    # for(j in 1:(ncol(geno.GASP))){
    for (j in 3001:4000) {
        gt.GASP <- geno.GASP[[j]]
        for (l in 1:length(levels(gt.GASP))) {
            levels(gt.GASP)[l] <- switch(levels(gt.GASP)[l],
                                         AA = "0", AB = "1", BB = "2")
        }
        gt.JAG <- geno.JAG[[j]]
        #-- (gt.JAG's levels were already recoded, so its loop is commented out)
        geno[[j]] <- factor(c(as.numeric(factor(gt.GASP, levels = 0:2)) - 1
        ###          factor(c(as.numeric(factor(gt.GASP, levels = 0:2)) - 1
                              , as.numeric(factor(gt.JAG, levels = 0:2)) - 1)
                            , levels = 0:2)
    }
)
Times (each one is for 1000 cols!):
[1]  26.673  0.032  26.705  0.000  0.000
[1]  77.186  0.037     ...  0.000
[1] 128.165  0.042 128.209  0.000  0.000
[1] 180.940  0.047 180.989  0.000  0.000
See the big diff and the scaling I mentioned above?
Furthermore, I removed the geno[[j]] assignment while leaving the
operation in place, i.e., replaced it with the ### line above. Times:
[1] 0.857 0.008 0.865 0.000 0.000
Huh!? What the heck! That's my second question :-) Any ideas?
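One plausible explanation for that "what the heck" (my guess, not a diagnosis from the thread): `[[<-` on a data.frame dispatches to `[[<-.data.frame`, which duplicates the whole multi-thousand-column object on every single assignment, so each added column costs more than the last; the same assignment into a plain list modifies one element cheaply. A small sketch with made-up sizes:

```r
# Column-wise assignment: data.frame vs. plain list.
# The data sizes here are invented, just big enough to show the gap.
df  <- data.frame(matrix(rnorm(100 * 2000), nrow = 100))
lst <- as.list(df)
t.df  <- system.time(for (j in 1:200) df[[j]]  <- df[[j]]  + 1)["elapsed"]
t.lst <- system.time(for (j in 1:200) lst[[j]] <- lst[[j]] + 1)["elapsed"]
# t.df grows with the width of df (each [[<- copies all 2000 columns);
# t.lst stays near zero. The results are identical either way.
```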
I still believe my method is near optimal. Of course, I have to get
rid of the assignment bottleneck.
For now the lesson is: "God bless lists"
Here is my final solution:
system.time({
    geno.GASP.L <- lapply(geno.GASP, function(x) {
        for (l in 1:length(levels(x))) {
            levels(x)[l] <- switch(levels(x)[l], AA = "0", AB = "1", BB = "2")
        }
        factor(x, levels = 0:2)
    })
    geno.JAG.L <- lapply(geno.JAG, function(x) {
        # (geno.JAG's levels were already recoded, so no switch loop here)
        factor(x, levels = 0:2)
    })
})
[1] 192.800 1.566 194.413 0.000 0.000 !!!!!!!!! :-)))))
system.time({
    class(geno.GASP.L) <- "data.frame"
    row.names(geno.GASP.L) <- row.names(geno.GASP)
    class(geno.JAG.L) <- "data.frame"
    row.names(geno.JAG.L) <- row.names(geno.JAG)
})
[1] 12.156 0.001 12.155 0.000 0.000
system.time({
    geno <- rbind(geno.GASP.L, geno.JAG.L)
})
[1] 1542.340 9.072 2066.310 0.000 0.000
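The trick in the solution above - recode the columns while they are still a plain list, then stamp the "data.frame" class on it instead of calling as.data.frame() - can be shown on a toy example (a sketch with invented data; the real columns are the 240K SNPs):

```r
set.seed(1)
# Five small genotype columns, standing in for the 240K SNP factors.
geno <- lapply(1:5, function(i)
    factor(sample(c("AA", "AB", "BB"), 10, replace = TRUE)))
names(geno) <- paste("snp", 1:5, sep = "")
# Recode every column to the uniform level set 0/1/2 while still a list:
geno.L <- lapply(geno, function(x) {
    levels(x) <- c(AA = "0", AB = "1", BB = "2")[levels(x)]
    factor(x, levels = 0:2)
})
# Turn the list into a data.frame without copying each column:
class(geno.L) <- "data.frame"
row.names(geno.L) <- paste("sample", 1:10, sep = "")
```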
I logged my notes here as I was trying various things. Part of the
reason is my two questions:
"What was wrong with me?" and
"What the heck?!" (remember above? :-)))
which still remain unanswered :-(
I would have had a lot of fun if I did not have to have this done by ...
Yesterday :-))
Thanks a lot for the help
Latchezar
dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB", "BB")) - 1))
   user  system elapsed
119.288   0.004 119.339
(for all 240K)
best,
b
ps: note that "out" is a list.
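The reason the `match` version is fast: matching each genotype against the ordered level set returns its position directly as a plain vector, with no factor machinery per column. A minimal illustration (made-up values):

```r
# Recode one genotype column by position in the ordered level set.
x <- factor(c("AB", "BB", "AA", "AB"), levels = c("AA", "AB", "BB"))
out <- match(x, c("AA", "AB", "BB")) - 1  # 0 = AA, 1 = AB, 2 = BB
out  # 1 2 0 1 -- an ordinary vector, not a factor
```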
On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:
-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 12:25 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?
it looks like whatever method you used to genotype the
1002 samples on the STY array gave you a transposed matrix of
genotype calls. :-)
It only looks like it :-)
Otherwise it is a correctly created dataframe of 1002 rows and a (big
number) of columns (SNP genotypes). It worked perfectly until I
decided to put together two cohorts independently processed in R
already. I got stuck with my lack of foresight: I could
have put 3 dummy lines w/ AA, AB, and BB on each one to make sure all
genotypes are present, and that's it! Lesson for the future :-)
Maybe I am not using "columns" and "rows" appropriately, but the
dataframe is correct (I have not used FORTRAN since ...) - as
str says, 1002 observ. of (big number) vars.
i'd use:
genoT = read.table(yourFile, stringsAsFactors = FALSE)
as a starting point... but I don't think that would be efficient
(you'd need to fix one column at a time - lapply).
No, it was not efficient at all. As a matter of fact, nothing is more
efficient than loading already-read data, alas :-(
i'd preprocess yourFile before trying to load it:
cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile
and, now, in R:
genoT = read.table(outFile, header=TRUE)
... Too late ;-) As it must be clear by now, I have two cohorts
to put together with rbind(geno1,geno2). The issue again is the
"uniformization" of factor variables w/ missing factors - they end
up like levels AA,BB on one of them and levels AB,BB on the other, which
means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1
on the second - complete mess. That's why I tried to make both uniform,
i.e.,
levels "AA","AB", and "BB" for every SNP, and then rbind works.
In any case my 1st question remains: "What's wrong with me?"
b
On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:
Hello,
This is a speed question. I have a dataframe genoT:
'data.frame': 1002 obs. of 238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 ...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 ...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 ...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 ...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 ...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 2 1 ...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 ...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 ...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 ...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 ...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 1 2 2 3 ...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 ...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 2 ...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 1 1 2 ...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 ...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 ...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 2 ...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 3 1 1 1 ...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 ...
Its columns are factors with a different number of levels, because
that's what I got from read.table, i.e., it dropped missing levels. I
want to convert it to uniform factors with 3 levels. The columns
above show already-converted columns; the rest are not yet
converted.
Here's my attempt, which is a complete failure speed-wise:
system.time(
    for (j in 1:10) {       #-- this is to try the 1st 10 cols and
                            #-- measure the time; otherwise it is ncol(genoT) instead of 10
        gt <- genoT[[j]]    #-- this is to avoid repeated dataframe indexing
        for (l in 1:length(levels(gt))) {
            levels(gt)[l] <- switch(levels(gt)[l], AA = "0", AB = "1", BB = "2")
            #-- convert levels to "0","1", or "2"
        }
        genoT[[j]] <- factor(gt, levels = 0:2)  #-- re-factor
                                                #-- and put it back
    }
)
[1] 785.085 4.358 789.454 0.000 0.000
789s for 10 columns only!
To me it seems like replacing 10 x 3 levels and then making a factor of
a 1002-element vector x 10 is a "negligible" amount of the work
needed.
So, what's wrong with me? Any idea how to accelerate this
transformation, or (to go to the very beginning) how to make read.table use
a fixed set of levels ("AA","AB", and "BB") and not drop a (missing)
level?
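One way to get a fixed level set from the start (a sketch, not an answer given in the thread): read the genotype columns as character and factor them yourself with an explicit levels= argument, so a column where some genotype never occurs still gets all three levels:

```r
# Simulate a small genotype file where snp1 never shows "BB".
tmp <- tempfile()
writeLines(c("snp1 snp2", "AA AB", "AB BB", "AA AB"), tmp)
genoT <- read.table(tmp, header = TRUE, colClasses = "character")
# Force the same three levels onto every column:
genoT[] <- lapply(genoT, factor, levels = c("AA", "AB", "BB"))
# levels(genoT$snp1) is now "AA" "AB" "BB" even though "BB" never occurs in it.
```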
R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit.
The machine has 32GB RAM and AMD Opteron 285 (2.? GHz) CPUs running
it.
Thank you very much for the help,
Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,