Dataframe of factors transform speed?

Hello,

This is a speed question. I have a dataframe genoT:
dim(genoT)
[1]   1002 238304
str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1
...

Its columns are factors with different number of levels (from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing levels). I
want to convert it to uniform factors with 3 levels. The 1st 10 rows
above show already converted columns and the rest are not yet converted.
Here's my attempt wich is a complete failure as speed:
system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <- switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making a factor of
1002 element vector x 10 is a "negligible" amount of operations needed.

So, what's wrong with me? Any idea how to accelerate significantly the
transformation or (to go to the very beginning) to make read.table use a
fixed set of levels ("AA","AB", and "BB") and not to drop any (missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,
Winston-Salem, North Carolina, USA
it looks like that whatever method you used to genotype the 1002  
samples on the STY array gave you a transposed matrix of genotype  
calls. :-)

i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be efficient (as  
you'd need to fix one column at a time - lapply).

i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/ 
g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)

b

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1  
2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1  
2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3  
3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2  
2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1  
1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2  
1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1  
1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1  
1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1  
1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1  
1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1  
NA 2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3  
1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2  
2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1  
1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1  
2 2 1
...

Its columns are factors with different number of levels (from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing levels). I
want to convert it to uniform factors with 3 levels. The 1st 10 rows
above show already converted columns and the rest are not yet  
converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <- switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level  
factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making a  
factor of
1002 element vector x 10 is a "negligible" amount of operations  
needed.

So, what's wrong with me? Any idea how to accelerate significantly the
transformation or (to go to the very beginning) to make read.table  
use a
fixed set of levels ("AA","AB", and "BB") and not to drop any  
(missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,
Winston-Salem, North Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting- 
guide.html
and provide commented, minimal, self-contained, reproducible code.
Is this what you want?  It took 0.01 seconds to convert 20 rows of the
test data:
# create some data     (20 rows with 1000 columns)
n <- 20
result <- list()
vals <- c("AA", "AB", "BB")
for (i in 1:n){
+     result[[as.character(i)]] <- sample(vals,1000, replace=TRUE,
prob=c(9000,1,1))
+ }
result.df <- do.call('data.frame', result)

str(result.df)
'data.frame':   1000 obs. of  20 variables:
 $ X1 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X2 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X3 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X4 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X5 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ X6 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X7 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X8 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X9 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X10: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X11: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X12: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X13: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X14: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X15: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X16: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X17: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X18: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X19: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X20: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
# go through each row and convert the factors according to 'vals' above
system.time({      # time to convert 20 rows
+     x <- lapply(result.df, function(facts){
+         factor(match(as.character(facts), vals) - 1, levels=0:2)
+     })
+     result.df <- do.call('data.frame', x)
+ })
   user  system elapsed
   0.01    0.00    0.01
str(result.df)
'data.frame':   1000 obs. of  20 variables:
 $ X1 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X2 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X3 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X4 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X5 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X6 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X7 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X8 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X9 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X10: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X11: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X12: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X13: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X14: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X15: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X16: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X17: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X18: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X19: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X20: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1
...

Its columns are factors with different number of levels (from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing levels). I
want to convert it to uniform factors with 3 levels. The 1st 10 rows
above show already converted columns and the rest are not yet converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <- switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making a factor of
1002 element vector x 10 is a "negligible" amount of operations needed.

So, what's wrong with me? Any idea how to accelerate significantly the
transformation or (to go to the very beginning) to make read.table use a
fixed set of levels ("AA","AB", and "BB") and not to drop any (missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,
Winston-Salem, North Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
Hi,
-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu] 
Sent: Friday, July 20, 2007 12:25 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

it looks like that whatever method you used to genotype the 
1002 samples on the STY array gave you a transposed matrix of 
genotype calls. :-)
It only looks like :-)

Otherwise it is correctly created dataframe of 1002 samples X (big
number) of columns (SNP genotypes). It worked perfectly until I decided
to put together to cohorts independently processed in R already. I got
stuck with my lack of foreseeing. Otherwise I would have put 3 dummy
lines w/ AA,AB, and AB on each one to make sure all 3 genotypes are
present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately here but the
dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-) - as
str says 1002 observ. of (big number) vars.
i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be 
efficient (as you'd need to fix one column at a time - lapply).
No it was not efficient at all. 'matter of fact nothing is more
efficient then loading already read data, alas :-(
i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 
's/BB/3/ g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)
... Too late ;-) As it must be clear now I have two dataframes I want to
put together with rbind(geno1,geno2). The issue again is
"uniformization" of factor variables w/ missing factors - they ended up
like levels AA,BB on one of the and levels AB,BB on the other which
means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on the
second - complete mess. That's why I tried to make both uniform, i.e.
levels "AA","AB", and "BB" for every SNP and then rbind works.

In any case my 1st questions remains: "What's wrong with me?" :-)

Thanks,
Latchezar
b

On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 
3 3 3 3 3 
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 
1 1 2 2 2 
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 
1 1 1 1 1 
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 
3 3 3 3 3 
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 
3 2 3 3 1 
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1
2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 
1 1 1 1 2 
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 
3 3 3 3 2 
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 
1 1 1 1 2 
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 
1 2 1 1 3 
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1
2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3
3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2
2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2
1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1
1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1
1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1
1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 
2 2 NA 1 NA 
2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3
1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2
2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1
2 2 1
...

Its columns are factors with different number of levels 
(from 1 to 3 - 
that's what I got from read.table, i.e., it dropped missing 
levels). I 
want to convert it to uniform factors with 3 levels. The 
1st 10 rows 
above show already converted columns and the rest are not yet 
converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <- 
switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level  
factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making 
a factor 
of
1002 element vector x 10 is a "negligible" amount of operations 
needed.

So, what's wrong with me? Any idea how to accelerate 
significantly the 
transformation or (to go to the very beginning) to make 
read.table use 
a fixed set of levels ("AA","AB", and "BB") and not to drop any
(missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) 
so it's not 
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine, Winston-Salem, North 
Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting- 
guide.html and provide commented, minimal, self-contained, 
reproducible code.

set.seed(123)
genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB",  
"BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
names(genoT) = paste("snp", 1:240000, sep="")
genoT = as.data.frame(genoT)
dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB",  
"BB"))-1))
##
##
    user  system elapsed
119.288   0.004 119.339

(for all 240K)

best,
b

ps: note that "out" is a list.

Hi,

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 12:25 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

it looks like that whatever method you used to genotype the
1002 samples on the STY array gave you a transposed matrix of
genotype calls. :-)
It only looks like :-)

Otherwise it is correctly created dataframe of 1002 samples X (big
number) of columns (SNP genotypes). It worked perfectly until I  
decided
to put together to cohorts independently processed in R already. I got
stuck with my lack of foreseeing. Otherwise I would have put 3 dummy
lines w/ AA,AB, and AB on each one to make sure all 3 genotypes are
present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately here but the
dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-)  
- as
str says 1002 observ. of (big number) vars.

i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be
efficient (as you'd need to fix one column at a time - lapply).
No it was not efficient at all. 'matter of fact nothing is more
efficient then loading already read data, alas :-(

i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e
's/BB/3/ g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)
... Too late ;-) As it must be clear now I have two dataframes I  
want to
put together with rbind(geno1,geno2). The issue again is
"uniformization" of factor variables w/ missing factors - they  
ended up
like levels AA,BB on one of the and levels AB,BB on the other which
means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on  
the
second - complete mess. That's why I tried to make both uniform, i.e.
levels "AA","AB", and "BB" for every SNP and then rbind works.

In any case my 1st questions remains: "What's wrong with me?" :-)

Thanks,
Latchezar

b

On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1
1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1
1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2
3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1
2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3
3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2
1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1
2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3
3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2
2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2
1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1
1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1
1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1
1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2
2 2 NA 1 NA
2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3
1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2
2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1
2 2 1
...

Its columns are factors with different number of levels
(from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing
levels). I
want to convert it to uniform factors with 3 levels. The
1st 10 rows
above show already converted columns and the rest are not yet
converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <-
switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level
factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making
a factor
of
1002 element vector x 10 is a "negligible" amount of operations
needed.

So, what's wrong with me? Any idea how to accelerate
significantly the
transformation or (to go to the very beginning) to make
read.table use
a fixed set of levels ("AA","AB", and "BB") and not to drop any
(missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz)
so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine, Winston-Salem, North
Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html and provide commented, minimal, self-contained,
reproducible code.

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304
It looks like these are all numeric originally. Handling these as a
vector or matrix will speed things up a bit. You can then stitch
together a data.frame:

# simulate: 
#       genoT.names <- scan('data.file, what='a', nlines=1, <etc> ) 
# 	genoT <- scan('data.file',skip=1)
#
genoT <- sample(0:2, 240000*1002, repl=T)
t1 <- proc.time()
genoT <- factor(genoT,0:2,c("AA","AB","BB"))
dim(genoT) <- c(1002,240000)
genoT.list <- lapply(1:240000, function(x) genoT[,x])
# simulate: names(genoT.list) <- genoT.names :
names(genoT.list) <- make.names(1:240000)
class(genoT.list) <- "data.frame"
row.names(genoT.list) <- 1:1002
proc.time()-t1
user  system elapsed
  20.978   2.036  49.714

Most of the _elapsed_ time is due to lags in copy-and-paste-ing in the 
commands.

HTH,

Chuck

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
$ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
$ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2
...
$ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1
...
$ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3
...
$ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1
...
$ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1
...
$ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
$ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2
...
$ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2
...
$ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3
...
$ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3
...
$ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3
...
$ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2
...
$ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
$ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2
...
$ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1
...
$ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1
...
$ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
$ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2
...
$ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2
1 ...
$ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1
...
$ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
$ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
$ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1
...

Its columns are factors with different number of levels (from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing levels). I
want to convert it to uniform factors with 3 levels. The 1st 10 rows
above show already converted columns and the rest are not yet converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <- switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making a factor of
1002 element vector x 10 is a "negligible" amount of operations needed.

So, what's wrong with me? Any idea how to accelerate significantly the
transformation or (to go to the very beginning) to make read.table use a
fixed set of levels ("AA","AB", and "BB") and not to drop any (missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,
Winston-Salem, North Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
Hi,

Thanks for the help. My 1st question still unanswered though :-) Please
see bellow
-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu] 
Sent: Friday, July 20, 2007 3:30 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

set.seed(123)
genoT = lapply(1:240000, function(i) factor(sample(c("AA", 
"AB", "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
names(genoT) = paste("snp", 1:240000, sep="") genoT = 
as.data.frame(genoT)
Now this _is the problem. Everything before converting to data.frame
worked almost instantaneously however as.data.frame runs forever.
Obviously there is some scalability memory management issue. When I
tried my own method but creating a new result (instead of modifying the
old) dataframe it worked like a charm for the 1st 100 cols ~ .3s. I
figured 300,000 cols should be ~1000s. Nope! It ran for about 50,000(!)s
to finish about 42,000 cols only. 

BTW, what ver. of R is yours?

Now here's what I "discovered" further.

#-- create a 1-col frame:
    geno   <-
data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),row.names=c(rownames(geno.GAS
P),rownames(geno.JAG)))

#-- main code I repeated it w/ j in 1:1000, 2001:3000, and 3001:4000,
i.e., adding a 1000 of cols to geno each time

system.time(
#   for(j in 1:(ncol(geno.GASP      ))){
    for(j in 3001:(4000              )){
      gt.GASP<-geno.GASP[[j]]
       for(l in 1:length(gt.GASP at levels)){
         levels(gt.GASP)[l] <-
switch(gt.GASP at levels[l],AA="0",AB="1",BB="2")
       }
       gt.JAG <-geno.JAG [[j]]
#      for(l in 1:length(gt.JAG @levels)){
#        levels(gt.JAG )[l] <- switch(gt.JAG
@levels[l],AA="0",AB="1",BB="2")
#      }
       geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
###               factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
                          ,as.numeric(factor(gt.JAG, levels=0:2))-1
                          )
                        ,levels=0:2
                        )
    }
)

Times (each one is for a 1000 cols!):
[1] 26.673  0.032 26.705  0.000  0.000
[1] 77.186  0.037 77.225  0.000  0.000
[1] 128.165   0.042 128.209   0.000   0.000
[1] 180.940   0.047 180.989   0.000   0.000

See the big diff and the scaling I mentioned above?

Further more I removed geno[[j]] assignment leaving the operation
though, i.e., replaced it with ### line above. Times:

[1] 0.857 0.008 0.865 0.000 0.000

Huh!? What the heck! That's my second question :-) Any ideas?

I still believe my method is near optimal. Of course I have to somehow
get rid of the assignment bottleneck.

For now the lesson is: "God bless lists"

Here is my final solution:
system.time({
+     geno.GASP.L<-lapply(geno.GASP
+                        ,function(x){
+                           for(l in 1:length(x at levels)){levels(x)[l] <-
switch(x at levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+     geno.JAG.L <-lapply(geno.JAG
+                        ,function(x){
+ #                         for(l in 1:length(x at levels)){levels(x)[l] <-
switch(x at levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+ })
[1] 192.800   1.566 194.413   0.000   0.000   !!!!!!!!! :-)))))
system.time({
+     class    (geno.GASP.L)<-"data.frame"
+     row.names(geno.GASP.L)<-row.names(geno.GASP)
+     class    (geno.JAG.L )<-"data.frame"
+     row.names(geno.JAG.L )<-row.names(geno.JAG )
+ })
[1] 12.156  0.001 12.155  0.000  0.000
system.time({
+     geno<-rbind(geno.GASP.L,geno.JAG.L)
+ })
[1] 1542.340    9.072 2066.310    0.000    0.000

I logged my notes here as I was trying various things. Partly the reason
is my two questions:

"What was wrong with me?" and
"What the heck?!" remember above? :-)))

which  still remain unanswered :-(

I would have had a lot of fun if I had not to have this done by ...
Yesterday :-))

Thanks a lot for the help

Latchezar
dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB",
"BB"))-1))
##
##
    user  system elapsed
119.288   0.004 119.339

(for all 240K)

best,
b

ps: note that "out" is a list.

On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:

Hi,

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 12:25 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

it looks like that whatever method you used to genotype the
1002 samples on the STY array gave you a transposed matrix of 
genotype calls. :-)
It only looks like :-)

Otherwise it is correctly created dataframe of 1002 samples X (big
number) of columns (SNP genotypes). It worked perfectly until I 
decided to put together to cohorts independently processed in R 
already. I got stuck with my lack of foreseeing. Otherwise I would 
have put 3 dummy lines w/ AA,AB, and AB on each one to make 
sure all 3 
genotypes are present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately here but the 
dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-)
- as
str says 1002 observ. of (big number) vars.

i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be 
efficient (as 
you'd need to fix one column at a time - lapply).
No it was not efficient at all. 'matter of fact nothing is more 
efficient then loading already read data, alas :-(

i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 
's/BB/3/ g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)
... Too late ;-) As it must be clear now I have two 
dataframes I want 
to put together with rbind(geno1,geno2). The issue again is 
"uniformization" of factor variables w/ missing factors - 
they ended 
up like levels AA,BB on one of the and levels AB,BB on the 
other which 
means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on 
the second - complete mess. That's why I tried to make both 
uniform, 
i.e.
levels "AA","AB", and "BB" for every SNP and then rbind works.

In any case my 1st questions remains: "What's wrong with me?" :-)

Thanks,
Latchezar

b

On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1
1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1
1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2
3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 
1 NA 2 1 1
2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3
3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2
1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1
2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3
3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2
2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2
1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1
1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1
1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1
1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2
2 2 NA 1 NA
2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3
1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2
2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1
2 2 1
...

Its columns are factors with different number of levels
(from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing
levels). I
want to convert it to uniform factors with 3 levels. The
1st 10 rows
above show already converted columns and the rest are not yet 
converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st 
10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <-
switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level
factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making
a factor
of
1002 element vector x 10 is a "negligible" amount of operations 
needed.

So, what's wrong with me? Any idea how to accelerate
significantly the
transformation or (to go to the very beginning) to make
read.table use
a fixed set of levels ("AA","AB", and "BB") and not to drop any
(missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz)
so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine, Winston-Salem, North 
Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting- 
guide.html and provide commented, minimal, self-contained, 
reproducible code.

One of the problems is that you are probably paging on your system
with an object that size (240000 x 1000).  This is about 1GB for a
single object:
set.seed(123)
n <- 240000
system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA",
+ "AB", "BB"), 1000, prob=c(1000, 1, 1), rep=T)))
+ })
   user  system elapsed
  95.00    0.61  104.71
names(genoT) = paste("snp", 1:n, sep="")

object.size(genoT)
[1] 1045258752

I can create it on my 2GB machine as a list, but have problems
converting it to a dataframe because I don't have enough memory.

So unless you have at least 4GB on your system, it might take a long
time.  Look at your performance measurements on your system and see if
you have run out of physical memory and are paging.
Hi,

Thanks for the help. My 1st question still unanswered though :-) Please
see bellow

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 3:30 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

set.seed(123)
genoT = lapply(1:240000, function(i) factor(sample(c("AA",
"AB", "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
names(genoT) = paste("snp", 1:240000, sep="") genoT =
as.data.frame(genoT)
Now this _is the problem. Everything before converting to data.frame
worked almost instantaneously however as.data.frame runs forever.
Obviously there is some scalability memory management issue. When I
tried my own method but creating a new result (instead of modifying the
old) dataframe it worked like a charm for the 1st 100 cols ~ .3s. I
figured 300,000 cols should be ~1000s. Nope! It ran for about 50,000(!)s
to finish about 42,000 cols only.

BTW, what ver. of R is yours?

Now here's what I "discovered" further.

#-- create a 1-col frame:
   geno   <-
data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),row.names=c(rownames(geno.GAS
P),rownames(geno.JAG)))

#-- main code I repeated it w/ j in 1:1000, 2001:3000, and 3001:4000,
i.e., adding a 1000 of cols to geno each time

system.time(
#   for(j in 1:(ncol(geno.GASP      ))){
   for(j in 3001:(4000              )){
     gt.GASP<-geno.GASP[[j]]
      for(l in 1:length(gt.GASP at levels)){
        levels(gt.GASP)[l] <-
switch(gt.GASP at levels[l],AA="0",AB="1",BB="2")
      }
      gt.JAG <-geno.JAG [[j]]
#      for(l in 1:length(gt.JAG @levels)){
#        levels(gt.JAG )[l] <- switch(gt.JAG
@levels[l],AA="0",AB="1",BB="2")
#      }
      geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
###               factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
                         ,as.numeric(factor(gt.JAG, levels=0:2))-1
                         )
                       ,levels=0:2
                       )
   }
)

Times (each one is for a 1000 cols!):
[1] 26.673  0.032 26.705  0.000  0.000
[1] 77.186  0.037 77.225  0.000  0.000
[1] 128.165   0.042 128.209   0.000   0.000
[1] 180.940   0.047 180.989   0.000   0.000

See the big diff and the scaling I mentioned above?

Further more I removed geno[[j]] assignment leaving the operation
though, i.e., replaced it with ### line above. Times:

[1] 0.857 0.008 0.865 0.000 0.000

Huh!? What the heck! That's my second question :-) Any ideas?

I still believe my method is near optimal. Of course I have to somehow
get rid of the assignment bottleneck.

For now the lesson is: "God bless lists"

Here is my final solution:

system.time({
+     geno.GASP.L<-lapply(geno.GASP
+                        ,function(x){
+                           for(l in 1:length(x at levels)){levels(x)[l] <-
switch(x at levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+     geno.JAG.L <-lapply(geno.JAG
+                        ,function(x){
+ #                         for(l in 1:length(x at levels)){levels(x)[l] <-
switch(x at levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+ })
[1] 192.800   1.566 194.413   0.000   0.000   !!!!!!!!! :-)))))
system.time({
+     class    (geno.GASP.L)<-"data.frame"
+     row.names(geno.GASP.L)<-row.names(geno.GASP)
+     class    (geno.JAG.L )<-"data.frame"
+     row.names(geno.JAG.L )<-row.names(geno.JAG )
+ })
[1] 12.156  0.001 12.155  0.000  0.000
system.time({
+     geno<-rbind(geno.GASP.L,geno.JAG.L)
+ })
[1] 1542.340    9.072 2066.310    0.000    0.000

I logged my notes here as I was trying various things. Partly the reason
is my two questions:

"What was wrong with me?" and
"What the heck?!" remember above? :-)))

which  still remain unanswered :-(

I would have had a lot of fun if I had not to have this done by ...
Yesterday :-))

Thanks a lot for the help

Latchezar

dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB",
"BB"))-1))
##
##
    user  system elapsed
119.288   0.004 119.339

(for all 240K)

best,
b

ps: note that "out" is a list.

On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:

Hi,

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 12:25 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

it looks like that whatever method you used to genotype the
1002 samples on the STY array gave you a transposed matrix of
genotype calls. :-)
It only looks like :-)

Otherwise it is correctly created dataframe of 1002 samples X (big
number) of columns (SNP genotypes). It worked perfectly until I
decided to put together to cohorts independently processed in R
already. I got stuck with my lack of foreseeing. Otherwise I would
have put 3 dummy lines w/ AA,AB, and AB on each one to make
sure all 3
genotypes are present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately here but the
dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-)
- as
str says 1002 observ. of (big number) vars.

i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be
efficient (as
you'd need to fix one column at a time - lapply).
No it was not efficient at all. 'matter of fact nothing is more
efficient then loading already read data, alas :-(

i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e
's/BB/3/ g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)
... Too late ;-) As it must be clear now I have two
dataframes I want
to put together with rbind(geno1,geno2). The issue again is
"uniformization" of factor variables w/ missing factors -
they ended
up like levels AA,BB on one of the and levels AB,BB on the
other which
means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on
the second - complete mess. That's why I tried to make both
uniform,
i.e.
levels "AA","AB", and "BB" for every SNP and then rbind works.

In any case my 1st questions remains: "What's wrong with me?" :-)

Thanks,
Latchezar

b

On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1
1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1
1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2
3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA
1 NA 2 1 1
2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3
3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2
1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1
2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3
3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2
2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2
1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1
1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1
1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1
1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2
2 2 NA 1 NA
2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3
1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2
2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1
1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1
2 2 1
...

Its columns are factors with different number of levels
(from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing
levels). I
want to convert it to uniform factors with 3 levels. The
1st 10 rows
above show already converted columns and the rest are not yet
converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st
10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <-
switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level
factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making
a factor
of
1002 element vector x 10 is a "negligible" amount of operations
needed.

So, what's wrong with me? Any idea how to accelerate
significantly the
transformation or (to go to the very beginning) to make
read.table use
a fixed set of levels ("AA","AB", and "BB") and not to drop any
(missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz)
so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine, Winston-Salem, North
Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-
guide.html and provide commented, minimal, self-contained,
reproducible code.

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
Jim,

No, this is _not the problem. If you go to my 1st mail I have a monster
(at least was when I purchased it) with 32GB (sic :-) of RAM and 4 dual
core AMD64 285 (the fastest at that time and still pretty fast now :-) 

The machine stats paging when I run 2 copies of R working on two things
like that :-). If you look at my last e-mail I found a solution but
still have no clue why the heck x<-as.data.frame(y) where why is a list
of the same columns take real for ever and this the thing that killed me
before.

Thanks,
Latchezar
-----Original Message-----
From: jim holtman [mailto:jholtman at gmail.com] 
Sent: Saturday, July 21, 2007 5:33 PM
To: Latchezar Dimitrov
Cc: Benilton Carvalho; r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

One of the problems is that you are probably paging on your 
system with an object that size (240000 x 1000).  This is 
about 1GB for a single object:

set.seed(123)
n <- 240000
system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA", "AB", "BB"), 
+ 1000, prob=c(1000, 1, 1), rep=T)))
+ })
   user  system elapsed
  95.00    0.61  104.71
names(genoT) = paste("snp", 1:n, sep="")

object.size(genoT)
[1] 1045258752

I can create it on my 2GB machine as a list, but have 
problems converting it to a dataframe because I don't have 
enough memory.

So unless you have at least 4GB on your system, it might take 
a long time.  Look at your performance measurements on your 
system and see if you have run out of physical memory and are paging.

On 7/21/07, Latchezar Dimitrov <ldimitro at wfubmc.edu> wrote:
Hi,

Thanks for the help. My 1st question still unanswered though :-) 
Please see bellow

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 3:30 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

set.seed(123)
genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB", 
"BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
names(genoT) = paste("snp", 1:240000, sep="") genoT =
as.data.frame(genoT)
Now this _is the problem. Everything before converting to 
data.frame 
worked almost instantaneously however as.data.frame runs forever.
Obviously there is some scalability memory management issue. When I 
tried my own method but creating a new result (instead of modifying 
the
old) dataframe it worked like a charm for the 1st 100 cols ~ .3s. I 
figured 300,000 cols should be ~1000s. Nope! It ran for about 
50,000(!)s to finish about 42,000 cols only.

BTW, what ver. of R is yours?

Now here's what I "discovered" further.

#-- create a 1-col frame:
   geno   <-

data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),row.names=c(rownames(geno.G
AS
P),rownames(geno.JAG)))

#-- main code I repeated it w/ j in 1:1000, 2001:3000, and 
3001:4000, 
i.e., adding a 1000 of cols to geno each time

system.time(
#   for(j in 1:(ncol(geno.GASP      ))){
   for(j in 3001:(4000              )){
     gt.GASP<-geno.GASP[[j]]
      for(l in 1:length(gt.GASP at levels)){
        levels(gt.GASP)[l] <-
switch(gt.GASP at levels[l],AA="0",AB="1",BB="2")
      }
      gt.JAG <-geno.JAG [[j]]
#      for(l in 1:length(gt.JAG @levels)){
#        levels(gt.JAG )[l] <- switch(gt.JAG
@levels[l],AA="0",AB="1",BB="2")
#      }
      geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
###               factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
                         ,as.numeric(factor(gt.JAG, levels=0:2))-1
                         )
                       ,levels=0:2
                       )
   }
)

Times (each one is for a 1000 cols!):
[1] 26.673  0.032 26.705  0.000  0.000 [1] 77.186  0.037 
77.225  0.000  
0.000
[1] 128.165   0.042 128.209   0.000   0.000
[1] 180.940   0.047 180.989   0.000   0.000

See the big diff and the scaling I mentioned above?

Further more I removed geno[[j]] assignment leaving the operation 
though, i.e., replaced it with ### line above. Times:

[1] 0.857 0.008 0.865 0.000 0.000

Huh!? What the heck! That's my second question :-) Any ideas?

I still believe my method is near optimal. Of course I have 
to somehow 
get rid of the assignment bottleneck.

For now the lesson is: "God bless lists"

Here is my final solution:

system.time({
+     geno.GASP.L<-lapply(geno.GASP
+                        ,function(x){
+                           for(l in 
1:length(x at levels)){levels(x)[l] 
+ <-
switch(x at levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+     geno.JAG.L <-lapply(geno.JAG
+                        ,function(x){
+ #                         for(l in 
1:length(x at levels)){levels(x)[l] <-
switch(x at levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+ })
[1] 192.800   1.566 194.413   0.000   0.000   !!!!!!!!! :-)))))
system.time({
+     class    (geno.GASP.L)<-"data.frame"
+     row.names(geno.GASP.L)<-row.names(geno.GASP)
+     class    (geno.JAG.L )<-"data.frame"
+     row.names(geno.JAG.L )<-row.names(geno.JAG )
+ })
[1] 12.156  0.001 12.155  0.000  0.000
system.time({
+     geno<-rbind(geno.GASP.L,geno.JAG.L)
+ })
[1] 1542.340    9.072 2066.310    0.000    0.000

I logged my notes here as I was trying various things. Partly the 
reason is my two questions:

"What was wrong with me?" and
"What the heck?!" remember above? :-)))

which  still remain unanswered :-(

I would have had a lot of fun if I had not to have this done by ...
Yesterday :-))

Thanks a lot for the help

Latchezar

dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x, 
c("AA", "AB",
"BB"))-1))
##
##
    user  system elapsed
119.288   0.004 119.339

(for all 240K)

best,
b

ps: note that "out" is a list.

On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:

Hi,

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 12:25 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

it looks like that whatever method you used to genotype the
1002 samples on the STY array gave you a transposed matrix of 
genotype calls. :-)
It only looks like :-)

Otherwise it is correctly created dataframe of 1002 
samples X (big
number) of columns (SNP genotypes). It worked perfectly until I 
decided to put together to cohorts independently processed in R 
already. I got stuck with my lack of foreseeing. 
Otherwise I would 
have put 3 dummy lines w/ AA,AB, and AB on each one to make
sure all 3
genotypes are present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately 
here but the 
dataframe is correct (I have not used FORTRAN since 
FORTRAN IV ;-)
- as
str says 1002 observ. of (big number) vars.

i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be
efficient (as
you'd need to fix one column at a time - lapply).
No it was not efficient at all. 'matter of fact nothing is more 
efficient then loading already read data, alas :-(

i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 
's/BB/3/ g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)
... Too late ;-) As it must be clear now I have two
dataframes I want
to put together with rbind(geno1,geno2). The issue again is 
"uniformization" of factor variables w/ missing factors -
they ended
up like levels AA,BB on one of the and levels AB,BB on the
other which
means as.numeric of AA is 1 on the 1st and as.numeric 
of AB is 1 
on the second - complete mess. That's why I tried to make both
uniform,
i.e.
levels "AA","AB", and "BB" for every SNP and then rbind works.

In any case my 1st questions remains: "What's wrong 
with me?" :-)
Thanks,
Latchezar

b

On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1
1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1
1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2
3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA
1 NA 2 1 1
2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3
3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2
1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 
2 2 3 3 3 2 
1
2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 
3 3 3 3 3 3 
3
3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 
2 2 2 1 1 1 
2
2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 
1 1 1 1 1 
1
1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 
2 3 2 2 3 2 
2
1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 
1 1 1 1 1 1 
1
1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 
1 1 2 1 1 1 
1
1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 
1 1 1 1 1 
1
1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 
1 1 1 1 1 2 
1
1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2
2 2 NA 1 NA
2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 
1 2 2 1 1 1 
3
1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 
2 2 2 2 2 
2
2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 
1 1 1 1 1 
1
1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 
1 1 1 1 1 1 
1
2 2 1
...

Its columns are factors with different number of levels
(from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing
levels). I
want to convert it to uniform factors with 3 levels. The
1st 10 rows
above show already converted columns and the rest are not yet 
converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st
10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid 
2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <-
switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- 
make a 3-level
factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making
a factor
of
1002 element vector x 10 is a "negligible" amount of 
operations 
needed.

So, what's wrong with me? Any idea how to accelerate
significantly the
transformation or (to go to the very beginning) to make
read.table use
a fixed set of levels ("AA","AB", and "BB") and not 
to drop any
(missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz)
so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine, 
Winston-Salem, North 
Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-
guide.html and provide commented, minimal, self-contained, 
reproducible code.

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

The problem is in the way that 'as.data.frame' works.  Use Rprof on a
small list and you will see where it is spending its time.

Now if you are really sure that all your data is consistent with being
a data frame,
you can create your own dataframe structure your self.  Not that I
would advocate it, but if you look at the output of 'dput' on a
dataframe, you can construct your own.

Here it took 20 seconds to create the test data with a list of 50,000
and only 2 seconds to create the data frame from that.
set.seed(123)
n <- 50000
system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA",
+ "AB", "BB"), 1000, prob=c(1000, 1, 1), rep=T)))
+ })
   user  system elapsed
  20.85    0.12   22.83
names(genoT) = paste("snp", 1:n, sep="")

# create your own data frame structure -- if you are real sure of your data

system.time(genoTz <- structure(genoT, .Names=names(genoT),
+     row.names=c(NA, -length(genoT[[1]])), class='data.frame'))
   user  system elapsed
   2.00    0.08    2.11
str(genoTz)
'data.frame':   1000 obs. of  50000 variables:
 $ snp1    : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp2    : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp3    : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp4    : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp5    : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp6    : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp7    : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp8    : Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp9    : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp10   : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ snp11   : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...

Jim,

No, this is _not the problem. If you go to my 1st mail I have a monster
(at least was when I purchased it) with 32GB (sic :-) of RAM and 4 dual
core AMD64 285 (the fastest at that time and still pretty fast now :-)

The machine stats paging when I run 2 copies of R working on two things
like that :-). If you look at my last e-mail I found a solution but
still have no clue why the heck x<-as.data.frame(y) where why is a list
of the same columns take real for ever and this the thing that killed me
before.

Thanks,
Latchezar

-----Original Message-----
From: jim holtman [mailto:jholtman at gmail.com]
Sent: Saturday, July 21, 2007 5:33 PM
To: Latchezar Dimitrov
Cc: Benilton Carvalho; r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

One of the problems is that you are probably paging on your
system with an object that size (240000 x 1000).  This is
about 1GB for a single object:

set.seed(123)
n <- 240000
system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA", "AB", "BB"),
+ 1000, prob=c(1000, 1, 1), rep=T)))
+ })
   user  system elapsed
  95.00    0.61  104.71
names(genoT) = paste("snp", 1:n, sep="")

object.size(genoT)
[1] 1045258752

I can create it on my 2GB machine as a list, but have
problems converting it to a dataframe because I don't have
enough memory.

So unless you have at least 4GB on your system, it might take
a long time.  Look at your performance measurements on your
system and see if you have run out of physical memory and are paging.

On 7/21/07, Latchezar Dimitrov <ldimitro at wfubmc.edu> wrote:
Hi,

Thanks for the help. My 1st question still unanswered though :-)
Please see bellow

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 3:30 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

set.seed(123)
genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB",
"BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
names(genoT) = paste("snp", 1:240000, sep="") genoT =
as.data.frame(genoT)
Now this _is the problem. Everything before converting to
data.frame
worked almost instantaneously however as.data.frame runs forever.
Obviously there is some scalability memory management issue. When I
tried my own method but creating a new result (instead of modifying
the
old) dataframe it worked like a charm for the 1st 100 cols ~ .3s. I
figured 300,000 cols should be ~1000s. Nope! It ran for about
50,000(!)s to finish about 42,000 cols only.

BTW, what ver. of R is yours?

Now here's what I "discovered" further.

#-- create a 1-col frame:
   geno   <-

data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),row.names=c(rownames(geno.G
AS
P),rownames(geno.JAG)))

#-- main code I repeated it w/ j in 1:1000, 2001:3000, and
3001:4000,
i.e., adding a 1000 of cols to geno each time

system.time(
#   for(j in 1:(ncol(geno.GASP      ))){
   for(j in 3001:(4000              )){
     gt.GASP<-geno.GASP[[j]]
      for(l in 1:length(gt.GASP at levels)){
        levels(gt.GASP)[l] <-
switch(gt.GASP at levels[l],AA="0",AB="1",BB="2")
      }
      gt.JAG <-geno.JAG [[j]]
#      for(l in 1:length(gt.JAG @levels)){
#        levels(gt.JAG )[l] <- switch(gt.JAG
@levels[l],AA="0",AB="1",BB="2")
#      }
      geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
###               factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
                         ,as.numeric(factor(gt.JAG, levels=0:2))-1
                         )
                       ,levels=0:2
                       )
   }
)

Times (each one is for a 1000 cols!):
[1] 26.673  0.032 26.705  0.000  0.000 [1] 77.186  0.037
77.225  0.000
0.000
[1] 128.165   0.042 128.209   0.000   0.000
[1] 180.940   0.047 180.989   0.000   0.000

See the big diff and the scaling I mentioned above?

Further more I removed geno[[j]] assignment leaving the operation
though, i.e., replaced it with ### line above. Times:

[1] 0.857 0.008 0.865 0.000 0.000

Huh!? What the heck! That's my second question :-) Any ideas?

I still believe my method is near optimal. Of course I have
to somehow
get rid of the assignment bottleneck.

For now the lesson is: "God bless lists"

Here is my final solution:

system.time({
+     geno.GASP.L<-lapply(geno.GASP
+                        ,function(x){
+                           for(l in
1:length(x at levels)){levels(x)[l]
+ <-
switch(x at levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+     geno.JAG.L <-lapply(geno.JAG
+                        ,function(x){
+ #                         for(l in
1:length(x at levels)){levels(x)[l] <-
switch(x at levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+ })
[1] 192.800   1.566 194.413   0.000   0.000   !!!!!!!!! :-)))))
system.time({
+     class    (geno.GASP.L)<-"data.frame"
+     row.names(geno.GASP.L)<-row.names(geno.GASP)
+     class    (geno.JAG.L )<-"data.frame"
+     row.names(geno.JAG.L )<-row.names(geno.JAG )
+ })
[1] 12.156  0.001 12.155  0.000  0.000
system.time({
+     geno<-rbind(geno.GASP.L,geno.JAG.L)
+ })
[1] 1542.340    9.072 2066.310    0.000    0.000

I logged my notes here as I was trying various things. Partly the
reason is my two questions:

"What was wrong with me?" and
"What the heck?!" remember above? :-)))

which  still remain unanswered :-(

I would have had a lot of fun if I had not to have this done by ...
Yesterday :-))

Thanks a lot for the help

Latchezar

dim(genoT)
class(genoT)
system.time(out <- lapply(genoT, function(x) match(x,
c("AA", "AB",
"BB"))-1))
##
##
    user  system elapsed
119.288   0.004 119.339

(for all 240K)

best,
b

ps: note that "out" is a list.

On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:

Hi,

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Friday, July 20, 2007 12:25 AM
To: Latchezar Dimitrov
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Dataframe of factors transform speed?

it looks like that whatever method you used to genotype the
1002 samples on the STY array gave you a transposed matrix of
genotype calls. :-)
It only looks like :-)

Otherwise it is correctly created dataframe of 1002
samples X (big
number) of columns (SNP genotypes). It worked perfectly until I
decided to put together to cohorts independently processed in R
already. I got stuck with my lack of foreseeing.
Otherwise I would
have put 3 dummy lines w/ AA,AB, and AB on each one to make
sure all 3
genotypes are present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately
here but the
dataframe is correct (I have not used FORTRAN since
FORTRAN IV ;-)
- as
str says 1002 observ. of (big number) vars.

i'd use:

genoT = read.table(yourFile, stringsAsFactors = FALSE)

as a starting point... but I don't think that would be
efficient (as
you'd need to fix one column at a time - lapply).
No it was not efficient at all. 'matter of fact nothing is more
efficient then loading already read data, alas :-(

i'd preprocess yourFile before trying to load it:

cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e
's/BB/3/ g' > outFile

and, now, in R:

genoT = read.table(outFile, header=TRUE)
... Too late ;-) As it must be clear now I have two
dataframes I want
to put together with rbind(geno1,geno2). The issue again is
"uniformization" of factor variables w/ missing factors -
they ended
up like levels AA,BB on one of the and levels AB,BB on the
other which
means as.numeric of AA is 1 on the 1st and as.numeric
of AB is 1
on the second - complete mess. That's why I tried to make both
uniform,
i.e.
levels "AA","AB", and "BB" for every SNP and then rbind works.

In any case my 1st questions remains: "What's wrong
with me?" :-)
Thanks,
Latchezar

b

On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:

Hello,

This is a speed question. I have a dataframe genoT:

dim(genoT)
[1]   1002 238304

str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1
1 1 2 2 2
...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1
1 1 1 1 1
...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3
3 3 3 3 3
...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2
3 2 3 3 1
...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA
1 NA 2 1 1
2 1
...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3
3 3 3 3 2
...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1
1 1 1 1 2
...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2
1 2 1 1 3
...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB":
2 2 3 3 3 2
1
2 2 3
...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB":
3 3 3 3 3 3
3
3 3 3
...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB":
2 2 2 1 1 1
2
2 2 2
...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1
1 1 1 1 1
1
1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB":
2 3 2 2 3 2
2
1 1 2
...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB":
1 1 1 1 1 1
1
1 1 1
...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB":
1 1 2 1 1 1
1
1 1 1
...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1
1 1 1 1 1
1
1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB":
1 1 1 1 1 2
1
1 1 2
...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2
2 2 NA 1 NA
2
1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB":
1 2 2 1 1 1
3
1 1 1
...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2
2 2 2 2 2
2
2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1
1 1 1 1 1
1
1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB":
1 1 1 1 1 1
1
2 2 1
...

Its columns are factors with different number of levels
(from 1 to 3 -
that's what I got from read.table, i.e., it dropped missing
levels). I
want to convert it to uniform factors with 3 levels. The
1st 10 rows
above show already converted columns and the rest are not yet
converted.
Here's my attempt wich is a complete failure as speed:

system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st
10 cols and
measure the time, it otherwise is ncol(genoT) instead of 10

+        gt<-genoT[[j]]          #-- this is to avoid
2D indices
+        for(l in 1:length(gt at levels)){
+          levels(gt)[l] <-
switch(gt at levels[l],AA="0",AB="1",BB="2")
#-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #--
make a 3-level
factor
and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems like replacing 10 x 3 levels and then making
a factor
of
1002 element vector x 10 is a "negligible" amount of
operations
needed.

So, what's wrong with me? Any idea how to accelerate
significantly the
transformation or (to go to the very beginning) to make
read.table use
a fixed set of levels ("AA","AB", and "BB") and not
to drop any
(missing)
level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine is with 32G RAM and AMD Opteron 285 (2.? GHz)
so it's not
it.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,
Winston-Salem, North
Carolina, USA

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-
guide.html and provide commented, minimal, self-contained,
reproducible code.

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?