
How to transpose it in a fast way?

12 messages · Peter Langfelder, Ista Zahn, Peter Dalgaard +6 more

#
Dear all:

I have a big data file of 60000 columns and 60000 rows, like this:

AA AC AA AA .......AT
CC CC CT CT.......TC
..........................
.........................

I want to transpose it, and the output is a new file like this:
AA CC ............
AC CC............
AA CT.............
AA CT.........
....................
....................
AT TC.............

The key point is that I can't read it into R with read.table() because the
data is too large, so I tried this:
con <- file("silygenotype.txt", "r")
geno_t <- list()
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break  # end of file
  line <- unlist(strsplit(line, "\t"))
  geno_t <- cbind(geno_t, line)
}
write.table(geno_t, "xxx.txt")

It works, but it is too slow. How can I optimize it?

Thank you

Yao He
Master's candidate, 2nd year
Department of Animal Genetics & Breeding
Room 436, College of Animal Science & Technology,
China Agricultural University, Beijing, 100193
E-mail: yao.h.1988 at gmail.com
#
On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1988 at gmail.com> wrote:
I hate to be negative, but this will also not work on a 60000 x 60000
matrix. At some point R will complain either about the lack of memory
or about you trying to allocate a vector that is too long.

I think your best bet is to look at file-backed data packages (for
example, package bigmemory). Look at this URL:
http://cran.r-project.org/web/views/HighPerformanceComputing.html and
scroll down to "Large memory and out-of-memory data". Some of the
packages may have the functionality you are looking for and may do it
faster than your code.

If this doesn't help, you _may_ be able to make your code work, albeit
slowly, if you replace the cbind() with a data.frame. cbind() will in this
case produce a matrix, and matrices are limited to 2^31 elements,
which is less than 60000 times 60000. A data.frame is a special type
of list and so _may_ be able to handle that many elements, given
enough system RAM. There are experts on this list who will correct me
if I'm wrong.
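The arithmetic behind that limit is easy to check in R (a small illustration of the point above, not part of the original post):

```r
## Matrices (and all atomic vectors) in R 2.x hold at most 2^31 - 1 elements.
.Machine$integer.max                      # 2147483647, i.e. 2^31 - 1
60000 * 60000                             # 3.6e9 elements in the full matrix
60000 * 60000 > .Machine$integer.max      # TRUE: too large for one matrix
```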

If you are on a linux system, you can use split (type man split at the
shell prompt to see help) to split the file into smaller chunks of say
5000 lines or so. Process each file separately, write it into a
separate output file, then use the linux utility paste to "paste" the
files side-by-side into the final output.

Further, if you want to make it faster, do not grow geno_t by
cbind'ing a new column to it in each iteration. Pre-allocate a matrix
or data frame of an appropriate number of rows and columns and fill it
out as you go. But it will still be slow, which I think is due to the
inherent slowness of readLines and possibly strsplit.
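A minimal sketch of the pre-allocation idea (illustrative only: it assumes the row count is known up front, and a full 60000-element result may still exhaust RAM, as noted above):

```r
con <- file("silygenotype.txt", "r")
n_rows <- 60000                             # rows in the input (assumed known)
geno_t <- vector("list", n_rows)            # allocate once instead of growing
for (i in seq_len(n_rows)) {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break              # end of file
  geno_t[[i]] <- strsplit(line, "\t")[[1]]  # input row i becomes column i
}
close(con)
write.table(geno_t, "xxx.txt")
```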

HTH,

Peter
#
On Wed, Mar 6, 2013 at 7:56 PM, Peter Langfelder
<peter.langfelder at gmail.com> wrote:
Maybe this depends on the R version. I have not tried it, but the dev
version of R can handle much larger vectors. See
http://stat.ethz.ch/R-manual/R-devel/library/base/html/LongVectors.html

Yao He, if you are feeling adventurous you could give the development
version of R a try.

Best,
Ista
1 day later
#
On Mar 7, 2013, at 01:18, Yao He wrote:

As others have pointed out, that's a lot of data! 

You seem to have the right idea: if you read the file line by line and store each input row as a column, there is nothing left to transpose. A couple of points, though:

- The cbind() is a potential performance hit, since it copies the list every time around the loop. Pre-allocate with geno_t <- vector("list", 60000) and then assign
geno_t[[i]] <- <etc> inside the loop.

- You might use scan() instead of readLines() and strsplit().

- Perhaps reconsider the data type, as you seem to be reading strings with only 16 possible values. (I suspect that R already optimizes string storage to make this point moot, though.)
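Combining the pre-allocation and scan() suggestions gives a sketch along these lines (separator and dimensions are assumptions):

```r
con <- file("silygenotype.txt", "r")
n <- 60000
geno_t <- vector("list", n)
for (i in seq_len(n)) {
  ## scan() one line at a time; replaces the readLines()/strsplit() pair
  geno_t[[i]] <- scan(con, what = character(), nlines = 1,
                      sep = "\t", quiet = TRUE)
}
close(con)
```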
#
Perhaps you could process this with the Unix/Linux utility awk before reading the file into R.
-Sohail
#
You could use the fact that scan reads the data rowwise, and the fact  
that arrays are stored columnwise:

# generate a small example dataset
exampl <- array(letters[1:25], dim=c(5,5))
write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
     sep="\t", quote=FALSE)

# and read...
d <- scan("example.dat", what=character())
d <- array(d, dim=c(5,5))

t(exampl) == d


Although this is probably faster, it doesn't help with the large size.
You could use the n option of scan() to read chunks/blocks and feed
those to, for example, an ff array (which you would ideally have
preallocated).
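A rough sketch of that chunked approach (hypothetical: ff has no plain character mode, so this assumes the 16 genotype strings are first mapped to small integer codes):

```r
library(ff)                                 # file-backed arrays

n <- 60000
## Map the 16 possible genotype strings ("AA", "AC", ...) to codes 1..16.
codes <- as.vector(outer(c("A", "C", "G", "T"),
                         c("A", "C", "G", "T"), paste0))
out <- ff(vmode = "ubyte", dim = c(n, n))   # disk-backed n x n array

con <- file("silygenotype.txt", "r")
rows_per_block <- 1000
for (start in seq(1, n, by = rows_per_block)) {
  v <- scan(con, what = character(), nlines = rows_per_block, quiet = TRUE)
  m <- matrix(match(v, codes), ncol = n, byrow = TRUE)
  out[, start:(start + nrow(m) - 1)] <- t(m)  # store the block transposed
}
close(con)
```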

HTH,

Jan




peter dalgaard <pdalgd at gmail.com> wrote:
#
On Mar 8, 2013, at 6:01 AM, Jan van der Laan wrote:

This might avoid creation of some of the intermediate copies:

MASS::write.matrix( matrix( scan("example.dat", what=character()), 5,5), file="fil.out")

I tested it up to a 5000 x 5000 file:

Read 25000000 items

I'm not sure of the exact timing; probably 5-10 minutes. The exampl object takes 200,001,400 bytes, and the run did not noticeably stress my machine; most of my RAM remained untouched. I'm going out on errands and will run timing on a 10K x 10K test case within a system.time() enclosure. scan did report successfully reading 100000000 items fairly promptly.
#
On Mar 8, 2013, at 9:31 AM, David Winsemius wrote:

Read 100000000 items
    user   system  elapsed 
 487.100  912.613 1415.228

And for a smaller 250,000-item test:

Read 250000 items
   user  system elapsed 
  1.184   2.507   3.834 

And so it seems to scale linearly (3.834 s x 400 = 1533.6 s, close to the 1415.228 s measured above):
[1] 1533.6
David Winsemius
Alameda, CA, USA
#
On Mar 8, 2013, at 10:59 AM, David Winsemius wrote:

However, another posting today reminds us that this would best be attempted in a version of R that can handle matrices with more than 2^31 - 1 elements:
[1] TRUE
[1] FALSE

R 3.0 is scheduled for release soon, and you can compile it from sources if your machine is properly equipped. It supports long vectors, and I _think_ it may support such larger matrices.
#
On 03/08/2013 06:01 AM, Jan van der Laan wrote:
I think it's worth asking what the overall goal is; all we get from this 
exercise is another large file that we can't easily manipulate in R!

But nothing like a little challenge. The idea I think would be to transpose in 
chunks of rows by scanning in some number of rows and writing to a temporary file

     tpose1 <- function(fin, nrowPerChunk, ncol) {
         v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
         m <- matrix(v, ncol=ncol, byrow=TRUE)
         fout <- tempfile()
         write(m, fout, nrow(m), append=TRUE)
         fout
     }

Apparently the data is 60k x 60k, so we could maybe easily read 10k rows of 60k
columns at a time from some file fl <- "big.txt"

     ncol <- 60000L
     nrowPerChunk <- 10000L
     nChunks <- ncol / nrowPerChunk

     fin <- file(fl); open(fin)
     fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
     close(fin)

'fls' is now a vector of file paths, each containing a transposed slice of the 
matrix. The next task is to splice these together. We could do this by taking a 
slice of rows from each file, cbind'ing them together, and writing to an output

     splice <- function(fout, cons, nrowPerChunk, ncol) {
         slices <- lapply(cons, function(con) {
             v <- scan(con, character(), nmax=nrowPerChunk * ncol)
             matrix(v, nrowPerChunk, byrow=TRUE)
         })
         m <- do.call(cbind, slices)
         write(t(m), fout, ncol(m), append=TRUE)
     }

We'd need to use open connections as inputs and output

     cons <- lapply(fls, file); for (con in cons) open(con)
     fout <- file("big_transposed.txt"); open(fout, "w")
     xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
     for (con in cons) close(con)
     close(fout)

As another approach, it looks like the data are from genotypes. If they really 
only consist of pairs of A, C, G, T, then two pairs e.g., 'AA' 'CT' could be 
encoded as a single byte

     alf <- c("A", "C", "G", "T")
     nms <- outer(alf, alf, paste0)
     map <- outer(setNames(as.raw(0:15), nms),
                  setNames(as.raw(bitwShiftL(0:15, 4)), nms),
                  "|")

with e.g.,

 > map[matrix(c("AA", "CT"), ncol=2)]
[1] d0

This translates the problem of representing the 60k x 60k array as a 3.6 billion 
element vector of 60k * 60k * 8 bytes (approx. 30 Gbytes) to one of 60k x 30k = 
1.8 billion elements (fits in R-2.15 vectors) of approx 1.8 Gbyte (probably 
usable in an 8 Gbyte laptop).

Personally, I would probably put this data in a NetCDF / HDF5 file. Perhaps I'd 
use snpStats or GWASTools in Bioconductor http://bioconductor.org.

Martin

  
    
3 days later
#
Thanks for everybody's help!

I learned a lot from this discussion!



2013/3/10 jim holtman <jholtman at gmail.com>: