
How to transpose it in a fast way?

On 03/08/2013 06:01 AM, Jan van der Laan wrote:
> I think it's worth asking what the overall goal is; all we get from this
> exercise is another large file that we can't easily manipulate in R!

But there's nothing like a little challenge. The idea would be to transpose in 
chunks: scan in some number of rows, then write the transposed chunk to a 
temporary file

     tpose1 <- function(fin, nrowPerChunk, ncol) {
         ## read the next nrowPerChunk rows from the open connection
         v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
         m <- matrix(v, ncol=ncol, byrow=TRUE)
         fout <- tempfile()
         ## write() emits m column-by-column, nrow(m) values per line,
         ## i.e., the transposed chunk
         write(m, fout, nrow(m), append=TRUE)
         fout
     }

Apparently the data are 60k x 60k, so we could perhaps read chunks of 10k rows x 
60k columns at a time from some file fl <- "big.txt"

     ncol <- 60000L
     nrowPerChunk <- 10000L
     nChunks <- ncol / nrowPerChunk    # the matrix is square, so nrow == ncol

     fin <- file(fl); open(fin)
     fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
     close(fin)

'fls' is now a vector of file paths, each containing a transposed slice of the 
matrix. The next task is to splice these together. We could do this by taking a 
slice of rows from each file, cbind'ing them together, and writing to an output

     splice <- function(fout, cons, nrowPerChunk, ncolPerChunk) {
         ## read the next nrowPerChunk rows from each transposed slice
         slices <- lapply(cons, function(con) {
             v <- scan(con, character(), nmax=nrowPerChunk * ncolPerChunk)
             matrix(v, nrowPerChunk, byrow=TRUE)
         })
         m <- do.call(cbind, slices)
         ## write(t(m), ...) emits m row-by-row, ncol(m) values per line
         write(t(m), fout, ncol(m), append=TRUE)
     }

We'd need to use open connections as inputs and output

     cons <- lapply(fls, file); for (con in cons) open(con)
     fout <- file("big_transposed.txt"); open(fout, "w")
     ## each slice file holds ncol rows of nrowPerChunk values, so the
     ## per-slice width (4th argument) is nrowPerChunk
     xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
     for (con in cons) close(con)
     close(fout)
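As a sanity check, the whole pipeline can be run on a toy example. This is a 
sketch of my own (not part of the original post), using a 4 x 4 character matrix 
transposed in chunks of 2 rows, and repeating tpose1() and splice() so the 
snippet runs on its own:

```r
## the two helpers, repeated verbatim so this runs standalone
tpose1 <- function(fin, nrowPerChunk, ncol) {
    v <- scan(fin, character(), nmax=ncol * nrowPerChunk, quiet=TRUE)
    m <- matrix(v, ncol=ncol, byrow=TRUE)
    fout <- tempfile()
    write(m, fout, nrow(m), append=TRUE)
    fout
}
splice <- function(fout, cons, nrowPerChunk, ncolPerChunk) {
    slices <- lapply(cons, function(con) {
        v <- scan(con, character(), nmax=nrowPerChunk * ncolPerChunk,
                  quiet=TRUE)
        matrix(v, nrowPerChunk, byrow=TRUE)
    })
    m <- do.call(cbind, slices)
    write(t(m), fout, ncol(m), append=TRUE)
}

## a 4 x 4 matrix written to a temporary file, one row per line
ncol <- 4L; nrowPerChunk <- 2L; nChunks <- ncol / nrowPerChunk
x <- matrix(as.character(1:16), ncol, byrow=TRUE)
fl <- tempfile()
write(t(x), fl, ncol)

## pass 1: transposed slices
fin <- file(fl); open(fin)
fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
close(fin)

## pass 2: splice the slices together
cons <- lapply(fls, file); for (con in cons) open(con)
out <- tempfile()
fout <- file(out); open(fout, "w")
xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
for (con in cons) close(con)
close(fout)

y <- matrix(scan(out, character(), quiet=TRUE), ncol, byrow=TRUE)
identical(y, t(x))    # TRUE
```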

As another approach, it looks like the data are from genotypes. If they really 
only consist of pairs drawn from A, C, G, T, then two pairs, e.g., 'AA' and 
'CT', could be encoded as a single byte

     alf <- c("A", "C", "G", "T")
     nms <- outer(alf, alf, paste0)
     map <- outer(setNames(as.raw(0:15), nms),
                  setNames(as.raw(bitwShiftL(0:15, 4)), nms),
                  "|")

with e.g.,

 > map[matrix(c("AA", "CT"), ncol=2)]
[1] d0
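Going the other way, a hypothetical inverse (not in the post above) unpacks the 
low and high nibbles of a byte back into the two genotype strings; it relies on 
the names of 'map' being in the column-major order of 'nms':

```r
alf <- c("A", "C", "G", "T")
nms <- outer(alf, alf, paste0)    # column-major order matches codes 0:15
unmap <- function(b) {
    i <- as.integer(b)
    c(nms[i %% 16L + 1L],         # low nibble: first genotype
      nms[i %/% 16L + 1L])        # high nibble: second genotype
}
unmap(as.raw(0xd0))               # "AA" "CT"
```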

This translates the problem from representing the 60k x 60k array as a 
3.6-billion-element character vector (60k * 60k * 8 bytes of pointers alone, 
approx. 30 Gbytes) to a 60k x 30k raw array of 1.8 billion elements, which fits 
within R-2.15's vector-length limit, at approx. 1.8 Gbytes (probably usable on 
an 8 Gbyte laptop).
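The back-of-the-envelope arithmetic can be checked directly:

```r
60000 * 60000             # 3.6e9 elements as a character matrix
60000 * 60000 * 8 / 1e9   # ~28.8 Gbytes of pointers alone
60000 * 30000             # 1.8e9 elements packed two genotypes per byte
60000 * 30000 / 1e9       # ~1.8 Gbytes as a raw vector
.Machine$integer.max      # 2147483647: 1.8e9 fits in an R-2.15 vector
```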

Personally, I would probably put this data in a NetCDF / HDF5 file. Perhaps I'd 
use snpStats or GWASTools from Bioconductor, http://bioconductor.org.

Martin