
tapply for enormous (>2^31 row) matrices

4 messages · Matthew Keller, ilai, Gabor Grothendieck

#
Hi all,

SETUP:
I have pairwise data on 22 chromosomes. Data matrix X for a given
chromosome looks like this:

1 13 58 1.12
6 142 56 1.11
18 307 64 3.13
22 320 58 0.72

Here, column 1 is the first person's ID, column 2 is the second
person's ID, column 3 can be ignored, and column 4 is how much
chromosomal sharing those two individuals have in some small portion
of the chromosome. There are 9,000 individuals, and therefore
~ (9000^2)/2 pairwise matches at each small location on the
chromosome, so across an entire chromosome these matrices are VERY
large (e.g., 3 billion rows, which exceeds R's 2^31 - 1 vector length
limit). I have access to a server with 64-bit R, 1 TB of RAM and 80
processors.

PROBLEM:
A pair of individuals (e.g., persons 1 and 13 from the first row
above) will show up multiple times in a given file. I want to sum
column 4 across each pair of individuals. If I could bring the matrix
into R, I could use tapply() to accomplish this by indexing on
paste(X[,1],X[,2]), but the matrix doesn't fit into R. I have been
trying to use the bigmemory and bigtabulate packages, but when I try
the bigsplit function, R never completes the operation (after a day,
I killed the process). In particular, I did this:

library(bigmemory)
library(bigtabulate)
X <- read.big.matrix("file.loc.X", sep=" ", type="double")
hap.indices <- bigsplit(X, 1:2)  # this runs for too long to be useful on these matrices
# I was then going to use a foreach loop to sum across the splits
# identified by bigsplit
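
For reference, the in-memory version I'd use if X fit in RAM is just
this (illustrative only, since this is exactly what won't run at 3
billion rows):

pair.id <- paste(X[,1], X[,2])            # key identifying each pair
pair.sums <- tapply(X[,4], pair.id, sum)  # sum sharing within each pair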

SO - does anyone have ideas on how to deal with this problem, i.e.,
how to use a tapply()-like function on an enormous matrix? This isn't
necessarily a bigtabulate question (although if I screwed up using
bigsplit, let me know). If another package (e.g., an SQL package) can
do something like this efficiently, I'd like to hear about it and your
experiences using it.

Thank you in advance,

Matt
#
On Tue, Feb 21, 2012 at 4:04 PM, Matthew Keller <mckellercran at gmail.com> wrote:

How about just using foreach earlier in the process? E.g., split
file.loc.X into (say) 80 subfiles and then run
read.big.matrix/bigsplit/sum inside %dopar%.
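
An untested sketch (the subfile names are assumptions; you'd pre-split
the big file first, e.g. with the Unix split command):

library(foreach)
library(doParallel)
library(bigmemory)
library(bigtabulate)
registerDoParallel(cores=80)
subfiles <- sprintf("file.loc.X.%02d", 0:79)  # hypothetical subfile names
partial <- foreach(f = subfiles,
                   .packages=c("bigmemory", "bigtabulate")) %dopar% {
  X <- read.big.matrix(f, sep=" ", type="double")
  ind <- bigsplit(X, 1:2)                 # row indices grouped by pair
  sapply(ind, function(i) sum(X[i, 4]))   # per-pair sums in this subfile
}
# pairs that span subfiles still need one final merge across 'partial'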

If splitting X beforehand is a problem, you could also use ?scan to
read in different chunks of the file, something like (untested
obviously):
# for X a matrix 800x4
lineind <- seq(1, 800, 100)   # start line of each 100-row chunk
ReducedX <- foreach(i = 1:8) %dopar% {
  x <- scan('file.loc.X', what=list(double(0), double(0), double(0), double(0)),
            skip=lineind[i]-1, nlines=100)  # skip counts lines to skip, hence -1
  # ... do your thing on x (aggregate/tapply etc.)
}
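
Then combine the per-chunk summaries at the end, e.g. (assuming each
element of ReducedX ends up as a data frame with ID1, ID2 and sharing
columns; the names are illustrative):

allchunks <- do.call(rbind, ReducedX)
final <- aggregate(sharing ~ ID1 + ID2, data=allchunks, FUN=sum)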

Hope this helped
Elai.
#
Thank you all very much for your help (on both the r-help and the
bioconductor listservs).

Benilton - I couldn't get sqldf to install on the server I'm using
(error is: Error : package 'gsubfn' does not have a name space). I
think this was a problem for R 2.13, and I'm trying to get the admins
to install a more up-to-date version. I know I probably need to learn
a modicum of SQL given the sizes of the datasets I'm using now.

I ended up using a modified version of Hervé Pagès' excellent code
(thank you!). I got a huge (40-fold) speed bump by using the
data.table package for the indexing/aggregate steps, making an
hours-long job a minutes-long job. SO - data.table is hugely useful
if you're dealing with indexing/apply-family functions on huge
datasets. By the way, I'm not sure why, but read.table was a bit
faster than scan for this problem... Here is the code for others:


require(data.table)

computeAllPairSums <- function(filename, nbindiv, nrows.to.read)
{
    con <- file(filename, open="r")
    on.exit(close(con))
    ans <- matrix(numeric(nbindiv * nbindiv), nrow=nbindiv)
    chunk <- 0L
    while (TRUE) {
        # read.table was faster than scan here; it errors (rather than
        # returning 0 rows) once the connection is exhausted, so catch that
        df0 <- tryCatch(read.table(con,
                            col.names=c("ID1", "ID2", "ignored", "sharing"),
                            colClasses=c("integer", "integer", "NULL", "numeric"),
                            nrows=nrows.to.read, comment.char=""),
                        error=function(e) NULL)
        if (is.null(df0) || nrow(df0) == 0L)
            break
        chunk <- chunk + 1L
        cat("Processing chunk", chunk, "... ")
        # sum sharing within each (ID1, ID2) pair for this chunk
        DT <- data.table(df0)
        setkey(DT, ID1, ID2)
        ss <- DT[, sum(sharing), by="ID1,ID2"]
        # accumulate the chunk's pair sums into ans via matrix indexing
        idd <- as.matrix(subset(ss, select=1:2))
        newvec <- as.vector(as.matrix(subset(ss, select=3)))
        ans[idd] <- ans[idd] + newvec
        cat("OK\n")
    }
    ans
}
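
For what it's worth, I call it like this (nbindiv=9000 matches my
sample; the chunk size below is just a placeholder, use whatever fits
comfortably in RAM):

res <- computeAllPairSums("file.loc.X", nbindiv=9000, nrows.to.read=5e6)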
On Wed, Feb 22, 2012 at 3:20 PM, ilai <keren at math.montana.edu> wrote:
#
On Thu, Feb 23, 2012 at 11:39 AM, Matthew Keller <mckellercran at gmail.com> wrote:
Right. See the troubleshooting section of the sqldf home page:
http://code.google.com/p/sqldf/#Troubleshooting
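
Once that's sorted, the whole aggregation can also be pushed down into
SQLite, e.g. (untested here; with header=FALSE the columns should come
in with read.csv-style default names V1..V4, and the file-backed
dbname keeps the data out of R's memory):

library(sqldf)
sums <- read.csv.sql("file.loc.X", sep=" ", header=FALSE,
    sql="select V1, V2, sum(V4) from file group by V1, V2",
    dbname=tempfile())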