[Bioc-devel] Memory limits for individual objects
GGtools uses snpMatrix snp.matrix instances extensively, held as a list of chromosome-specific snp.matrices. You might be able to decompose the data in various ways so that very large quantities of genotype data are kept in one object without any single entity having more than 2e9 elements. On a machine with adequate RAM, list(integer(2e9), integer(2e9)) can be constructed.
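A minimal sketch of that decomposition, using plain raw matrices as stand-ins for snp.matrix objects. The dimensions and the names n_samples, snps_per_chr and geno are made up for illustration; this is not how GGtools itself builds its list.

```r
# Hypothetical toy dimensions; real data would have far more SNPs per chromosome.
n_samples <- 4L
snps_per_chr <- c(chr1 = 5L, chr2 = 3L, chr3 = 6L)

# One raw matrix per chromosome (samples in rows, SNPs in columns),
# standing in for one snp.matrix per chromosome.
geno <- lapply(snps_per_chr, function(n_snps) {
  matrix(as.raw(0), nrow = n_samples, ncol = n_snps)
})

# Each piece must stay within the per-vector limit of 2^31 - 1 elements ...
piece_sizes <- vapply(geno, length, integer(1))
stopifnot(all(piece_sizes <= .Machine$integer.max))

# ... while the total across the list is free to exceed it.
# Summing as doubles avoids integer overflow for realistic sizes.
total_elements <- sum(as.numeric(piece_sizes))

# Per-chromosome work then becomes a loop over the list.
n_snps_by_chr <- vapply(geno, ncol, integer(1))
```

The limit applies to each list element separately, so only the largest chromosome needs to fit under it.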
On Wed, Jul 29, 2009 at 8:06 AM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
Hi Tim -- Tim Rayner <tfrayner at gmail.com> writes:
Hi, I'm running R 2.9.0 on Mac OS X, and attempting to import a large number of SNP calls into a snp.matrix object as supported by the snpMatrix package. I've run into a problem where I'd like the final matrix object to contain around 5e9 elements, but of course the maximum vector (and matrix) size in R is 2^31-1 (approx. 2e9), and I get the dreaded "allocMatrix: too many elements specified" error message.

An obvious workaround is to split the analysis up into parts which will fit within this limit, but I feel I should ask whether there's a better way. I'm using a 64-bit build of R and I was wondering whether anyone had experience changing the indexing of R vectors and matrices from signed 32-bit integers to signed 64-bit integers? Or should I just head over to R-devel directly (in which case, apologies for the mispost)?

There was a not-terribly-helpful exchange regarding this question on the main Bioconductor list last year: http://www.nabble.com/allocMatrix-limits-td18763791.html
I'm not speaking with too much authority here, not having looked in detail into snpMatrix. It would be a significant task (not impossible, very challenging if one aims for portability) to change this limitation in R.

The fastest way forward is probably to use some on-disk storage (I like the ncdf package for large numeric matrices) coupled with access to slices at a time. Alternatively you can manage your own memory via external pointers, etc., but this requires that you implement whatever matrix-like functionality you want, losing all of the hard work others have done. There are packages in the bioc repository that have addressed these issues to one degree or another, including BufferedMatrix and externalVector.

An interesting activity might be to augment the snpMatrix package with external pointer memory management, because as you say this really is a case where the data quickly hit the R limit, and restricting focus to snpMatrix delimits the functionality that would need to be provided by your code.

Another possibility is to scrap the strict 'matrix' representation; maybe the Rle class from IRanges would be a very effective compression tool, leading to straightforward and efficient algorithms for basic calculations, and allowing not too expensive expansion of slices of the Rle to full vectors for more elaborate computation.

Martin
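A rough sketch of the run-length idea, using base R's rle() rather than the IRanges Rle class, so none of the IRanges API is assumed. The genotype codes in calls are made up, and expand_slice is a hypothetical helper written only for this example.

```r
# Made-up genotype calls for one sample; long runs of the same call compress well.
calls <- c(rep(1L, 6), rep(2L, 2), rep(1L, 5), 3L, rep(1L, 4))

# Run-length encode: 18 calls become 5 (length, value) pairs.
compressed <- rle(calls)

# A basic calculation done directly on the runs: count of each call.
call_counts <- tapply(compressed$lengths, compressed$values, sum)

# Hypothetical helper: expand only positions from..to to a full vector,
# without expanding the whole encoding.
expand_slice <- function(r, from, to) {
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1L
  # Keep the runs that overlap the requested window.
  keep <- which(ends >= from & starts <= to)
  # Number of positions each kept run contributes inside the window.
  n <- pmin(ends[keep], to) - pmax(starts[keep], from) + 1L
  rep(r$values[keep], times = n)
}

slice <- expand_slice(compressed, 5, 10)
```

How much this saves depends entirely on how long the runs are in real genotype data; calls that change at nearly every position would take more space encoded than as a plain vector.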
Since the snpMatrix package stores each element as type 'raw', the final memory consumption for the snp.matrix object should only be a handful of gigabytes, readily available in modern desktop computers. It would be nice to be able to use that memory. Many thanks, Tim Rayner Bioinformatician - Smith Lab Cambridge Institute for Medical Research University of Cambridge
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
Vincent Carey, PhD Biostatistics, Channing Lab 617 525 2265