[Bioc-devel] strange behavior on memory usage
Hi Vince, et al.
it seems to me the problem is bigger than just fixing the "show" method
and caching (duplicating) e.g. the dimension information in extra slots.
I am a bit worried that if "getExpData" is such a memory hog the whole
eSet class becomes much less useful - and people might be tempted to
revert back to using simple matrices for performance-critical
computations. Is there a better way to do this avoiding such overhead
with "getExpData" in the first place? (I guess we might need somebody
who understands the memory management in R and perhaps even can write
some of the necessary infrastructure in C.)
What I don't understand in Benilton's Email (one of the many things) is
this "ps: i just noticed that using dim(exprs(x)) in show() reduces the
memory usage from 6GB to 3.5GB... " but the implementation of exprs() is
setMethod("exprs", "eSet",
function(object) getExpData(object, "exprs")
)
i.e. it just calls getExpData:
setMethod("getExpData", c("eSet", "character"),
function(object, name) {
object@eList[[name]] })
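To experiment with whether the two call paths really differ, the accessor chain above can be rebuilt on a small toy object. This is only a sketch under stated assumptions: the classes and generics below (toyExprList, toyESet, toyGetExpData, toyExprs) are hypothetical stand-ins, not the real Biobase definitions, and the inner list is reached directly instead of through a "[[" method on the list class.

```r
library(methods)

# Hypothetical stand-ins for exprList / eSet; NOT the Biobase classes.
setClass("toyExprList", representation(eList = "list"))
setClass("toyESet", representation(eList = "toyExprList"))

# Mirrors getExpData(): look up one named element of the inner list.
setGeneric("toyGetExpData",
           function(object, name) standardGeneric("toyGetExpData"))
setMethod("toyGetExpData", c("toyESet", "character"),
          function(object, name) object@eList@eList[[name]])

# Mirrors exprs(): a thin wrapper that only forwards to the getter.
setGeneric("toyExprs", function(object) standardGeneric("toyExprs"))
setMethod("toyExprs", "toyESet",
          function(object) toyGetExpData(object, "exprs"))

# Small matrix so the sketch runs anywhere; scale up to probe memory use.
m <- matrix(rnorm(2000 * 9), nrow = 2000, ncol = 9)
x <- new("toyESet", eList = new("toyExprList", eList = list(exprs = m)))

d1 <- dim(toyGetExpData(x, "exprs"))  # direct getter
d2 <- dim(toyExprs(x))                # via the wrapper
print(d1)
print(identical(d1, d2))
```

Wrapping either call between gc(reset = TRUE) and gc() on a larger matrix is one way to compare the "max used" figures for the two paths. Results may well differ between R versions, so this says nothing definite about R 2.2.0.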
Best,
Wolfgang
Vincent Carey 525-2265 wrote:
hi everyone, i was wondering if anybody could give me a hint of what causes a strange behavior on memory usage when using oligo/makePlatformDesign packages. i'm reading a bunch of (affy) SNP chips:
x = read.celfiles(list.celfiles())
-> at this point the R process uses around 2GB
-> which does not look bad, since i'm reading 90 samples
show(x)
-> now the R process uses around 6GB
-> how can i improve the code so it does not use so much memory?
-> the information i'm using at this step comes basically from
-> dim(getExpData(x, "exprs"))
I have not tried to reproduce this yet for lack of time. But it seems to me that the principle we need to establish here is: for any massive data structure, we need to put relevant metadata in slots, and interrogate only those slots. I don't know what dim() or getExpData() are doing, but my guess is that they are making copies of something that they shouldn't need to copy.
You mention an issue with str() also. Perhaps we need to write an oligoBatch method for str() that doesn't poke around too much? Not sure.
Let's put the necessary dimension data in slots and be sure to update those slots whenever subsetting is done. Anything that show() needs should likewise be available without touching the potentially massive data structures.
A couple of other points:
1) I noticed that a pdmapping environment has X and Y as vectors of integers. These are pretty big. Is it possible to use the i2xy and xy2i functions to get rid of these completely? These functions can be put into the environment, and the necessary offsets can be updated whenever a subset is done, using a closure construct.
2) Installed package footprints with large .rda structures can be enormous, approaching 1GB. We can use save(..., compress=TRUE) to reduce the installed footprint, and the usage overhead at load time seems quite acceptable. I got pdmapping50khind240.rda down from 440MB to 60MB with this method. I understand that compress=TRUE has no impact on the compressed preinstallation package size; I am concerned about postinstall footprints.
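The suggestion above (keep dimension metadata in slots, refresh it on subsetting, and have show() read only those slots) could look roughly like the following. This is a minimal sketch, not the oligo implementation: the class cachedBatch, its slots data and dims, and the constructor are all hypothetical names, and whether this actually lowers memory use in the case reported here is untested.

```r
library(methods)

# Hypothetical class: a large matrix plus its cached dimensions.
setClass("cachedBatch",
         representation(data = "matrix", dims = "integer"))

# Hypothetical constructor: fill the cache once, at creation time.
cachedBatch <- function(m) new("cachedBatch", data = m, dims = dim(m))

# show() reads only the small 'dims' slot, never the 'data' slot.
setMethod("show", "cachedBatch", function(object) {
  cat("cachedBatch with", object@dims[1], "features and",
      object@dims[2], "samples\n")
})

# Subsetting rebuilds the object, so the cache is refreshed every time.
setMethod("[", "cachedBatch", function(x, i, j, ..., drop = FALSE) {
  if (missing(i)) i <- seq_len(x@dims[1])
  if (missing(j)) j <- seq_len(x@dims[2])
  sub <- x@data[i, j, drop = FALSE]
  new("cachedBatch", data = sub, dims = dim(sub))
})

b <- cachedBatch(matrix(0, nrow = 100, ncol = 9))
s <- b[1:10, 2:4]
show(s)
```

The cost of this design is the one already noted: every method that changes the data must also update the cached slot, or the two can drift apart.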
gc()
-> back to 2GB
in the above, 'x' is an oligoBatch object (which contains eSet, details at the
end of this message).
any suggestion?
thanks a lot,
benilton
ps: i just noticed that using dim(exprs(x)) in show() reduces the memory usage
from 6GB to 3.5GB... and using str(x) increases it to 10.5GB.
-----------------------------------------------------------------------------
R version 2.2.0, 2005-07-26, x86_64-unknown-linux-gnu
attached base packages:
[1] "tools" "methods" "stats" "graphics" "grDevices" "utils"
[7] "datasets" "base"
other attached packages:
oligo reposTools Biobase
"0.0.7" "1.6.0" "1.6.6"
-------------------------------------------------------------------------------
str(x)
Formal class 'oligoBatch' [package "oligo"] with 8 slots
  ..@ manufacturer: chr "Affymetrix"
  ..@ platform : chr "Mapping50K_Hind240"
  ..@ eList :Formal class 'exprList' [package "Biobase"] with 2 slots
  .. .. ..@ eMetadata:`data.frame': 0 obs. of 0 variables
  .. .. ..@ eList :List of 1
  .. .. .. ..$ exprs: num [1:2560000, 1:90] 1369 65472 ...
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
  ..@ description :Formal class 'MIAME' [package "Biobase"] with 11 slots
  .. .. ..@ name : chr ""
  .. .. ..@ lab : chr ""
  .. .. ..@ contact : chr ""
  .. .. ..@ title : chr ""
  .. .. ..@ abstract : chr ""
  .. .. ..@ url : chr ""
  .. .. ..@ samples : list()
  .. .. ..@ hybridizations: list()
  .. .. ..@ normControls : list()
  .. .. ..@ preprocessing :List of 2
  .. .. .. ..$ filenames : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
  .. .. .. ..$ oligoversion: chr NA
  .. .. ..@ other : list()
  ..@ annotation : chr ""
  ..@ sampleNames : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
  ..@ notes : chr ""
  ..@ phenoData :Formal class 'phenoData' [package "Biobase"] with 3 slots
  .. .. ..@ pData :`data.frame': 90 obs. of 1 variable:
  .. .. .. ..$ sample: int [1:90] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..@ varLabels :List of 1
  .. .. .. ..$ sample: chr "arbitrary numbering"
  .. .. ..@ varMetadata:`data.frame': 0 obs. of 0 variables
Best regards
Wolfgang
-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax: +44 1223 494486
Http: www.ebi.ac.uk/huber