[Bioc-devel] strange behavior on memory usage

Tue, Aug 23, 2005 2:47 PM

Hi Vince, et al.

it seems to me the problem is bigger than just fixing the "show" method
and caching (duplicating) e.g. the dimension information in extra slots.
I am a bit worried that if "getExpData" is such a memory hog the whole
eSet class becomes much less useful - and people might be tempted to
revert back to using simple matrices for performance-critical
computations. Is there a better way to do this avoiding such overhead
with "getExpData" in the first place? (I guess we might need somebody
who understands the memory management in R and perhaps even can write
some of the necessary infrastructure in C.)

What I don't understand in Benilton's email (one of the many things) is
this: "ps: i just noticed that using dim(exprs(x)) in show() reduces the
memory usage from 6GB to 3.5GB... " But the implementation of exprs() is

setMethod("exprs", "eSet",
          function(object) getExpData(object, "exprs"))

i.e. it just calls getExpData:

setMethod("getExpData", c("eSet", "character"),
          function(object, name) {
              object@eList[[name]]
          })
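
A minimal sketch of how one might check whether either accessor copies the
matrix, using base R's tracemem(). "ToySet" is a hypothetical stand-in for
eSet (a plain list in the eList slot), not the Biobase class, and the
generics are redefined locally so the example runs without Biobase. What a
current R build does says nothing certain about R 2.2.0.

```r
library(methods)

# Hypothetical stand-in for eSet: the expression matrix lives in a list slot.
setClass("ToySet", representation(eList = "list"))

setGeneric("getExpData", function(object, name) standardGeneric("getExpData"))
setMethod("getExpData", c("ToySet", "character"),
          function(object, name) object@eList[[name]])

setGeneric("exprs", function(object) standardGeneric("exprs"))
setMethod("exprs", "ToySet", function(object) getExpData(object, "exprs"))

x <- new("ToySet", eList = list(exprs = matrix(rnorm(2e5), ncol = 20)))

# tracemem() is only available if R was built with memory profiling.
if (capabilities("profmem")) {
  # Mark the matrix; R prints a message each time the marked object is duplicated.
  invisible(tracemem(x@eList$exprs))
}

d1 <- dim(getExpData(x, "exprs"))
d2 <- dim(exprs(x))

if (capabilities("profmem")) untracemem(x@eList$exprs)
```

If tracemem() stays silent for both calls, neither accessor duplicated the
matrix in this toy setup, and the difference would have to come from
somewhere else.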

  Best,
  Wolfgang

Vincent Carey 525-2265 wrote:

hi everyone,

i was wondering if anybody could give me a hint of what causes a strange
behavior on memory usage when using oligo/makePlatformDesign packages.

i'm reading a bunch of (affy) SNP chips:

x = read.celfiles(list.celfiles())

    -> at this point the R process uses around 2GB
    -> which does not look bad, since i'm reading 90 samples

show(x)

    -> now the R process uses around 6GB
    -> how can i improve the code so it does not use so much memory?
    -> the information i'm using at this step comes basically from
    ->       dim(getExpData(x, "exprs"))


I have not tried to reproduce this yet for lack of time.  But it
seems to me that the principle we need to establish here is:
for any massive data structure, we need to put relevant metadata in slots,
and interrogate only those slots.  I don't know what dim() or getExpData()
are doing, but my guess is that they are making copies of something
that they shouldn't need to copy.  You mention an issue with str() also;
perhaps we need to write an oligoBatch method for str() that doesn't
poke around too much?  Not sure.

Let's put the necessary dimension data in slots and be sure to update
those slots whenever subsetting is done.  And anything that show() needs
should likewise be available without doing anything to the potentially
massive datastructures.
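
The slot-caching idea above can be sketched roughly as follows. The class
"CachedSet", its "dims" slot and the constructor are hypothetical names
chosen for illustration; this is not the oligoBatch or eSet definition.

```r
library(methods)

# Hypothetical class: 'dims' caches dim(exprs) so that show() never has to
# touch the (potentially massive) 'exprs' matrix.
setClass("CachedSet", representation(exprs = "matrix", dims = "integer"))

# Constructor keeps the cached dimensions consistent with the data.
CachedSet <- function(m) new("CachedSet", exprs = m, dims = dim(m))

# show() reads only the small metadata slot.
setMethod("show", "CachedSet", function(object) {
  cat("CachedSet with", object@dims[1], "features and",
      object@dims[2], "samples\n")
})

# Subsetting rebuilds the object, so the cache is refreshed every time.
setMethod("[", "CachedSet", function(x, i, j, ..., drop = TRUE) {
  if (missing(i)) i <- seq_len(x@dims[1])
  if (missing(j)) j <- seq_len(x@dims[2])
  m <- x@exprs[i, j, drop = FALSE]   # always keep a matrix
  new("CachedSet", exprs = m, dims = dim(m))
})

x <- CachedSet(matrix(0, nrow = 1000, ncol = 90))
y <- x[1:10, 1:5]
shown <- capture.output(show(y))
```

The cost is the usual one for duplicated metadata: every method that
modifies the matrix must also refresh the cached slot.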

A couple of other points:
1) I noticed that a pdmapping environment has X and Y as vectors of integers.
These are pretty big.  Is it possible to use i2xy and xy2i software to get
rid of these completely?  these functions can be put into the environment,
and the necessary offsets can be updated whenever a subset is done using
a closure construct
2) installed package footprints with large .rda structures can be enormous, approaching
1GB.  We can use save(...,compress=TRUE) to reduce the installed footprint
and the usage overhead at load time seems quite acceptable.  I got the
pdmapping50khind240.rda down from 440MB to 60MB with this method.  I understand
that compress=TRUE has no impact on the compressed preinstallation package size.
I am concerned about postinstall footprints.
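
Point 1 above could look something like the sketch below: a closure captures
the chip width, and the pair of conversion functions replaces the stored X
and Y vectors. The names make_xy_converter and pdenv, the 0-based
coordinates with a 1-based index, and the width of 1600 (chosen only because
1600 * 1600 equals the 2560000 rows in the str() output) are all assumptions
for illustration, not the actual pdmapping layout. Updating offsets after a
subset is not shown.

```r
# Build index <-> coordinate converters that remember the chip width.
make_xy_converter <- function(ncol) {
  force(ncol)  # evaluate now so the closure captures the value
  list(
    # (x, y) are 0-based coordinates; the returned index is 1-based.
    xy2i = function(x, y) x + ncol * y + 1L,
    # Inverse mapping: returns a two-column matrix with columns x and y.
    i2xy = function(i) cbind(x = (i - 1L) %% ncol, y = (i - 1L) %/% ncol)
  )
}

conv <- make_xy_converter(1600L)

# Store the functions in an environment instead of two large integer vectors.
pdenv <- new.env()
assign("xy2i", conv$xy2i, envir = pdenv)
assign("i2xy", conv$i2xy, envir = pdenv)

i  <- pdenv$xy2i(c(0L, 5L), c(0L, 2L))
xy <- pdenv$i2xy(i)
```

The environment then holds two small functions, and coordinates are computed
on demand rather than stored per feature.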

gc()

    -> back to 2GB

in the above, 'x' is an oligoBatch object (which contains eSet, details at the
end of this message).

any suggestion?

thanks a lot,

benilton

ps: i just noticed that using dim(exprs(x)) in show() reduces the memory usage
from 6GB to 3.5GB... and using str(x) increases it to 10.5GB.

-----------------------------------------------------------------------------
R version 2.2.0, 2005-07-26, x86_64-unknown-linux-gnu

attached base packages:
[1] "tools"     "methods"   "stats"     "graphics"  "grDevices" "utils"
[7] "datasets"  "base"

other attached packages:
     oligo reposTools    Biobase
   "0.0.7"    "1.6.0"    "1.6.6"
-------------------------------------------------------------------------------

str(x)

Formal class 'oligoBatch' [package "oligo"] with 8 slots
  ..@ manufacturer: chr "Affymetrix"
  ..@ platform    : chr "Mapping50K_Hind240"
  ..@ eList       :Formal class 'exprList' [package "Biobase"] with 2 slots
  .. .. ..@ eMetadata:`data.frame':     0 obs. of  0 variables
  .. .. ..@ eList    :List of 1
  .. .. .. ..$ exprs: num [1:2560000, 1:90]  1369 65472  ...
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
  ..@ description :Formal class 'MIAME' [package "Biobase"] with 11 slots
  .. .. ..@ name          : chr ""
  .. .. ..@ lab           : chr ""
  .. .. ..@ contact       : chr ""
  .. .. ..@ title         : chr ""
  .. .. ..@ abstract      : chr ""
  .. .. ..@ url           : chr ""
  .. .. ..@ samples       : list()
  .. .. ..@ hybridizations: list()
  .. .. ..@ normControls  : list()
  .. .. ..@ preprocessing :List of 2
  .. .. .. ..$ filenames   : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
  .. .. .. ..$ oligoversion: chr NA
  .. .. ..@ other         : list()
  ..@ annotation  : chr ""
  ..@ sampleNames : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
  ..@ notes       : chr ""
  ..@ phenoData   :Formal class 'phenoData' [package "Biobase"] with 3 slots
  .. .. ..@ pData      :`data.frame':   90 obs. of  1 variable:
  .. .. .. ..$ sample: int [1:90] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..@ varLabels  :List of 1
  .. .. .. ..$ sample: chr "arbitrary numbering"
  .. .. ..@ varMetadata:`data.frame':   0 obs. of  0 variables

Best regards
   Wolfgang

-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax:   +44 1223 494486
Http:  www.ebi.ac.uk/huber
