
[Bioc-devel] serializing pairwise alignment objects

7 messages · Hahne, Florian, Wolfgang Huber, Benilton Carvalho +1 more

#
Hi all,
I just realized that serialized PairwiseAlignmentsSingleSubject objects
grow ridiculously large:

x <- "xxxabcdefghijklmnopqyyy"
y <- "abcdhijkzzzzlmnpqr"
pa <- pairwiseAlignment(x,y)
save(pa, file="~/tmp/pa.rda")
file.info("~/tmp/pa.rda")
                 size isdir mode               mtime               ctime
~/tmp/pa.rda 22651025 FALSE  644 2012-11-02 09:23:09 2012-11-02 09:23:09
                           atime   uid   gid    uname   grname
~/tmp/pa.rda 2012-11-02 09:23:07 11281 11281 hahnefl1 hahnefl1



22 MB for this trivial alignment seems to be a little excessive.

Interestingly, the object itself has a quite impressive memory footprint:
object.size(pa)
35308996 bytes


Any idea what is going on here? Looks like a memory leak to me.


Florian

sessionInfo()
R version 2.15.1 RC (2012-06-21 r59599)
Platform: i386-apple-darwin11.4.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.26.2   IRanges_1.16.2      BiocGenerics_0.4.0
[4] BiocInstaller_1.8.2

loaded via a namespace (and not attached):
[1] parallel_2.15.1 stats4_2.15.1   tools_2.15.1
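
For anyone comparing the two numbers above: save() compresses its output by default, so the file on disk can be smaller than what object.size() reports. A minimal sketch of measuring the uncompressed serialized size directly in memory follows. The helper name serialized_size is hypothetical (not part of Biostrings or base R), and a plain matrix stands in for the alignment object so the snippet runs without Biostrings.

```r
# Hypothetical helper: size in bytes of the serialized form of any R object,
# computed in memory without writing a file.
serialized_size <- function(obj) length(serialize(obj, connection = NULL))

# Stand-in object (assumption); with Biostrings loaded one would pass `pa`.
m <- matrix(0, nrow = 1000, ncol = 1000)

object.size(m)       # in-memory footprint
serialized_size(m)   # uncompressed serialized size, in bytes
```

This is also roughly the payload a parallel worker has to ship back, which is why oversized objects hurt in that setting.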



--
#
Hi,

I can reproduce this on more recent versions of everything:
R Under development (unstable) (2012-10-31 r61057)
Platform: x86_64-apple-darwin12.2.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Biostrings_2.27.5  IRanges_1.17.7     BiocGenerics_0.5.1 fortunes_1.5-0    

loaded via a namespace (and not attached):
[1] stats4_2.16.0

Best wishes
	Wolfgang

On Nov 2, 2012, at 9:32 AM, "Hahne, Florian" <florian.hahne at novartis.com> wrote:
#
Hi,

Looks like Benilton is right:

   > slotNames(pa)
   [1] "pattern"           "subject"           "type"
   [4] "score"             "substitutionArray" "gapOpening"
   [7] "gapExtension"
   > sapply(slotNames(pa), function(sname) object.size(slot(pa, sname)))
             pattern           subject              type             score
               17056             17056                96                48
   substitutionArray        gapOpening      gapExtension
            35295336                48                48

I'm not sure why the substitutionArray would need to be stored in the
returned object (which downstream methods use it?). Would need to check.

H.
On 11/02/2012 09:41 AM, Benilton Carvalho wrote:

  
    
2 days later
#
Indeed. I did not look that far into the implementation; it just seemed odd
to me that the objects got that inflated. scoreOnly is not really that
helpful if you want to deal with the actual alignments. The only
reasonable application I see for it is if you want to rank a bunch of
sequences by pairwise similarity. This gigantic memory footprint is really
breaking things once you start doing a lot of these pairwise alignment
operations in parallel. mclapply complains about not being able to turn
such large objects into a raw vector, and serializing to disk quickly
fills your hard drive. You also lose a lot of the time gained by parallel
processing just by writing and loading gigabytes of data...
I don't know enough about the internals of the PairwiseAlignments classes,
but it seems that there must be a way to avoid having this huge array as
part of the object. As a quick and dirty fix for now I just replaced the
substitutionArray slot with an empty matrix and all the downstream
operations that I wanted to do still work. Would be great if you could
take a look into this, Herve.
Thanks,
Florian
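
The quick fix described above can be sketched roughly as follows. This is an illustration only: ToyAlignment is a made-up S4 class standing in for the real PairwiseAlignments classes, the array dimensions are arbitrary, and whether an emptied slot is safe depends on which downstream methods are used.

```r
library(methods)

# Toy stand-in class (assumption); NOT the Biostrings class definition.
setClass("ToyAlignment",
         slots = c(score = "numeric", substitutionArray = "array"))

toy <- new("ToyAlignment",
           score = 1,
           substitutionArray = array(0, dim = c(100, 100, 100)))

# Serialized size with the large slot in place.
before <- length(serialize(toy, NULL))

# The workaround: swap the large slot for an empty array of the same class.
toy@substitutionArray <- array(numeric(0), dim = c(0, 0, 0))
validObject(toy)  # raises an error if the object is no longer valid

# Serialized size after emptying the slot.
after <- length(serialize(toy, NULL))
c(before = before, after = after)
```

Stripping the slot before returning from each worker keeps the payload that mclapply has to serialize small.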
1 day later
#
Hi Florian,

I just removed the 'substitutionArray' slot from PairwiseAlignments
objects in Biostrings 2.27.7. The slot didn't seem to be used/needed
by any downstream method.

   > packageVersion("Biostrings")
   [1] '2.27.7'
   > x <- "xxxabcdefghijklmnopqyyy"
   > y <- "abcdhijkzzzzlmnpqr"
   > pa <- pairwiseAlignment(x, y)
   > slotNames(pa)
   [1] "pattern"      "subject"      "type"         "score"        "gapOpening"
   [6] "gapExtension"
   > validObject(pa)
   [1] TRUE
   > object.size(pa)
   35528 bytes

... instead of 35308996 bytes! 3 orders of magnitude smaller :-)

Cheers,
H.
On 11/05/2012 03:45 AM, Hahne, Florian wrote:

  
    
#
Great Herve,
thanks a lot!
Florian