Hi all,
I just realized that serialized PairwiseAlignmentsSingleSubject objects
grow ridiculously large:
x <- "xxxabcdefghijklmnopqyyy"
y <- "abcdhijkzzzzlmnpqr"
pa <- pairwiseAlignment(x,y)
save(pa, file="~/tmp/pa.rda")
file.info("~/tmp/pa.rda")
                 size isdir mode               mtime               ctime
~/tmp/pa.rda 22651025 FALSE  644 2012-11-02 09:23:09 2012-11-02 09:23:09
                           atime   uid   gid    uname   grname
~/tmp/pa.rda 2012-11-02 09:23:07 11281 11281 hahnefl1 hahnefl1
22 MB for this trivial alignment seems to be a little excessive.
Interestingly, the object itself has a quite impressive memory footprint:
object.size(pa)
35308996 bytes
Any idea what is going on here? Looks like a memory leak to me.
Florian
sessionInfo()
R version 2.15.1 RC (2012-06-21 r59599)
Platform: i386-apple-darwin11.4.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.26.2 IRanges_1.16.2 BiocGenerics_0.4.0
[4] BiocInstaller_1.8.2
loaded via a namespace (and not attached):
[1] parallel_2.15.1 stats4_2.15.1 tools_2.15.1
--
[Bioc-devel] serializing pairwise alignment objects
7 messages · Hahne, Florian, Wolfgang Huber, Benilton Carvalho +1 more
Hi, I can reproduce this on more recent versions of everything:
sessionInfo()
R Under development (unstable) (2012-10-31 r61057)
Platform: x86_64-apple-darwin12.2.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Biostrings_2.27.5 IRanges_1.17.7 BiocGenerics_0.5.1 fortunes_1.5-0
loaded via a namespace (and not attached):
[1] stats4_2.16.0
Best wishes
Wolfgang
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hi,
Looks like Benilton is right:
> slotNames(pa)
[1] "pattern"           "subject"           "type"
[4] "score"             "substitutionArray" "gapOpening"
[7] "gapExtension"
> sapply(slotNames(pa), function(sname) object.size(slot(pa, sname)))
          pattern           subject              type             score
            17056             17056                96                48
substitutionArray        gapOpening      gapExtension
         35295336                48                48
I'm not sure why the substitutionArray would need to be stored in the
returned object (which downstream methods use it?). I would need to check.
H.
On 11/02/2012 09:41 AM, Benilton Carvalho wrote:
Ditto.
But isn't it just a consequence of the resulting object 'pa' containing the substitutionArray slot (a 100 x 100 x 441 array of doubles)? Maybe scoreOnly=TRUE is relevant in some cases?
b
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
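Benilton's estimate can be checked with plain arithmetic. The sketch below uses base R only and assumes, as stated in the thread, that the slot holds a dense 100 x 100 x 441 array of doubles at 8 bytes each. The variable names are illustrative and nothing here touches Biostrings itself.

```r
# Back-of-the-envelope check of the size reported for the substitutionArray slot.
# Assumption: a dense 100 x 100 x 441 array of doubles (8 bytes per element).
dims <- c(100, 100, 441)
n_elements <- prod(dims)           # number of doubles in the array
payload_bytes <- n_elements * 8    # raw data only, no header or attributes

# Build an array of the same shape and let R measure it.
a <- array(0, dim = dims)
measured_bytes <- as.numeric(object.size(a))

cat("payload bytes: ", payload_bytes, "\n")
cat("measured bytes:", measured_bytes, "\n")

# Share of the 35308996-byte object reported in the thread that the raw
# array data alone would account for.
cat("share of whole object:", round(payload_bytes / 35308996, 4), "\n")
```

The raw payload comes to 35,280,000 bytes, which is within about 15 KB of the 35,295,336 bytes reported for the slot. The remainder is presumably attributes such as dimnames, but that part is a guess.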
2 days later
Indeed. I did not look that far into the implementation; it just seemed odd to me that the objects got that inflated. scoreOnly is not really that helpful if you want to deal with the actual alignments. The only reasonable application I see for it is if you want to rank a bunch of sequences by pairwise similarity.
This gigantic memory footprint really breaks things once you start doing a lot of these pairwise alignment operations in parallel. mclapply complains about not being able to turn such large objects into a raw vector, and serializing to disk quickly fills your hard drive. You also lose a lot of the time gained by parallel processing just by writing and loading gigabytes of data...
I don't know enough about the internals of the PairwiseAlignments classes, but it seems that there must be a way to avoid having this huge array as part of the object. As a quick and dirty fix for now, I just replaced the substitutionArray slot with an empty matrix, and all the downstream operations that I wanted to do still work. It would be great if you could take a look into this, Herve.
Thanks,
Florian
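The effect of the workaround on serialization cost can be illustrated with a toy S4 class. This is only a sketch: ToyAlignment and its slots are hypothetical stand-ins, not the real PairwiseAlignments definition, and the array dimensions are taken from Benilton's remark. It uses serialize() from base R to measure the raw vector that would have to be moved between processes or written to disk.

```r
library(methods)

# Hypothetical stand-in for an alignment result: a small payload plus one
# large array slot. This is NOT the real Biostrings class definition.
setClass("ToyAlignment",
         slots = c(score = "numeric", bigArray = "array"))

toy <- new("ToyAlignment",
           score = 42,
           bigArray = array(0, dim = c(100, 100, 441)))

# Length of the raw vector produced by serialize(), in bytes.
bytes_before <- length(serialize(toy, NULL))

# Workaround in the spirit of the thread: swap the large slot for an empty array.
toy@bigArray <- array(numeric(0), dim = c(0, 0, 0))
validObject(toy)   # raises an error if the modified object were invalid

bytes_after <- length(serialize(toy, NULL))
cat("serialized bytes before:", bytes_before, "\n")
cat("serialized bytes after: ", bytes_after, "\n")
```

Whether blanking a slot is safe for a real object depends on which methods read that slot, so it should be treated as a stopgap rather than a fix.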
1 day later
Hi Florian,
I just removed the 'substitutionArray' slot from PairwiseAlignments
objects in Biostrings 2.27.7. The slot didn't seem to be used/needed
by any downstream method.
> packageVersion("Biostrings")
[1] ‘2.27.7’
> x <- "xxxabcdefghijklmnopqyyy"
> y <- "abcdhijkzzzzlmnpqr"
> pa <- pairwiseAlignment(x, y)
> slotNames(pa)
[1] "pattern"      "subject"      "type"         "score"        "gapOpening"
[6] "gapExtension"
> validObject(pa)
[1] TRUE
> object.size(pa)
35528 bytes
... instead of 35308996 bytes! 3 orders of magnitude smaller :-)
Cheers,
H.
Great Herve, thanks a lot!
Florian