[Bioc-devel] serializing pairwise alignment objects
Hi Florian,
I just removed the 'substitutionArray' slot from PairwiseAlignments
objects in Biostrings 2.27.7. The slot didn't seem to be used/needed
by any downstream method.
> packageVersion("Biostrings")
[1] ‘2.27.7’
> x <- "xxxabcdefghijklmnopqyyy"
> y <- "abcdhijkzzzzlmnpqr"
> pa <- pairwiseAlignment(x, y)
> slotNames(pa)
[1] "pattern"      "subject"      "type"         "score"        "gapOpening"
[6] "gapExtension"
> validObject(pa)
[1] TRUE
> object.size(pa)
35528 bytes
... instead of 35308996 bytes! 3 orders of magnitude smaller :-)
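For anyone who wants to see how a single oversized slot affects both object.size() and the raw vector that serialize() has to build, here is a minimal sketch in base R. ToyAlignment and its bigArray slot are made-up stand-ins for illustration, not the actual PairwiseAlignments class, so the numbers only show the effect and are not the Biostrings figures above.

```r
library(methods)

# Hypothetical toy class (NOT the Biostrings class): a small payload plus
# a large array slot standing in for 'substitutionArray'.
setClass("ToyAlignment",
         representation(score = "numeric", bigArray = "array"))

# Object carrying a 100 x 100 x 100 array of doubles (about 8 MB).
big <- new("ToyAlignment", score = 1,
           bigArray = array(0, dim = c(100, 100, 100)))

# Same object with the large slot replaced by an empty array.
small <- big
small@bigArray <- array(numeric(0), dim = c(0, 0, 0))

# In-memory size, and length of the raw vector produced by serialize().
size_big   <- as.numeric(object.size(big))
size_small <- as.numeric(object.size(small))
raw_big    <- length(serialize(big, NULL))
raw_small  <- length(serialize(small, NULL))

cat("in memory: ", size_big, "vs", size_small, "bytes\n")
cat("serialized:", raw_big, "vs", raw_small, "bytes\n")
```

The serialized length tracks the in-memory size closely here, so dropping an unused slot should shrink what gets written to disk or sent between processes by about the same factor.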
Cheers,
H.
On 11/05/2012 03:45 AM, Hahne, Florian wrote:
Indeed. I did not look that far into the implementation, it just seemed odd to me that the objects got that inflated. scoreOnly is not really that helpful if you want to deal with the actual alignments. The only reasonable application I see for it is if you want to rank a bunch of sequences by pairwise similarity.
This gigantic memory footprint really breaks things once you start doing a lot of these pairwise alignment operations in parallel. mclapply complains about not being able to turn such large objects into a raw vector, and serializing to disk quickly fills your hard drive. You also lose a lot of the time gained by parallel processing just by writing and loading gigabytes of data...
I don't know enough about the internals of the PairwiseAlignments classes, but it seems that there must be a way to avoid having this huge array as part of the object. As a quick and dirty fix for now I just replaced the substitutionArray slot with an empty matrix, and all the downstream operations that I wanted to do still work.
Would be great if you could take a look into this, Herve.
Thanks,
Florian
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319