[Bioc-devel] XVector: abstraction
Hi Michael,

The OnDiskXRaw virtual class (if this is what you're referring to) is still a very early work-in-progress. The idea is to experiment with on-disk representation of atomic vectors and direct random access to subsequences of the vector. The exact storage mode is implemented by concrete subclasses (currently only DirectRaw and SerializedRaw). OnDiskXRaw is actually analogous to SharedRaw, except that with the latter the "shared" sequence of bytes resides in memory.

If we had "on-disk" support for all atomic vectors, it sounds like it would then be easy to support "on-disk" versions of higher-level objects like IRanges or GRanges. They would be defined like their "in-memory" counterparts, except that the slots that are atomic vectors in the "in-memory" version would be replaced by "on-disk" atomic vectors. "On-disk" versions of DNAString (and even DNAStringSet) objects could also easily be implemented, e.g. by just making the "shared" slot an OnDiskXRaw object instead of a SharedRaw object.

Putting SharedRaw and OnDiskXRaw under the same umbrella (i.e. under a virtual class) and using that virtual class to specify the slot of higher-level objects like DNAString is tempting, but realistically we don't operate on an on-disk object like we do on an in-memory object.

Having an "on-disk" version of DNAString with direct random access was in fact the initial motivation for OnDiskXRaw. The use case was to support direct random access in BSgenome objects without having to change the way the chromosomes are stored on disk (they're stored as serialized raw vectors). I've finally implemented this feature (it will soon be pushed to BioC devel), but I changed the storage and didn't use OnDiskXRaw in the end.

H.
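To make the "same umbrella" idea above concrete, here is a minimal sketch in Python rather than R or C. It is not XVector code: the names RawStore, InMemoryRaw, OnDiskRaw and subseq are all hypothetical, and the on-disk class assumes the bytes are stored unserialized in a plain file. It only shows that a subsequence can be read by offset from either kind of storage through one interface.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class RawStore(ABC):
    """Hypothetical virtual class: a read-only sequence of bytes."""

    @abstractmethod
    def __len__(self) -> int:
        ...

    @abstractmethod
    def subseq(self, start: int, width: int) -> bytes:
        """Return `width` bytes beginning at 0-based offset `start`."""

    def _check(self, start: int, width: int) -> None:
        # Shared bounds check so both backends fail the same way.
        if start < 0 or width < 0 or start + width > len(self):
            raise IndexError("subsequence out of bounds")


class InMemoryRaw(RawStore):
    """Hypothetical in-memory backend: the bytes live in RAM."""

    def __init__(self, data: bytes):
        self._data = bytes(data)

    def __len__(self) -> int:
        return len(self._data)

    def subseq(self, start: int, width: int) -> bytes:
        self._check(start, width)
        return self._data[start:start + width]


class OnDiskRaw(RawStore):
    """Hypothetical on-disk backend: assumes a plain, unserialized file."""

    def __init__(self, path: str):
        self._path = path
        self._size = os.path.getsize(path)

    def __len__(self) -> int:
        return self._size

    def subseq(self, start: int, width: int) -> bytes:
        self._check(start, width)
        # Seek to the offset and read only the requested bytes.
        with open(self._path, "rb") as fh:
            fh.seek(start)
            return fh.read(width)


def demo():
    """Read the same subsequence through both backends."""
    payload = b"ACGTACGTTTGACA"
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
        stores = [InMemoryRaw(payload), OnDiskRaw(path)]
        return [store.subseq(4, 5) for store in stores]
    finally:
        os.remove(path)
```

Calling demo() reads bytes 4 through 8 from each backend. The sketch covers read access only; it does not address the point that on-disk and in-memory objects are not operated on in the same way in practice.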
On 12/05/2013 06:43 AM, Michael Lawrence wrote:
A nice goal for the XVector package would be full implementation of the R vector API on top of the already existing memory-sharing (rather than memory-duplicating) data structures. The actual storage mode of the data should obviously be abstracted, e.g., on-disk should be treated the same as the externalptr representation. Much of the implementation will need to be in C, unless we want to pay the price of extracting things into ordinary R vectors. Should the abstraction therefore be dropped down to the C level, so that the implementations can more easily share with one another? Anything to gain here from the externalVector package?
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319