[Bioc-devel] XVector: abstraction
On 12/09/2013 05:39 AM, Michael Lawrence wrote:
Any thoughts about using mmap(), so that SharedRaw and OnDiskRaw just operate on a pointer as the abstraction?
Martin mentioned mmap to me for this project, but I had some concerns about Windows compatibility. Are there CRAN or BioC packages that use it? It would be interesting to have a look at them.
H.
Michael
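For illustration only, here is a minimal sketch of the "one pointer-like abstraction" idea, written in Python rather than R/C. The helper names (view_in_memory, view_on_disk, subseq) are hypothetical and are not part of XVector; the point is only that an in-memory buffer and a memory-mapped file can be read through the same buffer interface.

```python
import mmap
import os
import tempfile


def view_in_memory(data: bytes) -> memoryview:
    """In-memory case: expose a byte buffer through the buffer interface."""
    return memoryview(bytearray(data))


def view_on_disk(path: str):
    """On-disk case: map the file and expose it through the same interface.

    Returns the view together with the mapping and file handle, which the
    caller must release/close in that order.
    """
    handle = open(path, "rb")
    mapping = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mapping), mapping, handle


def subseq(view: memoryview, start: int, width: int) -> bytes:
    """Extract a subsequence; identical code for both storage modes."""
    return bytes(view[start:start + width])


# Write a small payload to a temporary file to stand in for on-disk data.
payload = b"ACGTACGTNNACGT"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

mem = view_in_memory(payload)
disk, mapping, handle = view_on_disk(path)

in_memory_result = subseq(mem, 4, 6)
on_disk_result = subseq(disk, 4, 6)

# The view must be released before the mapping can be closed.
disk.release()
mapping.close()
handle.close()
os.remove(path)
```

Python's mmap module is available on both Unix and Windows, but that says nothing about how portable a C-level mmap() approach would be inside an R package.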
On Sun, Dec 8, 2013 at 11:39 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
Hi Michael,
The OnDiskXRaw virtual class (if this is what you're referring to)
is still a very early work-in-progress. The idea is to experiment
with on-disk representation of atomic vectors and direct random access
to subsequences of the vector. The exact storage mode is implemented by
concrete subclasses (currently only DirectRaw and SerializedRaw).
OnDiskXRaw is actually analogous to SharedRaw except that with the latter
the "shared" sequence of bytes resides in memory.
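As a rough illustration of "a virtual class with concrete storage modes", here is a minimal Python sketch. It is not the XVector implementation: the class names (OnDiskBytes, DirectBytes, HeaderedBytes) are hypothetical, and the fixed 4-byte header is a made-up stand-in for whatever a serialized file stores ahead of the payload.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class OnDiskBytes(ABC):
    """Hypothetical base class: random access to a byte vector in a file."""

    def __init__(self, path: str):
        self.path = path

    @abstractmethod
    def payload_offset(self) -> int:
        """File offset of the first byte of the vector."""

    def __len__(self) -> int:
        return os.path.getsize(self.path) - self.payload_offset()

    def subseq(self, start: int, width: int) -> bytes:
        """Read only the requested bytes, not the whole file."""
        if start < 0 or width < 0 or start + width > len(self):
            raise IndexError("subsequence out of bounds")
        with open(self.path, "rb") as f:
            f.seek(self.payload_offset() + start)
            return f.read(width)


class DirectBytes(OnDiskBytes):
    """The file contains the vector's bytes and nothing else."""

    def payload_offset(self) -> int:
        return 0


class HeaderedBytes(OnDiskBytes):
    """The file starts with a header (assumed fixed-size here)."""

    HEADER = b"HDR0"  # made-up header, for illustration only

    def payload_offset(self) -> int:
        return len(self.HEADER)


def write_temp(data: bytes) -> str:
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        return tmp.name


seq = b"ACGTACGTNNACGT"
direct_path = write_temp(seq)
headered_path = write_temp(HeaderedBytes.HEADER + seq)

direct = DirectBytes(direct_path)
headered = HeaderedBytes(headered_path)

direct_len, headered_len = len(direct), len(headered)
direct_chunk = direct.subseq(8, 4)
headered_chunk = headered.subseq(8, 4)

os.remove(direct_path)
os.remove(headered_path)
```

Only the offset computation differs between the two subclasses; the random-access logic lives in the base class.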
If we had "on-disk" support for all atomic vectors, it sounds like it
would then be easy to support "on-disk" versions of higher-level
objects like IRanges or GRanges. They would be defined as their
"in-memory" counterpart except that the slots that are atomic vectors
in the "in-memory" version would just need to be replaced by "on-disk"
atomic vectors. "On-disk" versions of DNAString (and even DNAStringSet)
objects could also easily be implemented e.g. by just making the
"shared" slot an OnDiskXRaw object instead of a SharedRaw object.
Putting SharedRaw and OnDiskXRaw under the same umbrella (i.e. under
a virtual class) and using that virtual class to specify the slot of
higher-level objects like DNAString is tempting but realistically we
don't operate on an on-disk object like we do on an in-memory object.
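To make the slot-replacement idea and its caveat concrete, here is a minimal Python sketch under stated assumptions. Ranges and OnDiskInts are hypothetical names, not IRanges or XVector classes, and the on-disk layout (4-byte little-endian integers) is chosen only for the example.

```python
import os
import struct
import tempfile
from dataclasses import dataclass
from typing import Sequence

INT = struct.Struct("<i")  # assumed layout: 4-byte little-endian signed int


class OnDiskInts:
    """Hypothetical integer vector whose elements are read from a file."""

    def __init__(self, path: str):
        self.path = path

    def __len__(self) -> int:
        return os.path.getsize(self.path) // INT.size

    def __getitem__(self, i: int) -> int:
        if not 0 <= i < len(self):
            raise IndexError(i)
        # Every element access opens and seeks the file.
        with open(self.path, "rb") as f:
            f.seek(i * INT.size)
            return INT.unpack(f.read(INT.size))[0]


def write_ints(values) -> str:
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for v in values:
            tmp.write(INT.pack(v))
        return tmp.name


@dataclass
class Ranges:
    """Toy ranges object; its slots accept any indexable integer vector."""

    start: Sequence[int]
    width: Sequence[int]

    def end(self, i: int) -> int:
        return self.start[i] + self.width[i] - 1


starts, widths = [1, 10, 100], [5, 3, 20]
start_path, width_path = write_ints(starts), write_ints(widths)

in_memory = Ranges(starts, widths)
on_disk = Ranges(OnDiskInts(start_path), OnDiskInts(width_path))

ends_in_memory = [in_memory.end(i) for i in range(3)]
ends_on_disk = [on_disk.end(i) for i in range(3)]

os.remove(start_path)
os.remove(width_path)
```

The same Ranges code works with either kind of slot, but each element access on the on-disk variant goes to the file, which is one reason code written for in-memory vectors would not carry over unchanged.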
Having an "on-disk" version of DNAString with direct random access was
in fact the initial motivation for OnDiskXRaw. The use case for this
was to support direct random access in BSgenome objects without having
to change the way the chromosomes are stored on disk (they're stored
as serialized raw vectors). I've finally implemented this feature (will
soon be pushed to BioC devel) but I changed the storage and didn't use
OnDiskXRaw in the end.
H.
On 12/05/2013 06:43 AM, Michael Lawrence wrote:
A nice goal for the XVector package would be a full implementation of
the R vector API on top of the already existing memory-sharing (rather
than memory-duplicating) data structures. The actual storage mode of
the data should obviously be abstracted, e.g., on-disk should be
treated the same as the externalptr representation. Much of the
implementation will need to be in C, unless we want to pay the price
of extracting things into ordinary R vectors. Should the abstraction
therefore be dropped down to the C level, so that the implementations
can more easily share code with each other? Is there anything to gain
here from the externalVector package?
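As a small illustration of memory-sharing versus memory-duplicating subsetting, here is a Python sketch using the standard memoryview type. It only demonstrates the general concept and is an analogy, not how XVector implements it.

```python
# A mutable byte buffer standing in for the shared sequence data.
buffer = bytearray(b"ACGTACGT")
view = memoryview(buffer)

# Memory-sharing subset: a window onto the same bytes, no copy made.
shared_window = view[2:6]

# Memory-duplicating subset: an independent copy of the same bytes.
copied_window = bytes(view[2:6])

# Modify the underlying buffer after both subsets were taken.
buffer[2] = ord("N")

shared_after = bytes(shared_window)  # reflects the modification
copied_after = copied_window         # still the original bytes
```

The shared window sees the change to the underlying buffer, while the copy does not.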
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319