idea for "virtual matrix/array" class
On Mon, 23 Aug 2004, Tony Plate wrote:
One idea I was thinking about was to have a new class of object that
referred to data in a file on disk, and which had all the standard methods
of matrices and arrays, i.e., subsetting ("["), dim, dimnames, etc. The
object in memory would only store the array attributes, while the actual
array data (the elements) would reside in a file. When some extraction
method was called, it would access data in the file and return the
appropriate data. With sensible use of seek operations, the data access
could probably be quite fast. The file format of the object on disk could
possibly be the standard serialized binary format as used in .RData
files. Of course, if the object was larger than would fit in memory, then
trying to extract too large a subarray would exhaust memory, but it should
be possible to efficiently extract reasonably sized subarrays. To be more
useful, one would want apply() to work with such arrays. That would
be doable, either by creating a new method for apply, or possibly just for
aperm.
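
A minimal sketch of the idea described above, under stated assumptions: the class name "filematrix" and its constructor are hypothetical, and the elements are stored as raw 8-byte doubles in column-major order rather than in the serialized .RData format. Only the dimensions live in memory; the "[" method seeks to each requested column and reads it from the file.

```r
# Hypothetical S3 class "filematrix": only the dim attribute is held in
# memory; the elements are 8-byte doubles stored column-major in a file.
filematrix <- function(path, dim) {
  structure(list(path = path, dim = dim), class = "filematrix")
}

dim.filematrix <- function(x) x$dim

"[.filematrix" <- function(x, i, j, ...) {
  nr <- x$dim[1]
  if (missing(i)) i <- seq_len(nr)
  if (missing(j)) j <- seq_len(x$dim[2])
  con <- file(x$path, "rb")
  on.exit(close(con))
  out <- matrix(NA_real_, length(i), length(j))
  for (k in seq_along(j)) {
    # Column j[k] starts at byte offset (j[k] - 1) * nr * 8.
    seek(con, where = (j[k] - 1) * nr * 8, origin = "start")
    col <- readBin(con, what = "double", n = nr, size = 8)
    out[, k] <- col[i]
  }
  out
}

# Usage: write a small matrix to disk, then extract a subarray from the file.
path <- tempfile(fileext = ".bin")
full <- matrix(as.numeric(1:20), nrow = 4, ncol = 5)
writeBin(as.vector(full), path, size = 8)

m <- filematrix(path, dim = c(4L, 5L))
sub <- m[2:3, c(1, 4)]
```

This sketch reads whole columns, so memory use is bounded by one column plus the requested subarray. A fuller version would also need dimnames, negative and logical indices, and arrays of more than two dimensions.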
This is what RPgSql does with proxy dataframes and what I did (read-only)
for netCDF access. It's a good idea if you have a data format for which
random access is fairly fast. I'm not sure that the standard serialized
binary format satisfies this. Fixed-format text files would work, but
free-format ones wouldn't -- seek() only helps when you can work out where
to seek without reading all the data.

-thomas
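
To illustrate the point about fixed-format files, here is a small sketch in which the record layout and the helper read_record() are assumptions made up for the example. Every record is exactly 10 characters plus a newline, so the byte offset of record k can be computed without reading anything before it.

```r
# Hypothetical fixed-format file: each record is 10 characters plus "\n",
# so every record occupies exactly 11 bytes.
path <- tempfile(fileext = ".txt")
values <- c(3.14159, 42, -0.5, 1000.25)
con <- file(path, "wb")  # binary mode keeps "\n" as a single byte
writeLines(formatC(values, width = 10, format = "f", digits = 4), con)
close(con)

read_record <- function(path, k, width = 10L) {
  con <- file(path, "rb")
  on.exit(close(con))
  # Record k starts at byte (k - 1) * (width + 1); no earlier data is read.
  seek(con, where = (k - 1) * (width + 1L), origin = "start")
  as.numeric(readChar(con, nchars = width))
}

third <- read_record(path, 3)
```

With a free-format file the fields vary in length, so the offset of record k is unknown until all earlier records have been scanned, and seek() gives no advantage.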