One way this could be handled, I think, is to leave the basic data
types that R presents to the user unchanged, but to offer different
implementation choices underneath those types. For example, an
"array of structures" is presently loaded into memory in its entirety
when accessed. For large data objects, this isn't ideal. An
alternative implementation would be to store such a big array on disk
and load into memory only the parts that are really needed. In other
words, the in-core representation would simply be a cache of the entire
data object. Of course, once this is done you could also vary the
external representation of objects. For example, instead of storing
the array elements next to each other, it could often be advantageous
to store the fields of the array next to each other (so that
operations like "compute the average of the .age field" could be
performed efficiently). Yet another variation might be to add
on-the-fly compression/decompression to minimize the size of the
external data file.
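To make the field-sequential case concrete, here is a rough sketch of
what the read path might look like (the file name, field offset, and
record count are invented for illustration, not part of any existing
interface):

    ## Hypothetical field-sequential read: the .age values sit
    ## contiguously on disk, so averaging them touches only those bytes.
    con <- file("people.dat", open = "rb")
    seek(con, where = age.offset)   # jump to the start of the .age field
    age <- readBin(con, what = "double", n = n.records)
    close(con)
    mean(age)                       # no other field was ever read

With the element-next-to-element layout, the same computation would
have to sweep over every record in the file.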
If this approach were taken, I'd imagine that R would continue to use
the "store entirely in memory" approach by default to maintain
backwards compatibility. At the same time, a few new functions could
be introduced that would allow precise control over how the object is
implemented. So when the user wants to deal with a large object, they
would create the object, set its implementation to something suitable
(e.g., cache-only, field-sequential layout, on-the-fly compression)
and then continue to use the object as usual.
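Purely as illustration, the user-level code might look something like
this (newBigArray and setImplementation are invented names, not
existing R functions):

    ## Hypothetical sketch of the proposed interface -- every name
    ## here is made up just to show the flavor of the idea.
    x <- newBigArray("people.dat",
                     fields = c(age = "double", income = "double"))
    x <- setImplementation(x, storage = "disk-cache",
                           layout = "field-sequential", compress = TRUE)
    mean(x$age)   # from here on, x is used like any ordinary R object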
Since I'm not familiar with the internals of R, I have no idea how
easy/hard this would be and I'd therefore appreciate hearing your
opinion on whether you think this would be a valuable and doable
extension.
In any case, thanks for working on R! I was excited to find that I
now have the option to use the S language on my Linux systems!
Cheers,
--david
--
David Mosberger, Ph.D; HP Labs; 1501 Page Mill Rd MS 1U17; Palo Alto, CA 94304
davidm@hpl.hp.com voice (650) 236-2575 fax 857-5100