
Implementing a single-precision class with raw

4 messages · Colin A. Smith, Brian Ripley

#
A package that I develop (xcms) sometimes needs to read and process 
vectors several hundreds of megabytes in size. (They only represent 
parts of large data sets, which can approach 100GB.) 
Unfortunately, R sometimes hits the 2GB memory limit of Win32. To help 
cut the memory footprint in half, I'm implementing a "float" class as a 
subclass of "raw". Because almost all the computation on the large 
vectors is done in C code, having a somewhat limited single-precision 
data type is okay.

I've run into a limitation with the .C() function where it does not 
handle raw vectors, which it will do in 2.2.0. In the meantime, I'm 
using the .Call() function to access the raw vectors. However, there 
don't seem to be any macros for handling raw vectors in Rdefines.h. 
I've made a guess at what those macros would be and was wondering 
whether my guesses were correct and/or might be included in 2.2.0:

#define NEW_RAW(n) allocVector(RAWSXP,n)
#define RAW_POINTER(x) (RAW(x))
#define AS_RAW(x) coerceVector(x,RAWSXP)

I'm not sure whether coerceVector(x,RAWSXP) will actually work. Also, 
there isn't an Rf_isRaw() function, which would be useful for an 
IS_RAW(x) macro.

Another issue with the "float" class is that it will run into endian 
issues if it ever gets saved to disk and moved cross-platform. I don't 
really anticipate that happening but it might be nice to incorporate 
serialization hooks if possible. Are there any facilities in R for 
doing that?
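
One way to sidestep the endian issue, independent of any R serialization facility, is to fix the byte order when writing to disk. The following is only a sketch: the helper names are hypothetical and it assumes a 32-bit IEEE float. It uses shifts rather than pointer casts, so the same code gives the same bytes on little-endian and big-endian hosts.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode a float as 4 bytes, most significant first.
   Assumes float is a 32-bit IEEE value. */
void float_to_be_bytes(float value, unsigned char out[4])
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);   /* copy the bit pattern, no conversion */
    out[0] = (unsigned char)(bits >> 24);
    out[1] = (unsigned char)(bits >> 16);
    out[2] = (unsigned char)(bits >> 8);
    out[3] = (unsigned char)bits;
}

/* Hypothetical helper: decode 4 big-endian bytes back into a float. */
float float_from_be_bytes(const unsigned char in[4])
{
    uint32_t bits = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                    ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Applied element by element when saving and loading, this would make a file written on one platform readable on another, at the cost of one pass over the data.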

Thanks for any feedback or suggestions.

-Colin

http://abagyan.scripps.edu/~csmith/float.R
http://abagyan.scripps.edu/~csmith/float.c

2 days later
#
On Fri, 19 Aug 2005, Colin A. Smith wrote:

            
The rw-FAQ explains why that is _not_ the limit!
Why via "raw"?

I believe the intention is that this sort of thing be done via external 
references, but as float and int are the same size on all current 
platforms, I would have considered R integers for storage. Then for 
example subsetting would work and you would have a 4x larger size limit on 64-bit 
platforms.  (You would also have got automatic handling of endianness.)
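
A minimal sketch of the integer-storage idea on the C side, assuming sizeof(float) == sizeof(int): the float's bit pattern is copied into an int slot without any numeric conversion and copied back out when it is needed. The function names are hypothetical and nothing here uses the R API; in a real package the int pointer would come from an R integer vector.

```c
#include <stddef.h>
#include <string.h>

/* Assumes sizeof(float) == sizeof(int), i.e. both are 32 bits. */

/* Return the bit pattern of a float as an int (no numeric conversion). */
int float_to_storage(float value)
{
    int bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Recover the float whose bit pattern is held in an int. */
float storage_to_float(int bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Sum a float vector held in int storage, accumulating in double. */
double sum_stored_floats(const int *storage, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += storage_to_float(storage[i]);
    return total;
}
```

Because each element occupies exactly one int slot, subsetting the integer vector in R selects the corresponding floats unchanged.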
That is just not true!
So?  We recommend using Rinternals.h: Rdefines.h is a compatibility 
wrapper for macros from S4.  The raw type has not attempted to be 
compatible with S4, and we are not aware of any user who has compiled S4 
code using raw vectors that (s)he wishes to port to R.

(The R-exts.texi manual has been rather too optimistic about Rdefines.h: 
as you need to use SET_STRING_ELT and SET_VECTOR_ELT in R, you are rather 
limited as to what you can do in S4 style.  This has been so since R 1.2.0 
and Rdefines.h has hardly been updated since.)
You should have read the code to find out (people answering your comment 
would have had to).  It will `actually work', but it may not do whatever 
it is that you expect.  (It interprets its input as integer (decimal if a 
string) representations of the bytes.)

This is in contrast to S, where I have no idea precisely what AS_RAW is 
supposed to do and no code to read.  (as(, "raw") seems to do weird and 
unpredictable things, though, and the Green Book suggests that coercion 
probably is not intended to work.)

For completeness I have added my (informed) guesses to Rdefines.h in 
R-devel.
Why would this be necessary?  TYPEOF(x) == RAWSXP is all that is needed.
See the comment above.
2 days later
#
On Aug 22, 2005, at 3:38 , Prof Brian Ripley wrote:

            
Sorry to be uninformed, I see from the Microsoft documentation that the 
limit can be increased to 3GB when you change the boot parameters of 
Windows and a flag in the R binary.
I was concerned that on some platforms (particularly 64 bit ones) an 
integer may actually be 64 bits. However, if all current platforms have 
standardized on float and int being the same size, then using the 
integer data type would be more elegant and solve any endian issues. 
(Is that really the case?) Another motivation for using raw was so 
other R functions (particularly internal ones) wouldn't assume the data 
was integer and produce erroneous results. I'd rather have a user-visible 
error thrown.

To have the best of both worlds, I'm now using an S4 class with a 
single slot (f) that stores the float data as an integer vector. It 
uses the extensible conversion mechanism created by Duncan Temple Lang to 
allow the .C() function to handle float objects. (The custom converter 
also allows proper checking for single-precision NA values, which don't 
get caught if passed as integers.)
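
To illustrate why an integer NA check misses float NAs: R's integer NA is INT_MIN, whose bit pattern (0x80000000) read as an IEEE single is negative zero, not a NaN. A float NA therefore needs its own test on the stored bits. The sketch below assumes a hypothetical NA marker, a quiet NaN carrying the payload 1954 by analogy with R's double NA; the marker the float package actually uses may differ.

```c
#include <stdint.h>

/* Hypothetical single-precision NA: quiet NaN (0x7FC00000) with payload
   1954 (0x7A2). This choice is an assumption made for illustration. */
#define FLOAT_NA_BITS UINT32_C(0x7FC007A2)

/* True if the stored bit pattern is exactly the NA marker. */
int bits_are_na(uint32_t bits)
{
    return bits == FLOAT_NA_BITS;
}

/* True if the stored bit pattern is any IEEE single NaN:
   exponent all ones and a nonzero mantissa. */
int bits_are_nan(uint32_t bits)
{
    return (bits & UINT32_C(0x7F800000)) == UINT32_C(0x7F800000) &&
           (bits & UINT32_C(0x007FFFFF)) != 0;
}
```

Testing the stored bits rather than a float value avoids depending on how a platform propagates NaN payloads through floating-point registers.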

At some point, I would like to release this for others to use. Before I 
get too far into it, I would greatly appreciate it if anyone had any 
ideas, unforeseen caveats, or other suggestions about my current 
direction. My implementation is available here:

http://abagyan.scripps.edu/~csmith/float.R
http://abagyan.scripps.edu/~csmith/float.c
http://abagyan.scripps.edu/~csmith/float_0.1.0.tar.gz
I should have been a bit more explicit and said that R 2.1.x doesn't 
have RAWSXP handlers in either RObjToCPtr() or CPtrToRObj(), so all the 
C code gets is a SEXP. R-devel does and passes raw vectors as unsigned 
char *. Of course that doesn't matter now that I'm using an extensible 
converter.

Cheers.

-Colin
#
On Wed, 24 Aug 2005, Colin A. Smith wrote:

            
Yes, configure tests for int being 32-bit, and R is set to fail to compile 
if not.  As we assume IEC60559 arithmetic, a float is 32-bit.
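
A package that relies on these sizes could repeat the check at compile time in its own C code. This is a sketch assuming a C11 compiler (for _Static_assert); it is not the test R's configure script performs, and storage_bits() is a hypothetical helper added only so the result can be inspected at run time.

```c
#include <float.h>
#include <limits.h>

/* Refuse to compile if int is not 32 bits wide. */
_Static_assert(sizeof(int) * CHAR_BIT == 32, "int must be 32 bits");

/* Refuse to compile if float and int differ in size. */
_Static_assert(sizeof(float) == sizeof(int), "float and int must be the same size");

/* Refuse to compile if float does not look like an IEC 60559 single:
   radix 2, 24-bit significand, maximum exponent 128. */
_Static_assert(FLT_RADIX == 2 && FLT_MANT_DIG == 24 && FLT_MAX_EXP == 128,
               "float must be IEC 60559 single precision");

/* Hypothetical helper: width in bits of one storage slot. */
int storage_bits(void)
{
    return (int)(sizeof(int) * CHAR_BIT);
}
```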