A package that I develop (xcms) sometimes needs to read and process vectors several hundred megabytes in size. (They represent only parts of large data sets, which can approach 100GB.) Unfortunately, R sometimes hits the 2GB memory limit of Win32. To help cut the memory footprint in half, I'm implementing a "float" class as a subclass of "raw". Because almost all the computation on the large vectors is done in C code, having a somewhat limited single-precision data type is okay.

I've run into a limitation with the .C() function where it does not handle raw vectors, which it will do in 2.2.0. In the meantime, I'm using the .Call() function to access the raw vectors. However, there don't seem to be any macros for handling raw vectors in Rdefines.h. I've made a guess at what those macros would be and was wondering whether my guesses were correct and/or might be included in 2.2.0:

#define NEW_RAW(n) allocVector(RAWSXP,n)
#define RAW_POINTER(x) (RAW(x))
#define AS_RAW(x) coerceVector(x,RAWSXP)

I'm not sure whether coerceVector(x,RAWSXP) will actually work. Also, there isn't an Rf_isRaw() function, which would be useful for an IS_RAW(x) macro.

Another issue with the "float" class is that it will run into endian issues if it ever gets saved to disk and moved cross-platform. I don't really anticipate that happening, but it might be nice to incorporate serialization hooks if possible. Are there any facilities in R for doing that?

Thanks for any feedback or suggestions.

-Colin

http://abagyan.scripps.edu/~csmith/float.R
http://abagyan.scripps.edu/~csmith/float.c
Implementing a single-precision class with raw
4 messages · Colin A. Smith, Brian Ripley
2 days later
On Fri, 19 Aug 2005, Colin A. Smith wrote:
A package that I develop (xcms) sometimes needs to read and process vectors several hundreds of megabytes in size. (They only represent parts of a large data sets which can approach nearly 100GB.) Unfortunately, R sometimes hits the 2GB memory limit of Win32.
The rw-FAQ explains why that is _not_ the limit!
To help cut the memory footprint in half, I'm implementing a "float" class as a subclass of "raw".
Why via "raw"? I believe the intention is that this sort of thing be done via external references, but as float and int are the same size on all current platforms, I would have considered R integers for storage. Then for example subsetting would work and you had a 4x larger size limit on 64-bit platforms. (You would also have got automatic handling of endianness.)
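As an illustration of the integer-storage idea above, here is a minimal sketch in plain C of keeping single-precision bit patterns in 32-bit integers. The function names (float_to_bits, bits_to_float, pack_floats) are hypothetical and are not taken from float.c; the sketch assumes float and int are both 32 bits, as stated in the reply.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption from the thread: float and int are the same size (32 bits). */
static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");

/* Copy the bit pattern of a float into a 32-bit integer.
   This is a reinterpretation, not a numeric conversion. */
int32_t float_to_bits(float f) {
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Recover the float from its stored bit pattern. */
float bits_to_float(int32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Pack n doubles into integer storage as single-precision bit patterns,
   halving the memory needed compared with doubles. */
void pack_floats(const double *src, int32_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = float_to_bits((float) src[i]);
}
```

memcpy is used rather than a pointer cast so that the reinterpretation does not depend on aliasing behaviour; on the R side the int32_t buffer would correspond to the data of an integer vector.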
Because almost all the computation on the large vectors is done in C code, having a somewhat limited single-precision data type is okay. I've run into a limitation with the .C() function where it does not handle raw vectors, which it will do in 2.2.0.
That is just not true!
In the meantime, I'm using the .Call() function to access the raw vectors. However, there don't seem to be any macros for handling raw vectors in Rdefines.h.
So? We recommend using Rinternals.h: Rdefines.h is a compatibility wrapper for macros from S4. The raw type has not attempted to be compatible with S4, and we are not aware of any user who has compiled S4 code using raw vectors that (s)he wishes to port to R. (The R-exts.texi manual has been rather too optimistic about Rdefines.h: as you need to use SET_STRING_ELT and SET_VECTOR_ELT in R, you are rather limited as to what you can do in S4 style. This has been so since R 1.2.0 and Rdefines.h has hardly been updated since.)
I've made a guess at what those macros would be and was wondering whether my guesses were correct and/or might be included in 2.2.0:
#define NEW_RAW(n) allocVector(RAWSXP,n)
#define RAW_POINTER(x) (RAW(x))
#define AS_RAW(x) coerceVector(x,RAWSXP)
I'm not sure whether coerceVector(x,RAWSXP) will actually work.
You should have read the code to find out (people answering your comment would have had to). It will `actually work', but it may not do whatever it is that you expect. (It interprets its input as integer (decimal if a string) representations of the bytes.) This is in contrast to S, where I have no idea precisely what AS_RAW is supposed to do and no code to read. (as(, "raw") seems to do weird and unpredictable things, though, and the Green Book suggests that coercion probably is not intended to work.) For completeness I have added my (informed) guesses to Rdefines.h in R-devel.
Also, there isn't an Rf_isRaw() function, which would be useful for an IS_RAW(x) macro.
Why would this be necessary? TYPEOF(x) == RAWSXP is all that is needed.
Another issue with the "float" class is that it will run into endian issues if it ever gets saved to disk and moved cross-platform. I don't really anticipate that happening but it might be nice to incorporate serialization hooks if possible. Are there any facilities in R for doing that?
See the comment above.
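For the raw-based approach in the question, the endianness concern can be sketched as follows: if the bytes are written in one fixed order, the file reads back identically on any platform. This is only an illustration in plain C; the function names are hypothetical, and it assumes a 32-bit float whose byte order matches that of 32-bit integers on the same machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Write the bit pattern of a float as 4 bytes, most significant byte
   first, regardless of the byte order of the host. */
void float_to_be_bytes(float f, unsigned char out[4]) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    out[0] = (unsigned char) (bits >> 24);
    out[1] = (unsigned char) (bits >> 16);
    out[2] = (unsigned char) (bits >> 8);
    out[3] = (unsigned char) bits;
}

/* Rebuild the float from 4 bytes stored most significant byte first. */
float float_from_be_bytes(const unsigned char in[4]) {
    uint32_t bits = ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16) |
                    ((uint32_t) in[2] << 8)  |  (uint32_t) in[3];
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Storing the values in an R integer vector, as suggested earlier in this reply, avoids the need for such code.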
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
2 days later
On Aug 22, 2005, at 3:38 , Prof Brian Ripley wrote:
Unfortunately, R sometimes hits the 2GB memory limit of Win32.
The rw-FAQ explains why that is _not_ the limit!
Sorry to be uninformed. I see from the Microsoft documentation that the limit can be increased to 3GB by changing the boot parameters of Windows and setting a flag in the R binary.
Why via "raw"? I believe the intention is that this sort of thing be done via external references, but as float and int are the same size on all current platforms, I would have considered R integers for storage. Then for example subsetting would work and you had a 4x larger size limit on 64-bit platforms. (You would also have got automatic handling of endianness.)
I was concerned that on some platforms (particularly 64-bit ones) an integer may actually be 64 bits. However, if all current platforms have standardized on float and int being the same size, then using the integer data type would be more elegant and solve any endian issues. (Is that really the case?)

Another motivation for using raw was so that other R functions (particularly internal ones) wouldn't assume the data was integer and produce erroneous results. I'd rather have a user-visible error thrown.

To have the best of both worlds, I'm now using an S4 class with a single slot (f) that stores the float data as an integer vector. It uses the extensible conversion mechanism created by Duncan Temple Lang to allow the .C() function to handle float objects. (The custom converter also allows proper checking for single-precision NA values, which don't get caught if passed as integers.)

At some point, I would like to release this for others to use. Before I get too far into it, I would greatly appreciate it if anyone had any ideas, unforeseen caveats, or other suggestions about my current direction. My implementation is available here:

http://abagyan.scripps.edu/~csmith/float.R
http://abagyan.scripps.edu/~csmith/float.c
http://abagyan.scripps.edu/~csmith/float_0.1.0.tar.gz
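One caveat that may be worth checking with integer storage: R represents NA_integer_ as the smallest 32-bit integer, and the single-precision value -0.0 has exactly that bit pattern. The sketch below, in plain C, only demonstrates the collision; the macro and function names are hypothetical, and how float.c actually encodes its NA is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");

/* Hypothetical name: the bit pattern R uses for NA_integer_ (INT_MIN). */
#define NA_INTEGER_BITS INT32_MIN

/* Return 1 if storing this float's bit pattern in an R integer vector
   would make the element look like NA_integer_, and 0 otherwise. */
int float_bits_collide_with_na_integer(float f) {
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits == NA_INTEGER_BITS;
}
```

Under these assumptions the function returns 1 for -0.0f and 0 for ordinary values such as 0.0f or 1.0f, so R code that inspects the integer slot directly would report NA wherever a negative zero was stored.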
I've run into a limitation with the .C() function where it does not handle raw vectors, which it will do in 2.2.0.
That is just not true!
I should have been a bit more explicit and said that R 2.1.x doesn't have RAWSXP handlers in either RObjToCPtr() or CPtrToRObj(), so all the C code gets is a SEXP. R-devel does and passes raw vectors as unsigned char *. Of course that doesn't matter now that I'm using an extensible converter. Cheers. -Colin
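For illustration, a C routine written for the unsigned char * convention described above might look like the following sketch. The routine name and argument layout are hypothetical; the only assumptions are that every .C() argument arrives as a pointer and that the raw bytes hold consecutive 32-bit floats in the host byte order.

```c
#include <assert.h>
#include <string.h>

static_assert(sizeof(float) == 4, "float must be 32 bits");

/* .C()-style entry point: all arguments are pointers, nothing is returned,
   and the result is written back through the last argument.
   x      : raw bytes holding consecutive floats
   nbytes : number of bytes in x
   result : receives the sum of the floats, accumulated in double */
void float_raw_sum(unsigned char *x, int *nbytes, double *result) {
    int n = *nbytes / (int) sizeof(float);
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        float f;
        /* memcpy avoids alignment and aliasing problems with a byte buffer */
        memcpy(&f, x + (size_t) i * sizeof(float), sizeof f);
        total += f;
    }
    *result = total;
}
```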
On Wed, 24 Aug 2005, Colin A. Smith wrote:
On Aug 22, 2005, at 3:38 , Prof Brian Ripley wrote:
Unfortunately, R sometimes hits the 2GB memory limit of Win32.
The rw-FAQ explains why that is _not_ the limit!
Sorry to be uninformed. I see from the Microsoft documentation that the limit can be increased to 3GB by changing the boot parameters of Windows and setting a flag in the R binary.
Why via "raw"? I believe the intention is that this sort of thing be done via external references, but as float and int are the same size on all current platforms, I would have considered R integers for storage. Then for example subsetting would work and you had a 4x larger size limit on 64-bit platforms. (You would also have got automatic handling of endianness.)
I was concerned that on some platforms (particularly 64 bit ones) an integer may actually be 64 bits. However, if all current platforms have standardized on float and int being the same size, then using the integer data type would be more elegant and solve any endian issues. (Is that really the case?)
Yes, configure tests for int being 32-bit, and R is set to fail to compile if not. As we assume IEC 60559 arithmetic, a float is 32-bit.
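A package that relies on these sizes could repeat the check at compile time in its own C code. The following is a minimal sketch using C11 static_assert; it is not the test that R's configure script performs, and the helper function name is hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* Refuse to compile unless int and float are both 32 bits wide. */
static_assert(sizeof(int) * CHAR_BIT == 32, "int must be 32 bits");
static_assert(sizeof(float) * CHAR_BIT == 32, "float must be 32 bits");

/* IEC 60559 single precision has a 24-bit significand. */
static_assert(FLT_MANT_DIG == 24, "float must be IEC 60559 single precision");

/* Run-time view of the same fact, for use in ordinary code paths. */
int float_fits_in_int(void) {
    return sizeof(float) == sizeof(int);
}
```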
Another motivation for using raw was so that other R functions (particularly internal ones) wouldn't assume the data was integer and produce erroneous results. I'd rather have a user-visible error thrown. To have the best of both worlds, I'm now using an S4 class with a single slot (f) that stores the float data as an integer vector. It uses the extensible conversion mechanism created by Duncan Temple Lang to allow the .C() function to handle float objects. (The custom converter also allows proper checking for single-precision NA values, which don't get caught if passed as integers.) At some point, I would like to release this for others to use. Before I get too far into it, I would greatly appreciate it if anyone had any ideas, unforeseen caveats, or other suggestions about my current direction. My implementation is available here: http://abagyan.scripps.edu/~csmith/float.R http://abagyan.scripps.edu/~csmith/float.c http://abagyan.scripps.edu/~csmith/float_0.1.0.tar.gz
I've run into a limitation with the .C() function where it does not handle raw vectors, which it will do in 2.2.0.
That is just not true!
I should have been a bit more explicit and said that R 2.1.x doesn't have RAWSXP handlers in either RObjToCPtr() or CPtrToRObj(), so all the C code gets is a SEXP. R-devel does and passes raw vectors as unsigned char *. Of course that doesn't matter now that I'm using an extensible converter. Cheers. -Colin
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595