How to allocate STRSXP outside of gc

2 messages · Vadim Ogranovich, Jeffrey Horner

#
Yes, and space sharing also improves speed since gc() does not need to
collect so many objects.

I thought about more efficient formats for my data, but:
* ASCII is ubiquitous. You have grep, head, perl, etc. to work with the files
* AFAIK, there is no industry-standard binary format with a mature
supporting C library (especially when the data needs to be compressed).
I considered HDF and netCDF.
* the programs that collect my data store it in ASCII. It is
advantageous to be able to read it directly from the original files. (I
have about 200G of these compressed)
* C code was able to read the data at a decent speed; it was R's
overhead that was causing problems. One source was mkChar, the other
was how chars are read from a connection. I detailed my findings in a
message to r-devel.

I tried to see if I could improve the original R code for I/O, but for
various reasons decided that I wouldn't be able to accomplish this. In
the end I decided to write a custom R I/O package, which came close to the
speed of raw C code (the difference is largely due to the lookup
overhead).
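
To illustrate the trade-off described above, here is a minimal C sketch of space sharing through a lookup table: each distinct string is stored once, and later occurrences reuse the stored copy at the cost of one hash lookup per string. This is not the code from the package mentioned in this message and it does not use R's API; StringPool, pool_intern, pool_free and the fixed table size are hypothetical names and choices made only for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed table size; this sketch never resizes the table. */
#define POOL_BUCKETS 1024

typedef struct PoolEntry {
    char *text;               /* the single stored copy of the string */
    struct PoolEntry *next;   /* next entry in the same bucket */
} PoolEntry;

typedef struct {
    PoolEntry *buckets[POOL_BUCKETS];
    size_t unique;            /* number of distinct strings stored */
} StringPool;

/* FNV-1a style hash over the bytes of a NUL-terminated string. */
static unsigned long hash_string(const char *s)
{
    unsigned long h = 2166136261UL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619UL;
    }
    return h;
}

/* Return the shared copy of s, storing it first if it has not been seen.
 * The pool must be zero-initialized before first use.
 * Returns NULL if memory allocation fails. */
const char *pool_intern(StringPool *pool, const char *s)
{
    unsigned long b = hash_string(s) % POOL_BUCKETS;

    /* The lookup: this is the per-string overhead that sharing costs. */
    for (PoolEntry *e = pool->buckets[b]; e != NULL; e = e->next) {
        if (strcmp(e->text, s) == 0)
            return e->text;
    }

    /* Not seen before: allocate one copy and link it into the bucket. */
    PoolEntry *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    size_t n = strlen(s) + 1;
    e->text = malloc(n);
    if (e->text == NULL) {
        free(e);
        return NULL;
    }
    memcpy(e->text, s, n);
    e->next = pool->buckets[b];
    pool->buckets[b] = e;
    pool->unique++;
    return e->text;
}

/* Release every stored string and reset the pool to its empty state. */
void pool_free(StringPool *pool)
{
    for (size_t i = 0; i < POOL_BUCKETS; i++) {
        PoolEntry *e = pool->buckets[i];
        while (e != NULL) {
            PoolEntry *next = e->next;
            free(e->text);
            free(e);
            e = next;
        }
        pool->buckets[i] = NULL;
    }
    pool->unique = 0;
}
```

A caller would start from a zero-initialized pool (StringPool pool = {0};) and call pool_intern once per field read. Repeated values then cost one lookup each instead of one allocation each, so far fewer objects exist for a collector to track.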

Thanks,
Vadim
#
Vadim Ogranovich wrote:
[...]
[...]

Interesting. I just finished reading a little about HDF's new format, HDF5,
and their web documentation claims it's flexible enough to store
compressed or chunked data:

http://hdf.ncsa.uiuc.edu/whatishdf5.html

Also, you mentioned that you like line-oriented ASCII files since many
UNIX utilities work with them, but have you considered NCO, a collection
of UNIX utilities for processing netCDF files?

http://nco.sourceforge.net/