
How allocate STRSXP outside of gc

Yes, and space sharing also improves speed since gc() does not need to
collect so many objects.

I thought about more efficient formats for my data, but:
* ASCII is ubiquitous. You have grep, head, perl, etc. to work with
ASCII files.
* AFAIK, there is no industry-standard binary format with a mature
supporting C library (especially when the data needs to be compressed).
I considered HDF and netcdf.
* the programs that collect my data store it in ASCII. It is
advantageous to be able to read it directly from the original files. (I
have about 200G of these compressed)
* C code was able to read the data at a decent speed; it was R's
overhead that was causing problems. One source of overhead was mkChar,
the other was how chars are read from a connection. I detailed my
findings in a message to r-devel.

I tried to see if I could improve the original R code for IO, but for
various reasons decided that I wouldn't be able to accomplish this. In
the end I decided to write a custom R IO package which came close to the
speed of raw C code (the difference is largely due to the lookup
overhead).
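
To illustrate what space sharing with a lookup might look like, here is a
minimal sketch in plain C. It is not the code from my package and does
not use R's API: the names intern, intern_count and NBUCKETS are
hypothetical, and a fixed-size chained hash table is only an assumption
about how such a lookup could be done. Identical strings are stored once,
and every later occurrence costs one hash lookup instead of a new
allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024  /* hypothetical table size; must be a power of two */

/* One stored string per entry; collisions are chained. */
typedef struct Entry {
    struct Entry *next;
    char *str;
} Entry;

static Entry *table[NBUCKETS];
static size_t n_allocated = 0;  /* number of distinct strings stored */

/* FNV-1a style hash over the bytes of s. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 2166136261UL;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619UL;
    }
    return h;
}

/* Return a shared copy of s. Equal strings always yield the same
   pointer, so a repeated value allocates nothing new.
   Returns NULL if memory allocation fails. */
const char *intern(const char *s)
{
    assert(s != NULL);
    unsigned long b = hash_str(s) & (NBUCKETS - 1);

    /* This search is the per-string lookup overhead. */
    for (Entry *e = table[b]; e != NULL; e = e->next)
        if (strcmp(e->str, s) == 0)
            return e->str;

    /* Not seen before: store one copy and link it into the bucket. */
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    size_t len = strlen(s) + 1;
    e->str = malloc(len);
    if (e->str == NULL) {
        free(e);
        return NULL;
    }
    memcpy(e->str, s, len);
    e->next = table[b];
    table[b] = e;
    n_allocated++;
    return e->str;
}

/* Number of distinct strings stored so far. */
size_t intern_count(void)
{
    return n_allocated;
}
```

Calling intern() on the same text twice returns the same pointer, so a
column with many repeated values ends up with few distinct allocations.
The price is the hash and strcmp on every call, which is the kind of
lookup cost that raw C code reading into a plain buffer does not pay.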

Thanks,
Vadim