good and bad ways to import fixed column data (rpy)

4 messages · Ross Boylan, Gabor Grothendieck, Wensui Liu

#
Recorded here so others may avoid my mistakes.

I have a bunch of files containing fixed-width data.  The R Data guide
suggests pre-processing such files with a script if they are large.
Mine were 50 MB and up, and I needed to process another file that gave
the layout of the lines anyway.

I tried using rpy not only to preprocess the files but also to create
the R data object in one go.  It seemed like a good idea; it wasn't.
The core operation was to build up a string for each line that looked
like "data.frame(var1=val1, var2=val2, [etc])" and then rbind the result
to the data.frame built so far.  I did this with r(mycommand string).
Almost all the values were numeric.

This was incredibly slow: it had not finished after running overnight.

So, the lesson is, don't do that!

I switched to preprocessing that created a csv file, and then read.csv
from R.  This worked in under a minute.  The result had dimension 150913
x 129.
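
A minimal sketch of that kind of preprocessing step, written for
Python 3 rather than the Python 2.5 used at the time.  The column
names, offsets and widths in LAYOUT are hypothetical; in practice they
would be parsed from the separate layout file mentioned above.

```python
import csv
import io

# Hypothetical layout: (column name, 0-based start offset, width).
# The real layout would come from the separate layout file.
LAYOUT = [("id", 0, 5), ("age", 5, 3), ("score", 8, 6)]


def fixed_to_csv(lines, out, layout=LAYOUT):
    """Write fixed-width input lines to `out` as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow([name for name, _, _ in layout])
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Slice each field out by position and trim the padding.
        writer.writerow(
            [line[start:start + width].strip() for _, start, width in layout]
        )


# Small in-memory demonstration; real use would pass open file objects.
sample = io.StringIO("00001 42  3.50\n00002  7 12.25\n")
result = io.StringIO()
fixed_to_csv(sample, result)
print(result.getvalue())
```

With real files, the output handle would be opened with newline="" as
the csv module documentation recommends, and the resulting file is then
read in a single call to read.csv from R.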

The good news in rpy was that I found objects persisted across calls to
the r object.

Exactly why this was so slow I don't know.  The two obvious suspects are
the speed of rbind, which I think is pretty inefficient, and the
overhead of crossing the python/R boundary.
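
A rough way to size the first suspect, under the assumption (not
measured here) that each rbind call copies every row accumulated so
far: appending row k then costs k row copies, so n rows cost
n(n+1)/2 copies in total.  The function name is made up for this
illustration.

```python
def rows_copied(n_rows):
    """Total rows copied when appending one row at a time.

    Assumes every append copies all rows accumulated so far,
    so the k-th append copies k rows: 1 + 2 + ... + n.
    """
    return n_rows * (n_rows + 1) // 2


# Row count taken from the data set described above.
print(rows_copied(150913))
```

For the 150913 rows here that comes to on the order of 10^10 row
copies, each 129 columns wide, before counting any per-call overhead
at the python/R boundary.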

This was on Debian Lenny:
python-rpy                    1.0.3-2
Python 2.5.2
R 2.7.1

rpy2 is not available in Lenny, though it is in development versions of
Debian.

Ross Boylan
#
Check out ?read.fwf
On Sun, Aug 16, 2009 at 4:49 PM, Ross Boylan<ross at biostat.ucsf.edu> wrote:
#
Just to quote explicitly the passage I mentioned in the R Data document:


<QUOTE>
   Function `read.fwf' provides a simple way to read such files,
specifying a vector of field widths.  The function reads the file into
memory as whole lines, splits the resulting character strings, writes
out a temporary tab-separated file and then calls `read.table'.  This
is adequate for small files, but for anything more complicated we
recommend using the facilities of a language like `perl' to pre-process
the file.  
</QUOTE>

Note particularly the final sentence.

Ross
On Sun, 2009-08-16 at 19:37 -0400, Wensui Liu wrote: