Recorded here so others may avoid my mistakes.

I have a bunch of files containing fixed-width data. The R Data guide suggests that one pre-process them with a script if they are large. They were 50 MB and up, and I needed to process another file that gave the layout of the lines anyway.

I tried rpy to not only preprocess but also create the R data object in one go. It seemed like a good idea; it wasn't. The core operation was to build up a string for each line that looked like "data.frame(var1=val1, var2=val2, [etc])" and then rbind this to the data.frame so far. I did this with r(mycommand string). Almost all the values were numeric.

This was incredibly slow: it had not completed after running overnight. So, the lesson is, don't do that!

I switched to preprocessing that created a csv file, and then read.csv from R. This worked in under a minute. The result had dimension 150913 x 129.

The good news in rpy was that I found objects persisted across calls to the r object.

Exactly why this was so slow I don't know. The two obvious suspects are the speed of rbind, which I think is pretty inefficient, and the overhead of crossing the python/R boundary.

This was on Debian Lenny:
python-rpy 1.0.3-2
Python 2.5.2
R 2.7.1
rpy2 is not available in Lenny, though it is in development versions of Debian.

Ross Boylan
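On the first suspect, a rough cost model may help. The sketch below assumes that each rbind copies every row accumulated so far before adding the new one; that is an assumption about how the data frame grows, not something measured in this thread, and it ignores the python/R boundary cost entirely. The function names are made up for illustration.

```python
def rows_copied_by_repeated_append(n_rows):
    """Total rows copied if every append first copies all existing rows.

    Assumed model only: appending row k copies the k rows already present.
    """
    copied = 0
    for existing in range(n_rows):
        copied += existing  # copy everything accumulated so far
    return copied


def rows_copied_by_single_read(n_rows):
    """Total rows handled when the table is built in one pass."""
    return n_rows


# Row count reported for the final data frame in the post.
n = 150913
ratio = rows_copied_by_repeated_append(n) / rows_copied_by_single_read(n)
print("repeated append copies about %.0f times as many rows" % ratio)
```

Under this model the work grows with the square of the row count, so for a table of this size the row-by-row approach does tens of thousands of times more copying than a single read, before any per-call overhead is counted.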
good and bad ways to import fixed column data (rpy)
4 messages · Ross Boylan, Gabor Grothendieck, Wensui Liu
Check out ?read.fwf
On Sun, Aug 16, 2009 at 4:49 PM, Ross Boylan<ross at biostat.ucsf.edu> wrote:
[...]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090816/c9d8c6ab/attachment-0001.pl>
Just to quote explicitly the passage I mentioned in the R Data document:

<QUOTE>
Function `read.fwf' provides a simple way to read such files, specifying a vector of field widths. The function reads the file into memory as whole lines, splits the resulting character strings, writes out a temporary tab-separated file and then calls `read.table'. This is adequate for small files, but for anything more complicated we recommend using the facilities of a language like `perl' to pre-process the file.
</QUOTE>

Note particularly the final sentence.

Ross
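For anyone who wants to try the pre-processing route, here is a minimal sketch in Python rather than perl. It is not the script used for the files discussed in this thread: the layout (column names and widths) and the sample lines are invented for illustration, and a real layout would come from the separate layout file.

```python
import csv
import io

# Hypothetical layout: (column name, field width) in file order.
LAYOUT = [("id", 5), ("age", 3), ("score", 6)]


def split_fixed_width(line, widths):
    """Cut one line into fields of the given widths, trimming padding."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


def fixed_width_to_csv(lines, layout, out):
    """Write a header row plus one CSV row per non-empty input line."""
    names = [name for name, _ in layout]
    widths = [width for _, width in layout]
    writer = csv.writer(out)
    writer.writerow(names)
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            writer.writerow(split_fixed_width(line, widths))


# Invented sample data matching the hypothetical layout above.
sample = ["00001 42  3.50\n", "00002  7 12.25\n"]
buffer = io.StringIO()
fixed_width_to_csv(sample, LAYOUT, buffer)
csv_text = buffer.getvalue()
```

In practice `lines` would be an open file object and `out` a file opened for writing, so the input is streamed one line at a time; the resulting file can then be read with read.csv in a single call.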
On Sun, 2009-08-16 at 19:37 -0400, Wensui Liu wrote:
Gabor made a good point.
Here is an example I copied from my blog.
##############################################
# READ FIXED-WIDTH DATA FILE WITH read.fwf() #
# ------------------------------------------ #
# EQUIVALENT SAS CODE: #
# filename data 'E:\sas\fixed.txt'; #
# data test; #
# infile data truncover; #
# input @1 city $ 1 - 22 @23 population; #
# run; #
##############################################
# OPEN A CONNECTION TO THE DATA FILE
data <- file(description = "e:\\sas\\fixed.txt", open = "r")
# widths = c(...) ==> SPECIFIES COLUMN WIDTHS
# col.names = c(...) ==> GIVES COLUMN NAMES
# colClasses = c(...) ==> DEFINES COLUMN CLASSES
test <- read.fwf(data, header = FALSE, widths = c(22, 10),
                 col.names = c("city", "population"),
                 colClasses = c("character", "numeric"))
# CLOSE THE CONNECTION
close(data)
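To try the example above without the original fixed.txt, a small test file in the same layout (22 characters of city, then 10 of population) can be generated. This is only a sketch: the records are invented, and the helper names are hypothetical.

```python
def format_record(city, population):
    """Pad city to 22 characters (left-justified, truncated if longer)
    and population to 10 characters (right-justified)."""
    return "{:<22.22}{:>10d}".format(city, population)


# Invented records, not real data.
records = [("Alpha City", 12345), ("Beta Town", 678)]
lines = [format_record(city, population) for city, population in records]


def write_sample(path):
    """Write the sample records to path, one fixed-width line each."""
    with open(path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
```

Calling write_sample with a path of your choice and pointing file() at that path should let the read.fwf call run as written. A population with more than 10 digits would overflow its field, so this only suits small test values.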
On Sun, Aug 16, 2009 at 6:36 PM, Gabor Grothendieck<ggrothendieck at gmail.com> wrote:
[...]
--
==============================
WenSui Liu
Blog : statcompute.spaces.live.com
Tough Times Never Last. But Tough People Do. - Robert Schuller
==============================