
sscanf equivalent

3 messages · Paul Roebuck, Brian Ripley

#
I have a data file from which I need to read portions of the
data, but the location and quantity of the data can change
from file to file. I wrote some code and have a working
solution, but it seems wasteful to have to do it this way.
Here is a contrived, incomplete version of the code.

    datalines <- readLines(datafile.pathname)
    # marker will appear on line preceding and following
    # actual data
    offset.data <- grep("marker", datalines)
    datalines <- NULL

    # grab first column of each assoc dataline
    data <- scan(datafile.pathname,
                 what = numeric(0),
                 skip = offset.data[1],
                 nlines = offset.data[2]-offset.data[1]-1,
                 flush = TRUE,
                 multi.line = FALSE,
                 quiet = TRUE)
    # output is vector of values

I originally wrote code to parse the data from 'datalines'
using sub and strsplit, but it was woefully slower and more
complex than using scan. What I want is a way to call
something like scan on data already in memory instead of
on a filename.

----------------------------------------------------------
SIGSIG -- signature too long (core dumped)
#
Why not use a text connection?
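
A minimal sketch of what the text connection suggestion could look like, assuming the lines are already in memory. The 'datalines' vector here is a made-up stand-in for the output of readLines, and the scan arguments simply mirror the ones in the original post.

```r
# Hypothetical in-memory stand-in for readLines(datafile.pathname)
datalines <- c("header text",
               "marker",
               "1.5 a b",
               "2.5 c d",
               "3.5 e f",
               "marker",
               "trailer text")

# Line numbers of the two marker lines surrounding the data
offset.data <- grep("marker", datalines)

# Wrap the character vector in a connection so scan can read from it
tconn <- textConnection(datalines)
data <- scan(tconn,
             what = numeric(0),
             skip = offset.data[1],
             nlines = offset.data[2] - offset.data[1] - 1,
             flush = TRUE,        # keep only the first field of each line
             multi.line = FALSE,
             quiet = TRUE)
close(tconn)
```

The file is read only once this way, since grep and scan both work from the same character vector.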
On Fri, 7 Oct 2005, Paul Roebuck wrote:

            

  
    
1 day later
#
On Fri, 7 Oct 2005, Prof Brian Ripley wrote:

            
I tried that but result was far slower than the method above.

R> file.info(datafile.pathname)$size
[1] 944850
R> system.time(datalines<-readLines(datafile.pathname), TRUE)[3]
[1] 0.59
R> length(datalines)
[1] 67931
R> system.time(tconn<-textConnection(datalines), TRUE)[3]
[1] 52.97

Once the textConnection object was created, the scan call
using it took less than half the time of the corresponding
filename-based call. The problem is that the filename-based
scan was only taking about a second to begin with, so the
53 seconds spent creating the connection swamps any gain.
And since grep does not accept a textConnection as an
argument, I still need the otherwise unused 'datalines'
variable and its associated memory. Even if grep did support
that, creating the connection without the intermediate
variable takes even longer:

R> system.time(tconn<-textConnection(readLines(datafile.pathname)), TRUE)[3]
[1] 66.61
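
One possible workaround, sketched under the assumption that the cost of textConnection grows with the number of lines handed to it: build the connection over only the lines between the markers rather than the whole file. The helper name read_between_markers and its arguments are hypothetical, and whether this removes the slowdown on R 2.1.1 has not been measured here.

```r
# Hypothetical helper; the name and arguments are illustrative only.
read_between_markers <- function(lines, pattern = "marker") {
    offsets <- grep(pattern, lines)
    if (length(offsets) < 2) stop("expected at least two marker lines")
    first <- offsets[1] + 1
    last  <- offsets[2] - 1
    if (last < first) return(numeric(0))   # markers are adjacent: no data
    # Only the data lines go into the connection, not the whole file
    tconn <- textConnection(lines[first:last])
    on.exit(close(tconn))
    scan(tconn, what = numeric(0), flush = TRUE, quiet = TRUE)
}

# Usage with a small temporary file standing in for the real data file
tmp <- tempfile()
writeLines(c("preamble", "marker", "10 x", "20 y", "marker", "postamble"), tmp)
values <- read_between_markers(readLines(tmp))
unlink(tmp)
```

Because the subset is taken before the connection is created, skip and nlines are no longer needed in the scan call.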


Any other thoughts?


# R version 2.1.1, 2005-06-20, powerpc-apple-darwin7.9.0
