help reading a variably formatted text file

7 messages · Jason Turner, Corey Moffet, Michael Na Li +2 more

#
This looks possible in R, and your algorithm looks precise enough.
The help pages for file, readLines, and scan should cast some
light.

For jobs like this I tend to use Perl, however.  Familiarity
is one reason:  I'm more comfortable with Perl for scanning/parsing 
files.  Also, Perl was originally written for exactly this sort of
thing.

Cheers

Jason
#
Dear R-Help,

I have a generated file that looks like the following:

----- Begin file -----
 #
 #       Output File
 #
 float   Version      2002.700000000000
 int     Numdays         31
 int     NumOFEs          1
 #
 #       Hillslope-specific variables
 #
 char    HillVarNames[ 3 ]
         {Days In Simulation}                         
         {Hillslope: Precipitation (mm)}              
         {Hillslope: Average detachment (kg/m**2)}    
 #
 #       OFE-specific variables
 #
 char    OFEVarNames[ 3 ]
         {Irrigation depth (mm)}                      
         {Irrigation_volume_supplied/unit_area (mm)}  
         {Runoff (mm)}                                
 #
 #       Daily values:
 #
     1    5.40000    0.00000    0.00000    0.00000    0.00000
     2    0.00000    0.00000    0.00000    0.00000    0.00000
     3    2.30000    0.00000    0.00000    0.00000    0.00000
     4    0.00000    0.00000    0.00000    0.00000    0.00000
     5    0.00000    0.00000    0.00000    0.00000    0.00000
     6    0.00000    0.00000    0.00000    0.00000    0.00000
     7    0.00000    0.00000    0.00000    0.00000    0.00000
     8    0.00000    0.00000    0.00000    0.00000    0.00000
     9   12.80000    0.00000    0.00000    4.57200    0.00000
    10    0.00000    0.00000    0.00000    0.00000    0.00000
    11    0.00000    0.00000    0.00000    0.00000    0.00000
    12    0.00000    0.00000    0.00000    0.00000    0.00000
    13    0.00000    0.00000    0.00000    0.00000    0.00000
    14    0.00000    0.00000    0.00000    0.00000    0.00000
    15    0.00000    0.00000    0.00000    0.00000    0.00000
    16    0.00000    0.00000    0.00000    0.00000    0.00000
    17    0.00000    0.00000    0.00000    0.00000    0.00000
    18    0.00000    0.00000    0.00000    0.00000    0.00000
    19    0.00000    0.00000    0.00000    0.00000    0.00000
    20    0.00000    0.00000    0.00000    0.00000    0.00000
    21    0.00000    0.00000    0.00000    0.00000    0.00000
    22    0.00000    0.00000    0.00000    0.00000    0.00000
    23    0.00000    0.00000    0.00000    0.00000    0.00000
    24    0.00000    0.00000    0.00000    0.00000    0.00000
    25    0.00000    0.00000    0.00000    0.00000    0.00000
    26    0.00000    0.00000    0.00000    0.00000    0.00000
    27    0.00000    0.00000    0.00000    0.00000    0.00000
    28    0.00000    0.00000    0.00000    0.00000    0.00000
    29   32.30000    0.00001    0.00001    4.57200    0.00000
    30    0.00000    0.00000    0.00000    0.00000    0.00000
    31    0.00000    0.00000    0.00000    0.00000    0.00000
 #
 #       Minimum/Maximum values:
 #
     1    0.00000    0.00000    0.00000    0.00000    0.00000
    63   32.30000    0.00001    0.00001    4.57200    0.00000

----- end file -----

Note: Spaces in the first column are real.

I would like to read in a data.frame containing only the data between:

" #
 #        Daily values:
 #"
and 
" #
 #       Minimum/Maximum values:
 #"

but the number of columns in the dataset will vary.  The information 
describing how it varies is contained in the sections:

" char    HillVarNames[ 3 ]
         {Days In Simulation}                         
         {Hillslope: Precipitation (mm)}              
         {Hillslope: Average detachment (kg/m**2)}"
and 

" char    OFEVarNames[ 3 ]
         {Irrigation depth (mm)}                      
         {Irrigation_volume_supplied/unit_area (mm)}  
         {Runoff (mm)}"

The number of columns is the sum of the HillVarNames and OFEVarNames counts
(6), and the column labels are the names listed under each heading.

Depending on options in the model run that generates this file, the number
of columns can change.  But I would like to write a function that reads the
file and makes a data.frame with two columns, day and runoff, in this case
columns 1 and 6 in the file.  If I can parse the variable names into a vector,
I can determine which elements hold {Days In Simulation} and {Runoff (mm)}, but
I am having trouble finding a function that will let me read in parts of the
file and use information gathered along the way to direct additional reading.

The procedure I envision will look like this:

(1) skip first 9 lines
(2) read 3rd word in next line and assign to variable hillvarnames
(3) read hillvarnames more lines
(4) test which line has the value {Days In Simulation} and assign index to
daycolumn.
(5) skip 3 lines
(6) read 3rd word in next line and assign to variable ofevarnames
(7) read ofevarnames more lines
(8) test which line has the value {Runoff (mm)} and assign
index+hillvarnames to runoffcolumn.
(9) skip 3 lines
(10) read lines until 5 lines remain and assign the values in the daycolumn
and runoffcolumn columns to a data.frame with columns day and runoff.
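
For reference, the ten steps above can be sketched in present-day R roughly
as follows.  This is a minimal sketch, not code from the thread:
read.hill.steps, read.names, and the abbreviated sample file (first three
days only) are hypothetical, and step (10) stops at the next comment line
instead of counting the lines that remain.

```r
## Hypothetical helper: walks the steps with an open connection, so each
## readLines() call resumes where the previous one stopped.
read.hill.steps <- function (path) {
    con <- file (path, open = "r")
    on.exit (close (con))
    ## Read a ' char Name[ n ]' header line and the n labels that follow it
    read.names <- function () {
        header <- readLines (con, n = 1)
        n <- as.integer (sub (".*\\[ *([0-9]+) *\\].*", "\\1", header))
        trimws (readLines (con, n = n))
    }
    readLines (con, n = 9)                                  # (1)
    hill <- read.names ()                                   # (2)-(3)
    daycolumn <- match ("{Days In Simulation}", hill)       # (4)
    readLines (con, n = 3)                                  # (5)
    ofe <- read.names ()                                    # (6)-(7)
    runoffcolumn <- length (hill) + match ("{Runoff (mm)}", ofe)  # (8)
    readLines (con, n = 3)                                  # (9)
    ## (10) collect data lines until the next comment line (assumption)
    rows <- character (0)
    repeat {
        line <- readLines (con, n = 1)
        if (length (line) == 0 || grepl ("^ *#", line)) break
        rows <- c (rows, line)
    }
    values <- read.table (text = rows)
    data.frame (day = values[[daycolumn]], runoff = values[[runoffcolumn]])
}

## Abbreviated copy of the file shown above, written to a temporary file
sample.lines <- c (
    " #", " #       Output File", " #",
    " float   Version      2002.700000000000",
    " int     Numdays         31",
    " int     NumOFEs          1",
    " #", " #       Hillslope-specific variables", " #",
    " char    HillVarNames[ 3 ]",
    "         {Days In Simulation}",
    "         {Hillslope: Precipitation (mm)}",
    "         {Hillslope: Average detachment (kg/m**2)}",
    " #", " #       OFE-specific variables", " #",
    " char    OFEVarNames[ 3 ]",
    "         {Irrigation depth (mm)}",
    "         {Irrigation_volume_supplied/unit_area (mm)}",
    "         {Runoff (mm)}",
    " #", " #       Daily values:", " #",
    "     1    5.40000    0.00000    0.00000    0.00000    0.00000",
    "     2    0.00000    0.00000    0.00000    0.00000    0.00000",
    "     3    2.30000    0.00000    0.00000    0.00000    0.00000",
    " #", " #       Minimum/Maximum values:", " #",
    "     1    0.00000    0.00000    0.00000    0.00000    0.00000",
    "    63   32.30000    0.00001    0.00001    4.57200    0.00000")
path <- tempfile ()
writeLines (sample.lines, path)
result <- read.hill.steps (path)
```

The point of keeping the connection open is that nothing has to be re-read:
the counts parsed from the two "char" lines decide how many lines the next
readLines() call consumes.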

Is this a reasonable thing to do in R?  Are there some functions that 
will make this task less difficult?  Is there a function that allows you to 
read a small amount of information, parse it, test it, and then begin reading 
again where it left off?

I am using the following R version:
         _              
platform i386-pc-mingw32
arch     i386           
os       mingw32        
system   i386, mingw32  
status                  
major    1              
minor    6.1            
year     2002           
month    11             
day      01             
language R              

Thank you in advance.

With best wishes and kind regards I am

Sincerely,

Corey A. Moffet
Support Scientist

University of Idaho
Northwest Watershed Research Center
800 Park Blvd, Plaza IV, Suite 105
Boise, ID 83712-7716
(208) 422-0718
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
On Tue, 19 Nov 2002, Corey Moffet wrote:

            
That's what connections and pushbacks are for.

?connection
?pushBack
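
A minimal sketch of how an open connection and pushBack() might be combined
here.  The sample lines are taken from the file in the question; the variable
names and the "stop at a comment line" rule are assumptions made for
illustration.

```r
## A text connection stands in for the real file (illustrative only)
con <- textConnection (c (
    " #       Daily values:",
    " #",
    "     1    5.40000    0.00000    0.00000    0.00000    0.00000",
    "     2    0.00000    0.00000    0.00000    0.00000    0.00000",
    " #",
    " #       Minimum/Maximum values:"))
invisible (readLines (con, n = 2))   # consume the two header lines
rows <- character (0)
repeat {
    line <- readLines (con, n = 1)
    if (length (line) == 0) break    # end of input
    if (grepl ("^ *#", line)) {
        ## Read one line too far: push it back so the next reader sees it
        pushBack (line, con)
        break
    }
    rows <- c (rows, line)
}
nextline <- readLines (con, n = 1)   # the pushed-back comment line again
close (con)
```

Reading resumes where the previous call stopped, and pushBack() lets a parser
look at a line, decide it belongs to the next section, and hand it back.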
#
On Tue, 19 Nov 2002, Corey Moffet stated:
This function seems to work, on your sample file at least,

read.hill <- function (file)
{
    lines <- scan (file, what = "", sep = "\n", quiet = TRUE)
    ## Get the line starting with ' char'
    chars <- grep ("^ char", lines)
    ## Get the number of columns
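    ## Get the number of columns (one count per ' char' line)
    ncols <- unlist (get.numbers (lines[chars]), use.names = FALSE)
    ## Get the column labels: the ncols[i] lines after each ' char' line;
    ## sequence () also copes with sections of unequal length
    labels <- lines[rep (chars, ncols) + sequence (ncols)]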
    ##
    days.col <- grep ("Days", labels)
    runoff.col <- grep ("Runoff", labels)
    ## Get the numbers 
    toSkip <- grep ("Daily values", lines) + 1
    toRead  <- grep ("Minimum/Maximum", lines) - 2 - toSkip
    temp <- unlist (strsplit (lines[(toSkip+1):(toSkip+toRead)],
                              split = " +"))
    ## There are some "" at the first column
    temp <- matrix (temp, ncol = length (labels) + 1, byrow = TRUE)
    data.frame (days = as.numeric (temp[, days.col + 1]),
                runoff = as.numeric (temp[, runoff.col + 1]))
}

get.numbers () is a function that I wrote to extract numbers from a character
vector that match a certain pattern.

get.numbers <- function (ss, pattern, ignore.case = FALSE) {
    if (!missing (pattern)) {
        ss <- grep (pattern, x = ss, ignore.case = ignore.case,
                    value = TRUE)
    }
    if (length (ss) == 0) {
        return (NULL)
    }
    ## split at non numeric, non-dot characters and two or more dots
    ## FIXME: this is not the optimal split
    token <- strsplit (ss, split = "([^-+.0-9]|--+|\\+\\++|\\.\\.+| \t)")
    ## remove any trailing '.'
    token <- lapply (token, function (x) sub ("\\.$", "", x))
    ## remove empty strings and convert to numeric
    token <- lapply (token, function (x) {
        as.numeric (x[sapply (x, function (y) y != "")])
    })
    if (is.null (names (ss))) {
        names (token) <- ss
    } else {
        names (token) <- names (ss)
    }
    token
}

As a test:
days runoff
1     1      0
2     2      0
3     3      0
4     4      0
5     5      0
6     6      0
7     7      0
8     8      0
9     9      0
10   10      0
11   11      0
12   12      0
13   13      0
14   14      0
15   15      0
16   16      0
17   17      0
18   18      0
19   19      0
20   20      0
21   21      0
22   22      0
23   23      0
24   24      0
25   25      0
26   26      0
27   27      0
28   28      0
29   29      0
30   30      0
31   31      0

As Jason pointed out, Perl might be more suitable to this job.  However, I do
like using R to parse many weird files.  I find maintaining R scripts much
easier than Perl and it is often more convenient to read a file directly into
R. 

It would be nice to have more powerful regex in R, such as returning matched
substring grouped with "()".

Michael
#
On Tue, 19 Nov 2002, Michael Na Li wrote:

            
I think you are overlooking the power of gsub.  You can certainly do that.
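
One way to read this suggestion, as a minimal sketch: match the whole string
and replace it with a backreference to the parenthesised group.  The pattern
and variable names below are assumptions for illustration; the header line
is the one from the file earlier in the thread.

```r
header <- " char    HillVarNames[ 3 ]"
## Group 1 is the variable name, group 2 is the count in brackets
pattern <- "^ *char +([A-Za-z]+)\\[ *([0-9]+) *\\].*$"
## Replacing the entire match by "\\1" or "\\2" leaves only that group
name  <- sub (pattern, "\\1", header)
count <- as.integer (sub (pattern, "\\2", header))
```

Note that sub() returns its input unchanged when the pattern does not match,
so a real parser would want to check for a match first (for example with
grepl()).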
#
On Tue, 19 Nov 2002, ripley at stats.ox.ac.uk verbalised:
I want something like:
[[1]]
[1] "30" "80"

I'm not sure how to achieve this with 'gsub'.

The best I can come up with is:

regex.match <- function (pattern, x) {
    a <- strsplit (gsub(pattern, "*| \\1 |*", x), split = "\\*")
    b <- lapply (a, function (x) x[grep ("^\\|.*\\|", x)])
    lapply (b, function (x) {
        temp <- unlist (strsplit (x, split = " *\\| *"))
        temp[temp != ""]
    })
}
[[1]]
[1] "30" "80"

Unfortunately it is not very useful: it breaks down when there are two "()"
expressions or none, for instance.
[[1]]
[1] "30"
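
For comparison, later releases of R added regexec() and regmatches() to the
base package, which return the parenthesised groups directly.  A minimal
sketch follows; the input strings and the pattern are assumptions, since the
original call is not shown in the post.

```r
x <- c ("between 30 and 80", "no digits here")
pattern <- "([0-9]+) and ([0-9]+)"
## Each element holds the full match followed by one entry per "()" group;
## a string that does not match gives character(0)
m <- regmatches (x, regexec (pattern, x))
## Drop the full match to keep only the groups
groups <- lapply (m, function (g) g[-1])
```

Because the groups come back as one vector per input string, two "()"
expressions, or a string with no match at all, need no special handling.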

Michael
#
Hi,

 | Date: Tue, 19 Nov 2002 11:11:41 -0700
 | From: Corey Moffet <cmoffet at nwrc.ars.usda.gov>
 | 
 | I have a generated file that looks like the following:
 | 
 | ----- Begin file -----
 |  #
 |  #       Output File
 |  #
 |  float   Version      2002.700000000000

As I understand it, you generated the file yourself using different
software.  In that case I strongly recommend considering XML as the
format of the data file.  It allows much more flexible parsing, and the
reading routines need only simple changes if you change the format of
the data.  Many programs have ready-made XML parsers, among them R
(package XML) and Perl.  I have used XML with success while
transferring complicated estimation results from SAS and GAUSS to R.

Just a suggestion.

Ott