This looks possible in R, and your algorithm looks precise enough.
The help pages for file, readLines, and scan should cast some
light.
For jobs like this I tend to use Perl, however. Familiarity
is one reason: I'm more comfortable with Perl for scanning/parsing
files. Also, Perl was originally written for exactly this sort of
thing.
Cheers
Jason
Indigo Industrial Controls Ltd.
64-21-343-545
jasont at indigoindustrial.co.nz
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Dear R-Help,
I have a generated file that looks like the following:
----- Begin file -----
#
# Output File
#
float Version 2002.700000000000
int Numdays 31
int NumOFEs 1
#
# Hillslope-specific variables
#
char HillVarNames[ 3 ]
{Days In Simulation}
{Hillslope: Precipitation (mm)}
{Hillslope: Average detachment (kg/m**2)}
#
# OFE-specific variables
#
char OFEVarNames[ 3 ]
{Irrigation depth (mm)}
{Irrigation_volume_supplied/unit_area (mm)}
{Runoff (mm)}
#
# Daily values:
#
1 5.40000 0.00000 0.00000 0.00000 0.00000
2 0.00000 0.00000 0.00000 0.00000 0.00000
3 2.30000 0.00000 0.00000 0.00000 0.00000
4 0.00000 0.00000 0.00000 0.00000 0.00000
5 0.00000 0.00000 0.00000 0.00000 0.00000
6 0.00000 0.00000 0.00000 0.00000 0.00000
7 0.00000 0.00000 0.00000 0.00000 0.00000
8 0.00000 0.00000 0.00000 0.00000 0.00000
9 12.80000 0.00000 0.00000 4.57200 0.00000
10 0.00000 0.00000 0.00000 0.00000 0.00000
11 0.00000 0.00000 0.00000 0.00000 0.00000
12 0.00000 0.00000 0.00000 0.00000 0.00000
13 0.00000 0.00000 0.00000 0.00000 0.00000
14 0.00000 0.00000 0.00000 0.00000 0.00000
15 0.00000 0.00000 0.00000 0.00000 0.00000
16 0.00000 0.00000 0.00000 0.00000 0.00000
17 0.00000 0.00000 0.00000 0.00000 0.00000
18 0.00000 0.00000 0.00000 0.00000 0.00000
19 0.00000 0.00000 0.00000 0.00000 0.00000
20 0.00000 0.00000 0.00000 0.00000 0.00000
21 0.00000 0.00000 0.00000 0.00000 0.00000
22 0.00000 0.00000 0.00000 0.00000 0.00000
23 0.00000 0.00000 0.00000 0.00000 0.00000
24 0.00000 0.00000 0.00000 0.00000 0.00000
25 0.00000 0.00000 0.00000 0.00000 0.00000
26 0.00000 0.00000 0.00000 0.00000 0.00000
27 0.00000 0.00000 0.00000 0.00000 0.00000
28 0.00000 0.00000 0.00000 0.00000 0.00000
29 32.30000 0.00001 0.00001 4.57200 0.00000
30 0.00000 0.00000 0.00000 0.00000 0.00000
31 0.00000 0.00000 0.00000 0.00000 0.00000
#
# Minimum/Maximum values:
#
1 0.00000 0.00000 0.00000 0.00000 0.00000
63 32.30000 0.00001 0.00001 4.57200 0.00000
----- end file -----
Note: Spaces in the first column are real.
I would like to read in a data.frame containing only the data between:
" #
# Daily values:
#"
and
" #
# Minimum/Maximum values:
#"
but the number of columns in the dataset will vary. The information
describing how it varies is contained in the sections:
" char HillVarNames[ 3 ]
{Days In Simulation}
{Hillslope: Precipitation (mm)}
{Hillslope: Average detachment (kg/m**2)}"
and
" char OFEVarNames[ 3 ]
{Irrigation depth (mm)}
{Irrigation_volume_supplied/unit_area (mm)}
{Runoff (mm)}"
The number of columns is the sum of the HillVarNames and OFEVarNames counts
(here 3 + 3 = 6), and the column labels are those listed beneath each
declaration.
Depending on options in the model run that generates this file, the number
of columns can change. I would like to write a function that reads the file
and makes a data.frame with two columns, day and runoff (in this case
columns 1 and 6 in the file). If I can parse the variable names into a
vector, I can determine which elements hold {Days In Simulation} and
{Runoff (mm)}, but I am having trouble finding a function that will let me
read in parts of the file and use information gathered along the way to
direct additional reading.
The procedure I envision looks like this:
(1) Skip the first 9 lines.
(2) Read the 3rd word of the next line and assign it to hillvarnames.
(3) Read hillvarnames more lines.
(4) Test which of those lines has the value {Days In Simulation} and assign
    its index to daycolumn.
(5) Skip 3 lines.
(6) Read the 3rd word of the next line and assign it to ofevarnames.
(7) Read ofevarnames more lines.
(8) Test which of those lines has the value {Runoff (mm)} and assign its
    index + hillvarnames to runoffcolumn.
(9) Skip 3 lines.
(10) Read lines until 5 lines remain, and assign the values in the daycolumn
     and runoffcolumn columns to a data.frame with columns day and runoff.
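Steps (1)-(4) of this procedure can be sketched in R with readLines() on a
connection. A textConnection stands in for the real file below, holding the
header lines copied from the sample; note that because of the real leading
space on the " char" line, the "3rd word" is the 4th space-split token.

```r
## Sketch of steps (1)-(4): skip the header, read the HillVarNames count,
## then read that many label lines and locate {Days In Simulation}.
txt <- c("#", "# Output File", "#",
         "float Version 2002.700000000000",
         "int Numdays 31", "int NumOFEs 1",
         "#", "# Hillslope-specific variables", "#",
         " char HillVarNames[ 3 ]",
         "{Days In Simulation}",
         "{Hillslope: Precipitation (mm)}",
         "{Hillslope: Average detachment (kg/m**2)}")
con <- textConnection(txt)                 # stands in for the real file
invisible(readLines(con, n = 9))           # (1) skip the first 9 lines
decl <- readLines(con, n = 1)              # " char HillVarNames[ 3 ]"
## leading space means the first token is "", so the count is token 4
hillvarnames <- as.integer(strsplit(decl, " +")[[1]][4])      # (2)
vars <- readLines(con, n = hillvarnames)   # (3) read that many labels
daycolumn <- grep("Days In Simulation", vars, fixed = TRUE)   # (4)
close(con)
```

The remaining steps repeat the same pattern for OFEVarNames and the data
block.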
Is this a reasonable thing to do in R? Are there some functions that
will make this task less difficult? Is there a function that allows you to
read a small amount of information, parse it, test it, and then begin
reading again where it left off?
I am using the following R version:
_
platform i386-pc-mingw32
arch i386
os mingw32
system i386, mingw32
status
major 1
minor 6.1
year 2002
month 11
day 01
language R
Thank you in advance.
With best wishes and kind regards I am
Sincerely,
Corey A. Moffet
Support Scientist
University of Idaho
Northwest Watershed Research Center
800 Park Blvd, Plaza IV, Suite 105
Boise, ID 83712-7716
(208) 422-0718
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Is this a reasonable thing to do in R? Are there some functions that
will make this task less difficult? Is there a function that allows you to
read a small amount of information, parse it, test it, and then begin
reading again where it left off?
That's what connections and pushbacks are for.
?connection
?pushBack
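A minimal sketch of that combination (the data here are invented for
illustration, not taken from the poster's file): read a line, look ahead,
push the look-ahead back, and let scan() resume where readLines() left off.

```r
## Sketch: mixing readLines() and scan() on one connection via pushBack().
con <- textConnection(c("header", "3", "1 2 3"))
hdr  <- readLines(con, n = 1)           # consume "header"
peek <- readLines(con, n = 1)           # look ahead: "3"
pushBack(peek, con)                     # put the line back on the connection
n    <- scan(con, n = 1, quiet = TRUE)  # re-reads the pushed-back "3"
vals <- scan(con, n = n, quiet = TRUE)  # reads the next n numbers
close(con)
```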
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Dear R-Help,
I have a generated file that looks like the following:
....
Is this a reasonable thing to do in R? Are there some functions that will
make this task less difficult? Is there a function that allows you to read
a small amount of information, parse it, test it, and then begin reading
again where it left off?
This function seems to work, on your sample file at least:

read.hill <- function (file)
{
    lines <- scan (file, what = "", sep = "\n", quiet = TRUE)
    ## Get the lines starting with ' char'
    chars <- grep ("^ char", lines)
    ## Get the number of columns
    ncols <- get.numbers (lines[chars])
    ## Get the column labels
    labels <- lines[rep (chars, ncols) +
                    as.vector (sapply (ncols, seq, from = 1))]
    ##
    days.col <- grep ("Days", labels)
    runoff.col <- grep ("Runoff", labels)
    ## Get the numbers
    toSkip <- grep ("Daily values", lines) + 1
    toRead <- grep ("Minimum/Maximum", lines) - 2 - toSkip
    temp <- unlist (strsplit (lines[(toSkip+1):(toSkip+toRead)],
                              split = " +"))
    ## There are some "" at the first column
    temp <- matrix (temp, ncol = length (labels) + 1, byrow = TRUE)
    data.frame (days = as.numeric (temp[, days.col + 1]),
                runoff = as.numeric (temp[, runoff.col + 1]))
}
get.numbers() is a function that I wrote to extract numbers from a
character vector, optionally keeping only the elements that match a given
pattern.
get.numbers <- function (ss, pattern, ignore.case = FALSE) {
    if (!missing (pattern)) {
        ss <- grep (pattern, x = ss, ignore.case = ignore.case,
                    extended = TRUE, value = TRUE)
    }
    if (length (ss) == 0) {
        return (NULL)
    }
    ## split at non-numeric, non-dot characters and two or more dots
    ## FIXME: this is not the optimal split
    token <- strsplit (ss, split = "([^-+.0-9]|--+|\\+\\++|\\.\\.+| \t)")
    ## remove any trailing '.'
    token <- lapply (token, function (x) sub ("\\.$", "", x))
    ## remove empty strings and convert to numeric
    token <- lapply (token, function (x) {
        as.numeric (x[sapply (x, function (y) y != "")])
    })
    if (is.null (names (ss))) {
        names (token) <- ss
    } else {
        names (token) <- names (ss)
    }
    token
}
As a test:
read.hill ("hillslope.dat")
days runoff
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 0
14 14 0
15 15 0
16 16 0
17 17 0
18 18 0
19 19 0
20 20 0
21 21 0
22 22 0
23 23 0
24 24 0
25 25 0
26 26 0
27 27 0
28 28 0
29 29 0
30 30 0
31 31 0
As Jason pointed out, Perl might be more suitable for this job. However, I
do like using R to parse many weird files. I find R scripts much easier to
maintain than Perl, and it is often more convenient to read a file directly
into R.
It would be nice to have more powerful regex support in R, such as
returning the substrings matched by "()" groups.
Michael
----------------------------------------------------------------------------
Michael Na Li
Email: lina at u.washington.edu
Department of Biostatistics, Box 357232
University of Washington, Seattle, WA 98195
---------------------------------------------------------------------------
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
It would be nice to have more powerful regex support in R, such as
returning the substrings matched by "()" groups.
I think you are overlooking the power of gsub. You can certainly do that.
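For a single match, for instance, a backreference in sub()/gsub() does the
extraction; the pattern below is illustrative, applied to a line from the
sample file. Everything outside the group is matched by the anchored
pattern and replaced away.

```r
## Extract the bracketed count with a gsub() backreference.
decl  <- " char HillVarNames[ 3 ]"
count <- gsub("^.*\\[ *([0-9]+) *\\].*$", "\\1", decl)
```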
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
On Tue, 19 Nov 2002, ripley at stats.ox.ac.uk verbalised:
On Tue, 19 Nov 2002, Michael Na Li wrote:
It would be nice to have more powerful regex support in R, such as
returning the substrings matched by "()" groups.
I think you are overlooking the power of gsub. You can certainly do that.
I want something like:
REGEXFUN ("abc ([0-9]+)", "abc 30 and ABC 40 and abc 80")
[[1]]
[1] "30" "80"
I'm not sure how to achieve this with 'gsub'.
The best I can come up with is:
regex.match <- function (pattern, x) {
    a <- strsplit (gsub (pattern, "*| \\1 |*", x), split = "\\*")
    b <- lapply (a, function (x) x[grep ("^\\|.*\\|", x)])
    lapply (b, function (x) {
        temp <- unlist (strsplit (x, split = " *\\| *"))
        temp[temp != ""]
    })
}
regex.match ("abc ([0-9]+)", "abc 30 and ABC 40 and abc 80")
[[1]]
[1] "30" "80"
Unfortunately it is not very robust: it breaks down when the pattern
contains two "()" groups, or none. For instance:
regex.match ("abc ([0-9]+) and ABC ([0-9+])", "abc 30 and ABC 40 and abc 80")
[[1]]
[1] "30"
Michael
----------------------------------------------------------------------------
Michael Na Li
Email: lina at u.washington.edu
Department of Biostatistics, Box 357232
University of Washington, Seattle, WA 98195
---------------------------------------------------------------------------
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Hi,
| Date: Tue, 19 Nov 2002 11:11:41 -0700
| From: Corey Moffet <cmoffet at nwrc.ars.usda.gov>
|
| I have a generated file that looks like the following:
|
| ----- Begin file -----
| #
| # Output File
| #
| float Version 2002.700000000000
As I understand it, you generated this file yourself with another program.
In that case I strongly recommend considering XML as the format of the data
file. It allows much more flexible parsing, and the reading routines need
only simple changes if the data format changes. Many programs have
ready-made XML parsers, among them R (package XML) and Perl. I have used
XML successfully to transfer complicated estimation results from SAS and
GAUSS to R.
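For instance, the header of the sample file might be encoded along these
lines (the element and attribute names here are purely illustrative, not an
existing schema):

```xml
<!-- Hypothetical XML encoding of the sample file's header -->
<output version="2002.7" numdays="31" numofes="1">
  <hillvarnames>
    <var>Days In Simulation</var>
    <var>Hillslope: Precipitation (mm)</var>
    <var>Hillslope: Average detachment (kg/m**2)</var>
  </hillvarnames>
</output>
```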
Just a suggestion.
Ott
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._