extracting data using strings as delimiters

7 messages · lucy b, Katharine Mullen, Marc Schwartz +3 more

#
Dear List,

I have an ascii text file with data I'd like to extract. Example:

Year Built:  1873 Gross Building Area:  578 sq ft
Total Rooms:  6 Living Area:  578 sq ft

There is a lot of data I'd like to ignore in each record, so I'm
hoping there is a way to use strings as delimiters to get the data I
want (e.g. tell R to take data between "Built:" and "Gross" -
incidentally, not always numeric). I think an ugly way would be to
start at the end of each record and use a substitution expression to
chip away at it, but I'm afraid it will take forever to run. Is there
a way to use strings as delimiters in an expression?

Thanks in advance for ideas.

LB
#
have you seen help(strsplit)?
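
For what it's worth, strsplit() takes a regular expression as its split argument, so several label strings can act as delimiters at once. A minimal sketch, assuming the labels are known in advance (the names rec, parts and vals are only illustrative):

```r
# One record from the example in the question
rec <- "Year Built:  1873 Gross Building Area:  578 sq ft"

# Treat each label (and the trailing unit) as a delimiter
parts <- strsplit(rec, "Year Built:|Gross Building Area:|sq ft")[[1]]

# Strip the surrounding spaces and drop the empty piece that
# precedes the first delimiter
vals <- trimws(parts)
vals <- vals[nzchar(vals)]
vals
```

The values come back as character strings, so this also works for fields that are not numeric; convert with as.numeric() only where that makes sense.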
On Tue, 25 Sep 2007, lucy b wrote:

            
#
On Tue, 2007-09-25 at 16:39 -0400, lucy b wrote:
I don't know that any of the default base functions enable the use of a
regex as a delimiter. If your text file is consistent in the use of the
colon ':' as a separator, you might be able to use that. Each of the
above lines then would be broken into 3 fields using:

DF <- read.table("YourFile.txt", sep = ":")
DF
           V1                         V2          V3
1  Year Built   1873 Gross Building Area   578 sq ft
2 Total Rooms              6 Living Area   578 sq ft


You could then parse them further using appropriate functions if needed,
such as gsub():
    V2  V3
1 1873 578
2    6 578


This now gives you the numeric data in two columns. For further
processing, you would need to know that the rows follow some predictable
(perhaps alternating) order.  See ?gsub and ?regex for more
information.
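
The gsub() step might look something like the sketch below. It is not necessarily the exact code behind the output above: it assumes the two example lines are supplied inline through read.table's text argument rather than read from YourFile.txt, and it strips every non-digit character, so it only suits fields that are purely numeric.

```r
# Assumption: the two example records, given inline instead of as a file
txt <- c("Year Built:  1873 Gross Building Area:  578 sq ft",
         "Total Rooms:  6 Living Area:  578 sq ft")

# Split each line into three fields at the colons
DF <- read.table(text = txt, sep = ":", stringsAsFactors = FALSE)

# Remove everything that is not a digit from V2 and V3, then convert
DF[, 2:3] <- lapply(DF[, 2:3],
                    function(col) as.numeric(gsub("[^0-9]", "", col)))
DF[, 2:3]
```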

Hope that provides some help. You also might want to look at ?readLines
and ?strsplit as other ways to read in the data and then post-process it
once in an R object.

Marc Schwartz
#
Here is one way.  You can set up a list of the patterns to match
against and then apply it to the string.  I am not sure what the rest
of the text file looks like, but this will return all the values that
match.

x <- readLines(textConnection("Year Built:  1873 Gross Building Area:  578 sq ft
Total Rooms:  6 Living Area:  578 sq ft
Year Built:  1873 Gross Building Area:  578 sq ft
Total Rooms:  6 Living Area:  578 sq ft"))
# (the 'year' pattern is inferred from the output shown below)
m.list <- list(year=".*Built:(.*)Gross.*",
    Buildarea=".*Building Area:(.*)sq ft.*",
    rooms=".*Rooms:(.*)Liv.*",
    Livingarea=".*Living Area:(.*)sq ft.*")
lapply(names(m.list), function(.pat){
    # see which lines have the desired patterns
    whichLines <- grep(m.list[[.pat]], x)
    if (length(whichLines) > 0){
        # keep only the captured text between the two delimiters
        return(list(pattern=.pat,
                    values=sub(m.list[[.pat]], "\\1", x[whichLines])))
    }
    else return(NULL)
})
[[1]]
[[1]]$pattern
[1] "year"

[[1]]$values
[1] "  1873 " "  1873 "


[[2]]
[[2]]$pattern
[1] "Buildarea"

[[2]]$values
[1] "  578 " "  578 "


[[3]]
[[3]]$pattern
[1] "rooms"

[[3]]$values
[1] "  6 " "  6 "


[[4]]
[[4]]$pattern
[1] "Livingarea"

[[4]]$values
[1] "  578 " "  578 "
On 9/25/07, lucy b <lucy.lists at gmail.com> wrote:

  
    
#
On 25-Sep-07 20:39:11, lucy b wrote:
The scope of what you're trying to achieve is not clear,
though on the basis of your examples above you'd have to
use a different separator pattern for each type of line.

For your first example, a simple method is on the lines of

gsub(".*Built:" , "",
     "Year Built:  1873 Gross Building Area:  578 sq ft")
[1] "  1873 Gross Building Area:  578 sq ft"

and then just take the first white-space-delimited field
from the result.
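
Taking the first white-space-delimited field could be done along these lines. This is only a sketch with illustrative variable names; it uses sub() rather than gsub(), since a single replacement is all that is needed:

```r
line <- "Year Built:  1873 Gross Building Area:  578 sq ft"

# Drop everything up to and including "Built:"
rest <- sub(".*Built:", "", line)

# Trim the leading spaces, split on runs of white space, keep field 1
fields <- strsplit(trimws(rest), "[[:space:]]+")[[1]]
year_built <- fields[1]
year_built
```

This only works when the wanted value is a single word; a value containing spaces would need the following label (e.g. "Gross") as a second delimiter.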

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 25-Sep-07                                       Time: 23:20:01
------------------------------ XFMail ------------------------------
#
Perhaps you could clarify what the general rule is, but assuming
that what you want is any word after a colon, it can be done with
strapply in the gsubfn package like this:

Lines <- c("Year Built:  1873 Gross Building Area:  578 sq ft",
"Total Rooms:  6 Living Area:  578 sq ft")

library(gsubfn)
strapply(Lines, ": *(\\w+)", backref = -1)

# or if each line has same number of returned words
strapply(Lines, ": *(\\w+)", backref = -1, simplify = rbind)

This matches a colon (:) followed by zero or more spaces ( *)
followed by a word ((\\w+)), and backref = -1 causes it to return
only the first backreference (i.e. the portion within parentheses)
but not the match itself.
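
For comparison, roughly the same result can be had without gsubfn. A sketch in base R, assuming the same two example lines: regmatches() and gregexpr() pull out each ": word" match, and sub() then strips the colon and spaces.

```r
Lines <- c("Year Built:  1873 Gross Building Area:  578 sq ft",
           "Total Rooms:  6 Living Area:  578 sq ft")

# All matches of a colon, optional spaces, then a word, per line
m <- regmatches(Lines, gregexpr(": *\\w+", Lines))

# Keep only the word part of each match
vals <- lapply(m, function(s) sub("^: *", "", s))

# If every line yields the same number of words, bind into a matrix
do.call(rbind, vals)
```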
On 9/25/07, lucy b <lucy.lists at gmail.com> wrote:
#
All great ideas. I tried strsplit first and it worked, but thanks everyone!

Best-
LB
On 9/25/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote: