Which function to use: grep, replace, substr etc.?
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
Sent: Sunday, October 16, 2011 1:59 PM
To: Jeff Newmiller
Cc: r-help at r-project.org; syrvn
Subject: Re: [R] Which function to use: grep, replace, substr etc.?

On Oct 16, 2011, at 1:32 PM, Jeff Newmiller wrote:
Note that "male" comes before "female" in your data frame.
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
syrvn <mentor_ at gmx.net> wrote:
Hi,
Thanks for the tip! I do it as follows now, but I still have a problem I do not understand:
abbrvs <- data.frame(c("peter", "name", "male", "female"),
                     c("P", "N", "m", "f"))
colnames(abbrvs) <- c("pattern", "replacement")
str <- "My name is peter and I am male"
for(m in 1:nrow(abbrvs)) {
    str <- sub(abbrvs$pattern[m], abbrvs$replacement[m], str, fixed=TRUE)
    print(str)
}
This works perfectly fine as I get: "My N is P and I am m"
However, when I replace "male" by "female" in the string, I get the following: "My N is P and I am fem",
but I want to have "My N is P and I am f".
Even with the parameter fixed=TRUE I get the same result. Why is that?
Because "male" is in "female"? This reminds me of a comment on a posting I made this morning on SO:
http://stackoverflow.com/questions/7782113/counting-keyword-occurrences-in-r
The problem was slightly different, but the greppish principle was that in order to match only complete words, you need to specify "^", "$" or " " at each end of the word:
> dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
> grep("^corn$|^corn | corn$", dataset)
[1] 1 3
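The same effect can be reproduced on the sentence from the question. This is only a minimal sketch: the names `sentence` and `whole_male` are illustrative and do not appear in the thread, and the pattern assumes words are separated by single spaces.

```r
sentence <- "My name is peter and I am female"

# fixed=TRUE only switches off regular-expression syntax;
# the pattern is still matched as a substring.
grepl("male", "female", fixed = TRUE)       # TRUE
sub("male", "m", sentence, fixed = TRUE)    # "My name is peter and I am fem"

# Requiring a space or a string end on each side leaves "female" alone.
whole_male <- "^male$|^male | male$| male "
grepl(whole_male, sentence)                 # FALSE
grepl(whole_male, "I am male")              # TRUE
```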
You can use the two-character sequences "\\<" and "\\>" to match
the beginning and end of a "word" (each match is zero-width, so it
consumes no characters):
> dataset <- c("corn", "cornmeal", "corn on the cob", "popcorn", "this corn is sweet")
> grep("^corn$|^corn | corn$", dataset)
[1] 1 3
> grep("\\<corn\\>", dataset)
[1] 1 3 5
> gsub("\\<corn\\>", "CORN", dataset)
[1] "CORN"
[2] "cornmeal"
[3] "CORN on the cob"
[4] "popcorn"
[5] "this CORN is sweet"
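Applied to the abbreviation table from the original question, the word-boundary approach might look like the sketch below. The helper name `abbreviate_words` is hypothetical, the column names are set directly in data.frame() rather than with colnames(), and the patterns are assumed to contain no regular-expression metacharacters.

```r
abbrvs <- data.frame(pattern     = c("peter", "name", "male", "female"),
                     replacement = c("P", "N", "m", "f"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: replace each whole-word pattern by its abbreviation.
abbreviate_words <- function(x, table) {
    for (i in seq_len(nrow(table))) {
        # \\< and \\> are zero-width, so surrounding spaces are left untouched
        patt <- paste0("\\<", table$pattern[i], "\\>")
        x <- gsub(patt, table$replacement[i], x)
    }
    x
}

abbreviate_words("My name is peter and I am female", abbrvs)
# [1] "My N is P and I am f"
abbreviate_words("My name is peter and I am male", abbrvs)
# [1] "My N is P and I am m"
```

Because the boundaries consume no characters, the replacements need no padding spaces, and the order of "male" and "female" in the table no longer matters. Note that gsub() replaces every occurrence, whereas the loop in the question used sub().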
If your definition of a "word" is more expansive it gets complicated.
E.g., if words might include letters, numbers, and periods but not
underscores or anything else, you could use:
> gsub("(^|[^.[:alpha:][:digit:]])corn($|[^.[:alpha:][:digit:]])",
+      "\\1CORN.BY.ITSELF\\2",
+      c("corn.1", "corn_2", " corn", "4corn", "1.corn"))
[1] "corn.1"
[2] "CORN.BY.ITSELF_2"
[3] " CORN.BY.ITSELF"
[4] "4corn"
[5] "1.corn"
Moving to perl regular expressions would probably make this simpler.
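One way the Perl version could look, as a sketch only: it assumes the same definition of a word (letters, digits, and periods) and uses perl=TRUE with a lookbehind and a lookahead. The variable names `x` and `res` are illustrative.

```r
x <- c("corn.1", "corn_2", " corn", "4corn", "1.corn")

# (?<![.[:alnum:]]) : not preceded by a letter, digit, or period
# (?![.[:alnum:]])  : not followed by a letter, digit, or period
res <- gsub("(?<![.[:alnum:]])corn(?![.[:alnum:]])", "CORN.BY.ITSELF",
            x, perl = TRUE)
print(res)
# [1] "corn.1"           "CORN.BY.ITSELF_2" " CORN.BY.ITSELF"  "4corn"
# [5] "1.corn"
```

Since lookarounds are zero-width, the neighbouring characters are never part of the match, so no backreferences are needed in the replacement.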
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
In such cases you may want to look at the gsubfn package. It offers higher-level matching functions, and I think strapply might be more efficient and expressive here. I can imagine constructing the pattern in a loop such as yours, but you would probably want to build the pattern outside the sub() call. After struggling to fix your loop (and your data.frame, which definitely should not be using factor variables), I am even more convinced you should be learning the "gsubfn" facilities. (Take out the debugging print statements.)
> abbrvs <- data.frame(c("peter", "name", "male", "female"),
+ c(" P ", " N ", " m ", " f "), stringsAsFactors=FALSE)
>
> colnames(abbrvs) <- c("pattern", "replacement")
> for(m in 1:nrow(abbrvs)) {
+     patt <- paste("^", abbrvs$pattern[m], "$| ",
+                   abbrvs$pattern[m], " | ",
+                   abbrvs$pattern[m], "$", sep="")
+     print(c(patt, abbrvs$replacement[m]))
+     str <- sub(patt, abbrvs$replacement[m], str)
+     print(str)
+ }
[1] "^peter$| peter | peter$" " P "
[1] "My name is P and I am female"
[1] "^name$| name | name$" " N "
[1] "My N is P and I am female"
[1] "^male$| male | male$" " m "
[1] "My N is P and I am female"
[1] "^female$| female | female$" " f "
[1] "My N is P and I am f "
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.