Which function to use: grep, replace, substr etc.?

6 messages · syrvn, Jeff Newmiller, David Winsemius +1 more

#
Hello,

I have a simple question but I don't know which method is best to use for my
problem.

I have the following strings:

str1 <- "My_name_is_peter"
str2 <- "what_is_your_surname_peter"

I would like to apply the predefined abbreviations peter=p and name=n to
both strings, so that the new strings look like the following:

str1: "My_n_is_p"
str2: "what_is_your_surn_p"

Which method is the best to use for that particular problem?

syrvn

--
View this message in context: http://r.789695.n4.nabble.com/Which-function-to-use-grep-replace-substr-etc-tp3909871p3909871.html
Sent from the R help mailing list archive at Nabble.com.
#
On Oct 16, 2011, at 12:35 PM, syrvn wrote:

            
?sub  # on same page as grep

 > vec <- c(str1, str2)
 > sub("(p)eter", "\\1", vec)
[1] "My_name_is_p"           "what_is_your_surname_p"
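
The call above only handles "peter". A minimal sketch of one way to cover both abbreviations from the original question is shown here; the named vector abbr and the variable out are illustrative names, not something from the thread. It uses gsub() rather than sub() so every occurrence is replaced, and fixed=TRUE so the patterns are treated as literal text.

```r
str1 <- "My_name_is_peter"
str2 <- "what_is_your_surname_peter"
vec  <- c(str1, str2)

# Hypothetical lookup table: names are the words to find, values the abbreviations
abbr <- c(peter = "p", name = "n")

out <- vec
for (word in names(abbr)) {
  # Replace every literal occurrence of `word`, including inside longer words
  out <- gsub(word, abbr[[word]], out, fixed = TRUE)
}
print(out)
```

Because the match is a plain substring match, "surname" becomes "surn", which is what the question asks for; it also means any longer word that happens to contain a pattern is rewritten too.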
#
Hi,

thanks for the tip! I do it as follows now but I still have a problem I do
not understand:


abbrvs <- data.frame(c("peter", "name", "male", "female"),
                     c("P", "N", "m", "f"))

colnames(abbrvs) <- c("pattern", "replacement")

str <- "My name is peter and I am male"

for (m in 1:nrow(abbrvs)) {
  str <- sub(abbrvs$pattern[m], abbrvs$replacement[m], str, fixed=TRUE)
  print(str)
}
	

This works perfectly fine as I get: "My N is P and I am m"

However, when I replace "male" with "female" in the string, I get the
following: "My N is P and I am fem"

but I want to have "My N is P and I am f".

Even with the parameter fixed=TRUE I get the same result. Why is that?




#
On Oct 16, 2011, at 1:32 PM, Jeff Newmiller wrote:

            
Because "male" is in "female"? This reminds me of a comment on a
posting I made this morning on SO.
http://stackoverflow.com/questions/7782113/counting-keyword-occurrences-in-r
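
To make the cause concrete, here is a small sketch (the variable names step, ord and out are illustrative). The "male" row is applied before the "female" row, so it rewrites the tail of "female" and leaves "fem"; by the time the "female" pattern is tried there is nothing left to match. One possible workaround, shown under the assumption that substring matching is otherwise acceptable, is to apply the longest patterns first.

```r
str <- "My name is peter and I am female"

# The "male" pattern matches inside "female"
step <- sub("male", "m", str, fixed = TRUE)
print(step)

patterns     <- c("peter", "name", "male", "female")
replacements <- c("P", "N", "m", "f")

# Apply longer patterns first so "female" is handled before "male"
ord <- order(nchar(patterns), decreasing = TRUE)
out <- str
for (i in ord) {
  out <- sub(patterns[i], replacements[i], out, fixed = TRUE)
}
print(out)
```

Ordering by length only avoids clashes between the listed patterns; a pattern can still match inside an unrelated longer word, which is why matching whole words is the more robust fix.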

The problem was slightly different, but the greppish principle was
that in order to match only complete words, you need to specify "^",
"$" or " " at each end of the word:

dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
grep("^corn$|^corn | corn$", dataset)
[1] 1 3

In such cases you may want to look at the gsubfn package. It offers  
higher level matching functions and I think strapply might be more  
efficient and expressive here. I can imagine construction in a loop  
such as yours, but you would probably want to build a pattern outside  
the sub() call.

After struggling to fix your loop (and your data.frame, which
definitely should not be using factor variables), I am even more
convinced you should be learning the "gsubfn" facilities. (Take out the
debugging print statements.)

 > abbrvs <- data.frame(c("peter", "name", "male", "female"),
+                      c(" P ", " N ", " m ", " f "), stringsAsFactors=FALSE)
 > colnames(abbrvs) <- c("pattern", "replacement")
 > str <- "My name is peter and I am female"

 > for(m in 1:nrow(abbrvs)) {
+     patt <- paste("^", abbrvs$pattern[m], "$| ",
+                   abbrvs$pattern[m], " | ",
+                   abbrvs$pattern[m], "$", sep="")
+     print(c(patt, abbrvs$replacement[m]))
+     str <- sub(patt, abbrvs$replacement[m], str)
+     print(str)
+ }
[1] "^peter$| peter | peter$" " P "
[1] "My name is P and I am female"
[1] "^name$| name | name$" " N "
[1] "My N is P and I am female"
[1] "^male$| male | male$" " m "
[1] "My N is P and I am female"
[1] "^female$| female | female$" " f "
[1] "My N is P and I am f "
#
You can use the 2 character sequences "\\<" and "\\>" to match
the beginning and end of a "word" (where the match takes up zero
characters):
  > dataset <- c("corn", "cornmeal", "corn on the cob", "popcorn", "this corn is sweet")
  > grep("^corn$|^corn | corn$", dataset)
  [1] 1 3
  > grep("\\<corn\\>", dataset)
  [1] 1 3 5
  > gsub("\\<corn\\>", "CORN", dataset)
  [1] "CORN"              
  [2] "cornmeal"          
  [3] "CORN on the cob"   
  [4] "popcorn"           
  [5] "this CORN is sweet"
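
Applied to the abbreviation loop from earlier in the thread, the word-boundary markers could be used roughly as follows. This is a sketch rather than a tested drop-in: the column names are passed directly to data.frame() for brevity, and the variable patt is built the same way as in the earlier reply.

```r
abbrvs <- data.frame(pattern     = c("peter", "name", "male", "female"),
                     replacement = c("P", "N", "m", "f"),
                     stringsAsFactors = FALSE)

str <- "My name is peter and I am female"

for (m in seq_len(nrow(abbrvs))) {
  # Wrap each pattern in \\< and \\> so it only matches a whole word
  patt <- paste0("\\<", abbrvs$pattern[m], "\\>")
  str  <- gsub(patt, abbrvs$replacement[m], str)
}
print(str)
```

Since "male" no longer matches inside "female", the order of the rows stops mattering. Note that "_" usually counts as a word character for these markers, so this approach would not split underscore-joined strings such as "My_name_is_peter".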

If your definition of a "word" is more expansive it gets complicated.
E.g., if words might include letters, numbers, and periods but not
underscores or anything else, you could use:
  > gsub("(^|[^.[:alpha:][:digit:]])corn($|[^.[:alpha:][:digit:]])",
      "\\1CORN.BY.ITSELF\\2",
      c("corn.1", "corn_2", " corn", "4corn", "1.corn"))
  [1] "corn.1"          
  [2] "CORN.BY.ITSELF_2"
  [3] " CORN.BY.ITSELF" 
  [4] "4corn"           
  [5] "1.corn"
Moving to perl regular expressions would probably make this simpler.    
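
As a rough illustration of the Perl-style route, the same "letters, numbers, and periods" definition of a word can be written with negative lookbehind and lookahead, assuming perl=TRUE is acceptable. The lookarounds check the neighbouring characters without consuming them, so no capture groups or back-references are needed in the replacement. The variable names x and res are illustrative.

```r
x <- c("corn.1", "corn_2", " corn", "4corn", "1.corn")

# "corn" must not be preceded or followed by a letter, digit, or period
res <- gsub("(?<![.[:alnum:]])corn(?![.[:alnum:]])",
            "CORN.BY.ITSELF", x, perl = TRUE)
print(res)
```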

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com