regexpr - ignore all special characters and punctuation in a string

8 messages · Dimitri Liakhovitski, Marc Schwartz, Sven E. Templer +3 more

#
Hello!

Please point me in the right direction.
I need to match 2 strings, focusing ONLY on the characters themselves and ignoring
all special characters and punctuation marks, including (), "", etc.

For example:
I want the following to return: TRUE

"What a nice day today! - Story of happiness: Part 2." ==
   "What a nice day today: Story of happiness (Part 2)"
#
I think I found a partial answer:

str_replace_all(x, "[[:punct:]]", " ")
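
Why this is only a partial answer can be seen with a base-R analogue (gsub() standing in for str_replace_all(), so no package is needed): replacing each punctuation mark with a space leaves runs of blanks that differ between the two strings. The squeeze-and-trim step and the helper name squeeze are assumptions added for illustration, not part of the suggestion above.

```r
x1 <- "What a nice day today! - Story of happiness: Part 2."
x2 <- "What a nice day today: Story of happiness (Part 2)"

# Base-R stand-in for str_replace_all(x, "[[:punct:]]", " ")
p1 <- gsub("[[:punct:]]", " ", x1)
p2 <- gsub("[[:punct:]]", " ", x2)

# The runs of blanks left behind differ, so the strings still do not match
partial_equal <- p1 == p2

# Assumed extra step: collapse runs of whitespace and trim both ends
squeeze <- function(s) trimws(gsub("[[:space:]]+", " ", s))
full_equal <- squeeze(p1) == squeeze(p2)

print(c(partial = partial_equal, full = full_equal))
```

Replacing with "" instead of " " avoids part of the problem, but the blanks around the "-" in the first string would still need handling.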

On Mon, Apr 20, 2015 at 9:59 AM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:

  
    
#
Look at ?agrep:

Vec1 <- "What a nice day today! - Story of happiness: Part 2."
Vec2 <- "What a nice day today: Story of happiness (Part 2)"

# Match the words, not the punctuation.
# Not fully tested
[1] 1 2
value = TRUE)
[1] "What a nice day today! - Story of happiness: Part 2."
[2] "What a nice day today: Story of happiness (Part 2)"
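
The agrep() calls themselves are not shown above. As an illustration only, a call of roughly this shape returns both strings; the pattern text and max.distance = 0.2 are assumptions chosen for this sketch, not values taken from the post.

```r
Vec1 <- "What a nice day today! - Story of happiness: Part 2."
Vec2 <- "What a nice day today: Story of happiness (Part 2)"

# Hypothetical pattern: the sentence with its punctuation left out
pat <- "What a nice day today Story of happiness Part 2"

# max.distance = 0.2 permits edits up to 20% of the pattern length (assumed value)
idx  <- agrep(pat, c(Vec1, Vec2), max.distance = 0.2)
hits <- agrep(pat, c(Vec1, Vec2), max.distance = 0.2, value = TRUE)

print(idx)
print(hits)
```

Because agrep() matches approximately, it can also accept strings that differ in letters rather than punctuation, so the threshold needs checking against real data.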


Also, possibly:

  http://cran.r-project.org/web/packages/stringdist


Regards,

Marc Schwartz
#
Hi Dimitri,

str_replace_all is not in base R (it comes from the stringr package); you could
use 'gsub' as well, for example:

a = "What a nice day today! - Story of happiness: Part 2."
b = "What a nice day today: Story of happiness (Part 2)"
sa = gsub("[^A-Za-z0-9]", "", a)
sb = gsub("[^A-Za-z0-9]", "", b)
a==b
# [1] FALSE
sa==sb
# [1] TRUE

Note the extra space in a after the '-': spaces have to be removed as well,
which the pattern above already does.

Best,
Sven.

On 20 April 2015 at 16:05, Dimitri Liakhovitski <
dimitri.liakhovitski at gmail.com> wrote:

            

  
  
#
On 20/04/2015 9:59 AM, Dimitri Liakhovitski wrote:

I would transform both strings using gsub(), then compare.

e.g.

clean <- function(s)
  gsub("[[:punct:][:blank:]]", "", s)

clean("What a nice day today! - Story of happiness: Part 2.") ==
clean("What a nice day today: Story of happiness (Part 2)")

This completely ignores spaces; you might want something more
sophisticated if you consider "today" and "to day" to be different, e.g.

clean <- function(s) {
  s <- gsub("[[:punct:]]", "", s)
  gsub("[[:blank:]]+", " ", s)
}

which converts multiple blanks into single spaces.
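
To make the difference between the two versions concrete, the sketch below applies both to the example strings and to a "today" / "to day" pair. The names clean_all and clean_keep_spaces are made up here only to keep the two definitions apart.

```r
# Version 1: drop punctuation and blanks entirely
clean_all <- function(s)
  gsub("[[:punct:][:blank:]]", "", s)

# Version 2: drop punctuation, then collapse runs of blanks to one space
clean_keep_spaces <- function(s) {
  s <- gsub("[[:punct:]]", "", s)
  gsub("[[:blank:]]+", " ", s)
}

s1 <- "What a nice day today! - Story of happiness: Part 2."
s2 <- "What a nice day today: Story of happiness (Part 2)"

# Both versions treat the two example strings as equal
same_all  <- clean_all(s1) == clean_all(s2)
same_keep <- clean_keep_spaces(s1) == clean_keep_spaces(s2)

# Only version 2 keeps "today" and "to day" apart
words_all  <- clean_all("today") == clean_all("to day")
words_keep <- clean_keep_spaces("today") == clean_keep_spaces("to day")

print(c(same_all, same_keep, words_all, words_keep))
```

Version 2 does not trim leading or trailing blanks, so strings that begin or end with punctuation next to a space may need an extra trimws() call.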

Duncan Murdoch
#
On Mon, Apr 20, 2015 at 8:59 AM, Dimitri Liakhovitski <
dimitri.liakhovitski at gmail.com> wrote:

            
Perhaps a variation on:
[1] TRUE
The gsub() removes all characters which are not alphabetic from each string
and then compares them.
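
The code for this variation does not appear above. One possible form, assuming the [:alpha:] character class is what "alphabetic" refers to, is sketched here; the helper name only_letters is hypothetical.

```r
str1 <- "What a nice day today! - Story of happiness: Part 2."
str2 <- "What a nice day today: Story of happiness (Part 2)"

# Keep letters only; digits, blanks and punctuation are all removed
only_letters <- function(s) gsub("[^[:alpha:]]", "", s)

same <- only_letters(str1) == only_letters(str2)

# Side effect of dropping digits: "Part 2" and "Part 3" become identical
part_mix <- only_letters("Story of happiness: Part 2") ==
  only_letters("Story of happiness (Part 3)")

print(c(same, part_mix))
```

If the part number matters for the comparison, [:alnum:] is the safer class, since it keeps digits.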
#
You can use the [:alnum:] regex class with gsub.

str1 <- "What a nice day today! - Story of happiness: Part 2."
str2 <- "What a nice day today: Story of happiness (Part 2)"

gsub("[^[:alnum:]]", "", str1) == gsub("[^[:alnum:]]", "", str2)
[1] TRUE

The same can be done with the stringr package if you really are partial to
it.

library(stringr)
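
The stringr code is missing after the library() call. A likely equivalent, assuming str_replace_all() is used with the same pattern as the gsub() version, would look something like this:

```r
library(stringr)

str1 <- "What a nice day today! - Story of happiness: Part 2."
str2 <- "What a nice day today: Story of happiness (Part 2)"

# Remove everything that is not a letter or a digit, then compare
same <- str_replace_all(str1, "[^[:alnum:]]", "") ==
  str_replace_all(str2, "[^[:alnum:]]", "")

print(same)
```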





On Mon, Apr 20, 2015 at 9:10 AM, Sven E. Templer <sven.templer at gmail.com>
wrote:

  
  
#
Thanks a lot, everybody, for the excellent suggestions!

On Mon, Apr 20, 2015 at 10:15 AM, Charles Determan
<cdetermanjr at gmail.com> wrote: