
gsub() with unicode and escape character

7 messages · William Dunlap, Peter Langfelder, Brian Ripley +3 more

#
Dear helpers,

I'm trying to replace a character with a unicode code inside a data
frame using gsub(), but unsuccessfully.
[1] "d??g"  "w??lf" "cat"

It's not that a data frame cannot have unicode codes, cf. e.g.
[1] d?g  w?lf cat
Levels: cat d<U+0254>g w<U+0254>lf

I've done the best I can based on what ?gsub and ?enc2utf8 tell me,
but I haven't found a solution.

Unrelated to that problem, but related to gsub(): I can't find
a way to make gsub() treat the backslash as a literal character. In regular
expressions, \\ should represent "the character \", but gsub() doesn't do that:
[1] "og"   "wolf" "cat"

Thank you
Sverre
#
To put a backslash in the replacement expression
of sub or gsub (when fixed=FALSE) use 4 backslashes.
The rationale is that, in the replacement expression,
backslash-digit means to use the digit'th parenthesized
subpattern as the replacement and backslash-backslash means
to put in a literal backslash.  However, the R parser also uses
backslashes to signify things like unicode characters (that
backslash is not in the string stored by R, but is just a
signal to the parser) and it requires a doubled backslash
to enter a backslash.  2*2 is 4 backslashes.  E.g.,

 > gsub("([[:digit:]]+)([[:alpha:]]+)", "alpha=<<\\2>>\\\\numeric=<<\\1>>", c("12P", "34Cat"))
 [1] "alpha=<<P>>\\numeric=<<12>>"   "alpha=<<Cat>>\\numeric=<<34>>"
 > cat(.Last.value, sep="\n") # see what is really in the strings
 alpha=<<P>>\numeric=<<12>>
 alpha=<<Cat>>\numeric=<<34>>
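
As a rough sketch of the "when fixed=FALSE" caveat above: with fixed=TRUE the replacement is taken literally rather than scanned for backreferences, so two typed backslashes (one real backslash) should be enough. The vector `animals` is made up for illustration and only stands in for the original poster's data.

```r
# Hypothetical example data, not taken from the original post
animals <- c("dog", "wolf", "cat")

# Regex mode (fixed = FALSE, the default): four typed backslashes
# end up as one backslash in the result
regex_result <- gsub("d", "\\\\", animals)

# Literal mode (fixed = TRUE): the replacement is used as is,
# so two typed backslashes (one real backslash) suffice
fixed_result <- gsub("d", "\\", animals, fixed = TRUE)

identical(regex_result, fixed_result)

# cat() shows the raw strings without print()'s escaping
cat(fixed_result, sep = "\n")
```

The literal mode only helps when the pattern is also a plain string, since fixed=TRUE turns off regular-expression matching for the pattern as well.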

I don't know about your unicode/encoding problem.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Don't know the answer to your first question, but for the \\ see below.

On Sat, Jul 16, 2011 at 7:19 PM, Sverre Stausland
<johnsen at fas.harvard.edu> wrote:

            
Use \\\\ (yes, that's 4 backslashes).
[1] "\\og" "wolf" "cat"
\og wolf cat>


The reason is that the backslashes get interpreted twice: once when
R's parser reads the string literal, and a second time when gsub
processes the replacement.
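
One way to see the two levels of interpretation is to count characters at each stage. This is only an illustrative sketch; the variable names are made up.

```r
typed <- "\\\\"            # four backslashes typed in the source code
nchar(typed)               # the parser stores only two characters

out <- gsub("d", typed, "dog")
nchar(out)                 # "d" became one backslash, so the length is still 3

print(out)                 # print() escapes the backslash again for display
cat(out, "\n")             # cat() shows the raw contents of the string
```

The doubled backslash in the print() output is display escaping only; nchar() and cat() show what the string really holds.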

HTH

Peter
#
You forgot the 'at a minimum' information required by the posting 
guide.

Most likely this is a limitation of the locale you used (and failed to 
tell us about) on the OS you used (...).
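
A minimal sketch of how one might check whether the locale is involved, assuming a vector like the one in the question (the name `animals` is hypothetical). It inspects code points instead of relying on what the console prints. Results may differ between R versions and platforms, so this is a diagnostic, not a fix.

```r
# Inspect the character set of the current session
Sys.getlocale("LC_CTYPE")
l10n_info()[["UTF-8"]]     # TRUE when the session runs in a UTF-8 locale

# Hypothetical example data
animals <- c("dog", "wolf", "cat")
replaced <- gsub("o", "\u0254", animals)

# Look at code points rather than printed glyphs, because printing
# depends on what the console and locale can display.
# A middle value of 596 (hex 0254) would mean U+0254 survived gsub().
utf8ToInt(enc2utf8(replaced[1]))

# Declared encoding of each element
Encoding(replaced)
```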
On Sat, 16 Jul 2011, Sverre Stausland wrote:

            

  
    
#
I am really sorry if I have not understood your statement correctly :(

You said:
" To put a backslash in the replacement expression of sub or gsub
(when fixed=FALSE) use 4 backslashes"

I understand that this works if I want to replace something with 2
backslashes. But what if I want to replace it with just 1 backslash? I
tried the following, but it didn't work (R keeps asking for more input):

gsub("d","\\\",my.data$animals)

You said:
"replacement expression backslash-digit means to use the digit'th
parenthesized subpattern as the replacement"

Would you please elaborate on this phenomenon?  If I use "backslash-digit
= 6" then I don't see any difference in the end result:
[1] "\\og" "wolf" "cat"

It would be really helpful if you could elaborate more on these issues.

Thanks,
On Sun, Jul 17, 2011 at 8:34 AM, William Dunlap <wdunlap at tibco.com> wrote:
#
On 17.07.2011 15:18, Nipesh Bajaj wrote:
Yes, because that translates (after R's processing) to "\\\" and ends up,
after the actual replacement, as the string "\\\og".

If you interpret that, it means 1 backslash (coming from the first two),
an (escaped) "o", which is the same as a regular "o", and finally the "g".
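
For the backslash-digit question, a small sketch with parenthesized groups may help. The patterns here are made up for illustration and are not the ones from the earlier messages.

```r
# \\1 and \\2 (as typed) refer to the first and second parenthesized
# groups of the pattern; here they are swapped
swapped <- gsub("(d)(o)", "\\2\\1", "dog")
swapped

# Six typed backslashes plus a digit: the parser stores \\\1, which
# gsub() reads as a literal backslash followed by group 1
prefixed <- gsub("(d)", "\\\\\\1", "dog")
cat(prefixed, "\n")
nchar(prefixed)            # one backslash plus "dog"
```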

Uwe Ligges
#
Sorry for not including those details. Here is a more detailed description:
[1] "d??g"  "w??lf" "cat"
R version 2.13.1 (2011-07-08)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Best
Sverre

On Sun, Jul 17, 2011 at 2:26 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote: