
accented vowels

On 11-08-15 7:48 PM, Denis Chabot wrote:
Unicode sometimes gives different ways to encode what is rendered as the 
same character (e.g. letter + accent versus accented letter).  I think 
(see below) the OS uses one convention, but R chooses the other when it 
parses your text.

Cut and paste worked for me in a build of R 2.13.0 Patched that 
predates 2.13.1 by a few weeks (I'm not up to date on my Mac):


 > x <- list.files()
 > x
[1] "1_MO2 soles Sète sda.Rda"
 > gsub("Sète", "XXXX", x)
[1] "1_MO2 soles XXXX sda.Rda"



In the gsub() call, I didn't try to type the pattern containing Sète; I 
just cut and pasted it from the printed version of x.

One other possibility (and perhaps it's the best one, if your 
substitutions are all so simple) is to use the useBytes=TRUE option to 
gsub.  You can use charToRaw to see the bytes in a string, to make sure 
they are what you expect.

When I do that, I see that the è really is encoded differently in the 
two cases:

 > charToRaw("Sète") # cut and paste from list.files() output
[1] 53 65 cc 80 74 65
 > charToRaw("Sète") # entered on the keyboard
[1] 53 c3 a8 74 65
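
The comparison above depends on how the text reached the console. A minimal sketch that reproduces it with \u escapes instead, so nothing hinges on the keyboard or the clipboard (the variable names are mine, chosen for illustration):

```r
# Build both forms explicitly from code points.
decomposed  <- "Se\u0300te"  # "e" followed by U+0300 COMBINING GRAVE ACCENT
precomposed <- "S\u00e8te"   # U+00E8 LATIN SMALL LETTER E WITH GRAVE

print(charToRaw(decomposed))   # [1] 53 65 cc 80 74 65
print(charToRaw(precomposed))  # [1] 53 c3 a8 74 65

# The two strings render alike but are not the same string,
# so a pattern in one form does not match text in the other.
same <- identical(decomposed, precomposed)             # FALSE
hit  <- grepl(precomposed, decomposed, fixed = TRUE)   # FALSE
```

The first form has six bytes and the second has five, which is why a pattern typed on the keyboard fails against a file name returned by list.files().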

So your solution is ugly:  you'll need to code all your substitutions 
twice (or more!) to handle all the possible ways the same letter could 
be encoded.  Or maybe iconv() or some other function has an option to 
normalize the encoding.  (I've just read some more about the issue in 
http://en.wikipedia.org/wiki/Unicode_equivalence; normalization is what 
you want to do, but I don't know how to do it.)
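
One possible way to do the normalization, offered as a sketch rather than a tested recipe: the add-on package stringi has a function stri_trans_nfc() that composes letter + accent sequences into single characters (NFC). The helper below, compose_accents(), is a hypothetical name of mine. It uses stringi when that package is installed, and otherwise falls back to a small hand-written table that covers only the handful of pairs listed, so the fallback is not a general normalizer.

```r
# Hypothetical helper: bring strings to the precomposed (NFC-style) form.
compose_accents <- function(x) {
  x <- enc2utf8(x)
  if (requireNamespace("stringi", quietly = TRUE)) {
    # Full Unicode normalization, if stringi is available.
    return(stringi::stri_trans_nfc(x))
  }
  # Fallback (assumption: only these few French vowels matter).
  from <- c("e\u0300", "e\u0301", "e\u0302", "a\u0300", "u\u0300")
  to   <- c("\u00e8",  "\u00e9",  "\u00ea",  "\u00e0",  "\u00f9")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# A file name in the decomposed form, as list.files() returned it above.
x <- "1_MO2 soles Se\u0300te sda.Rda"
fixed_name <- gsub("S\u00e8te", "XXXX", compose_accents(x), fixed = TRUE)
print(fixed_name)
```

With the file names normalized first, each substitution only has to be written once, in the precomposed form.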

Duncan Murdoch