Hi,
I usually do not give a second thought to accented vowels, and R handles everything fine thanks to UTF-8 being used in my R scripts. But today I have a problem: accented vowels do not behave properly when they are imported into R using list.files.
Maybe this is because OS X (I'm using 10.6.8) still uses MacRoman for file names, though visually the names seem to have been read correctly into R.
An example is better than words:
sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
This does not cause a problem:
a = c("1_MO2 crevettes po2crit.Rda", "1_MO2 soles Sète sda.Rda", "1_MO2 turbots po2crit.Rda"); a
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2 turbots po2crit.Rda"
a2 = gsub(" Sète", "S", a); a2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda" "1_MO2 turbots po2crit.Rda"
But if, instead of creating the vector within the R script, I read it as a series of file names, the substitution does not work. I am sorry that I cannot make this a fully reproducible example, as it requires the 3 files to exist on your computer, but you could create 3 dummy files with the same names in a directory of your choice.
don = file.path("données/")
b = list.files(path = don, pattern = "1_MO2"); b
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2 turbots po2crit.Rda"
b2 = gsub(" Sète", "S", b); b2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2 turbots po2crit.Rda"
I am puzzled and also "stuck". For now I'll modify the file name, but I need to be able to handle such names at some point.
Any advice?
thanks in advance,
Denis
accented vowels
7 messages · Denis Chabot, Duncan Murdoch
As a follow-up, I tried this:
a[2]
[1] "1_MO2 soles Sète sda.Rda"
b[2]
[1] "1_MO2 soles Sète sda.Rda"
a[2] == b[2]
[1] FALSE
Denis
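A minimal sketch of how two strings that print identically can still compare as different, assuming a UTF-8 session. The values written with \u escapes are illustrative stand-ins for the typed name and the listed file name; they are not taken from the original files.

```r
# Two spellings of "Sète" that render identically (illustrative values).
typed  <- "S\u00e8te"    # precomposed: U+00E8 (e with grave accent)
listed <- "Se\u0300te"   # decomposed: "e" followed by U+0300 (combining grave)

typed == listed          # FALSE: the underlying code points differ
nchar(typed)             # 4
nchar(listed)            # 5

# Code points make the difference visible.
utf8ToInt(typed)         # 83 232 116 101
utf8ToInt(listed)        # 83 101 768 116 101
```

Printing both values shows the same word, which is why the comparison result looks surprising at the console.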
Possibly your system really is using MacRoman or some other local encoding; in that case, iconv(x, "", "UTF-8") should convert from the local encoding to UTF-8. I think declaring everything to be UTF-8 may be sufficient. When I use list.files(), I see the encoding listed as "unknown", but
x <- list.files()
Encoding(x) <- "UTF-8"
works. However, the iconv() method should be safer.
Duncan Murdoch
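The difference between declaring an encoding and converting one can be sketched as follows. The byte values are illustrative and built by hand so the example does not depend on any files; only base R functions are used.

```r
# The same word stored with two different byte encodings (illustrative values).
utf8_bytes   <- rawToChar(as.raw(c(0x53, 0xc3, 0xa8, 0x74, 0x65)))  # "Sète" as UTF-8
latin1_bytes <- rawToChar(as.raw(c(0x53, 0xe8, 0x74, 0x65)))        # "Sète" as Latin-1

# Declaring: Encoding<- only sets a mark; the bytes stay the same.
Encoding(utf8_bytes)                                    # "unknown"
declared <- utf8_bytes
Encoding(declared) <- "UTF-8"
Encoding(declared)                                      # "UTF-8"
identical(charToRaw(declared), charToRaw(utf8_bytes))   # TRUE

# Converting: iconv() rewrites the bytes from one encoding to another.
converted <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")
charToRaw(converted)                                    # 53 c3 a8 74 65
```

Declaring is only appropriate when the bytes already are valid UTF-8; otherwise conversion is the step that actually changes the data.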
Hi Duncan,
iconv() confirmed what I suspected: there was no problem with the encoding of the result of list.files, and if there had been one, the "è" would not have looked like an "è". Therefore, I got nonsense when treating this "è" as MacRoman to be converted into UTF-8:
iconv(b, from="MacRoman", to="UTF-8")
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Se??te sda.Rda" "1_MO2 turbots po2crit.Rda"
It is not clear, however, that R considered b to be UTF-8:
Encoding(b)
[1] "unknown" "unknown" "unknown"
so I followed your suggestion:
Encoding(b) <- "UTF-8"
Encoding(b)
[1] "unknown" "UTF-8" "unknown"
but gsub still did not work:
b2 = gsub(" Sète", "S", b); b2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 soles Sète sda.Rda" "1_MO2 turbots po2crit.Rda"
I do not know why gsub worked with example "a" but not "b" in the example shown in my original message. Strange and frustrating.
Denis
Unicode sometimes gives different ways to encode what is rendered as the
same character (e.g. letter + accent versus accented letter). I think
(see below) the OS uses one convention, but R chooses the other when it
parses your text.
Cut and paste did just work for me, in a version of R 2.13.0 Patched
which predates 2.13.1 by a few weeks; I'm not up to date on my Mac:
> x <- list.files()
> x
[1] "1_MO2 soles Sète sda.Rda"
> gsub("Sète", "XXXX", x)
[1] "1_MO2 soles XXXX sda.Rda"
In the second line, I didn't try to type the pattern containing Sète; I
just cut and pasted it from the printed version of x.
One other possibility (and perhaps it's the best one, if your
substitutions are all so simple) is to use the useBytes=TRUE option to
gsub. You can use charToRaw to see the bytes in a string, to make sure
they are what you expect.
When I do that, I see that the è really is handled differently in the
two cases:
> charToRaw("Sète") # cut and paste from list.files() output
[1] 53 65 cc 80 74 65
> charToRaw("Sète") # entered on the keyboard
[1] 53 c3 a8 74 65
So your solution is ugly: you'll need to code all your substitutions
twice (or more!) to handle all the possible ways the same letter could
be encoded. Or maybe iconv() or some other function has an option to
normalize the encoding. (I've just read some more about the issue in
http://en.wikipedia.org/wiki/Unicode_equivalence; normalization is what
you want to do, but I don't know how to do it.)
Duncan Murdoch
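One possible way to do the normalization mentioned above is Unicode NFC composition. This is a minimal sketch, assuming the stringi package is installed; stringi is not used anywhere in this thread, and the file name below is an illustrative stand-in written with \u escapes.

```r
library(stringi)  # assumption: stringi is installed; it is not part of base R

listed  <- "1_MO2 soles Se\u0300te sda.Rda"  # decomposed, as a file system might return it
pattern <- " S\u00e8te"                      # precomposed, as typed on a keyboard

gsub(pattern, "S", listed, fixed = TRUE)      # unchanged: the pattern does not match

normalized <- stri_trans_nfc(listed)          # compose "e" + U+0300 into U+00E8
gsub(pattern, "S", normalized, fixed = TRUE)  # "1_MO2 solesS sda.Rda"
```

Normalizing the output of list.files() once, right after reading it, would let the remaining substitutions be written a single time in the keyboard form.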
Hi again Duncan,
the "Errors due to normalization differences" part of the article you referred to seems to confirm your suspicion.
I can get this to work but it is messy:
Sètefileraw = charToRaw(substr(b[2],13,17))
Sètefile = rawToChar(Sètefileraw)
Sètekbraw = charToRaw(substr(a[2],13,16))
Sètekb = rawToChar(Sètekbraw)
c = b
c = gsub(Sètefile, Sètekb, c)
At this point, Sète has become the "keyboard" version and the rest of the script can work:
c2 = gsub(" Sète", "S", c); c2
[1] "1_MO2 crevettes po2crit.Rda" "1_MO2 solesS sda.Rda" "1_MO2 turbots po2crit.Rda"
For this project, I'll keep accented vowels out of file names whenever I have to use gsub on them!
Thanks again,
Denis
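A shorter variant of the same workaround is to let one pattern accept both spellings of the accented letter. This is only a sketch with illustrative file names written as \u escapes; it covers this one letter and would need extending for other accented characters.

```r
# Illustrative file names: one decomposed, one precomposed.
files <- c("1_MO2 soles Se\u0300te sda.Rda",
           "1_MO2 soles S\u00e8te sda.Rda")

# Accept either spelling of the accented letter in a single pattern.
either_e_grave <- "(\u00e8|e\u0300)"
pattern <- paste0(" S", either_e_grave, "te")

gsub(pattern, "S", files)
# both elements become "1_MO2 solesS sda.Rda"
```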
Duncan, I think I'll avoid the problem in another manner, leaving me free to choose whatever file names I want or need.
The reason I used gsub on file names is that I want to collate a series of data.frames using rbind (or rbind.fill). For this, I need to know the names of the data.frames I just read, and this depends on the files that were in the directory I chose to read. I therefore used a series of gsub calls to extract data.frame names from file names (the file names were constructed to make this possible). But I did this only because I did not know how to find the name of a data.frame after it is loaded.
I searched the net to see if this was feasible. I did not find a way to do it (short of calling ls() before and after each load to see what has changed), but I found something close, thanks to you actually, in an answer you gave to another user:
Aug 12, 2011; 3:23pm
Re: Getting data from an *.RData file into a data.frame object.
On 12/08/2011 3:12 PM, Ed Heaton wrote:
Hi, all.
I'm new to R. I've been a SAS programmer for 20 years.
I seem to be having trouble with the most basic task - bringing a table in
an *.RData file into a data.frame object.
Here's how I created the *.RData file.
library(RODBC)
db<- odbcConnect("*******")
df<- sqlQuery(
db
, "select * from schema.table where year(someDate)=2006"
)
save(
df
, file="C:/Documents and Settings/userName/My Documents/table2006.RData"
)
dim(df)
remove(df)
odbcClose(db)
remove(db)
detach("package:RODBC")
Next, I moved that data file (table2006.RData) to another workstation - not
at the client site.
Now, I need to get that data file into a data.frame object. I know this
should be simple, but I can't seem to find out how to do that. I tried the
following. First, after opening R without doing anything, RGui used 35,008
KB of memory. I submitted the following.
debt2006 <- load("T:/R.Data/table2006.RData")
Memory used by RGui jumped to 191,512 KB. So, it looks like the data loaded. However, debt2006 is of type character instead of data.frame.
ls()
[1] "debt2006"
class(debt2006)
[1] "character"
Help, please.
save() and load() work with multiple objects, and the objects keep their names. So your object would be recreated as "df" after the load. If you just want to save the data from one object without its name, use saveRDS() and readRDS().
Duncan Murdoch
I'll use this in the current project to avoid the situation that led to my initial message. Many thanks!
Denis
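The approach described in the quoted answer can be sketched as follows, using only base R. The data frame, the temporary files and the variable names are illustrative assumptions, not the files from this thread.

```r
# Illustrative data frame and temporary files (not the files from the thread).
df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# load() returns the names of the objects it restored, so the name
# does not have to be derived from the file name.
rda_file <- tempfile(fileext = ".Rda")
save(df, file = rda_file)
env <- new.env()
loaded_names <- load(rda_file, envir = env)
loaded_names                                  # "df"
from_rda <- get(loaded_names[1], envir = env)

# saveRDS()/readRDS() store a single object without its name,
# so the caller chooses the name on reading.
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, file = rds_file)
from_rds <- readRDS(rds_file)

# Collating several such files is then a matter of reading and binding.
combined <- do.call(rbind, lapply(c(rds_file, rds_file), readRDS))
nrow(combined)                                # 6
```

Because load() returns a character vector of object names, assigning its result to a variable yields that vector rather than the data, which is what the quoted question ran into.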