data frames with å, ä, and ö (=non-ASCII-characters) from windows to mac os x

10 messages · Ivan Alves, Gustaf Rydevik, David Winsemius +1 more

#
Hi,
I ran into this issue previously and managed to solve it, but I've
forgotten how and am getting frustrated...

I have a data frame (see below) with Scandinavian characters in R
(2.7.1) running on a Windows XP computer. I save the data frame in an
RData file on a USB stick, and load() it in R (2.8.0) running on OS X
10.5. Now the name of the data frame and all factor labels with
Scandinavian characters are scrambled. How do I make R in OS X read my
data frame?
I have tried to
1) run
 Sys.setlocale("LC_ALL","en_US.UTF-8") ### Doesn't change anything
or
2) run
  defaults write org.R-project.R force.LANG en_US.UTF-8
in the terminal, which doesn't help either.
I must admit that I couldn't quite follow what documentation I found
on locales, so I might have messed up somewhere along the line.

Many thanks in advance for your help!

Regards,

Gustaf


--------

Länkarta <-
structure(list(LANKOD = structure(c(11L, 19L, 10L, 13L, 21L,
7L, 9L, 18L, 8L, 3L, 16L, 6L, 5L, 4L, 15L, 2L, 20L, 17L, 1L,
14L, 12L), .Label = c("AB", "AC", "BD", "C", "D", "E", "F", "G",
"H", "I", "K", "M", "N", "O", "S", "T", "U", "W", "X", "Y", "Z"
), class = "factor"), Län = structure(c(1L, 4L, 3L, 5L, 6L, 7L,
8L, 2L, 9L, 10L, 20L, 21L, 13L, 14L, 15L, 16L, 17L, 18L, 12L,
19L, 11L), .Label = c("Blekinge län", "Dalarnas län", "Gotlands län",
"Gävleborgs län", "Hallands län", "Jämtlands län", "Jönköpings län",
"Kalmar län", "Kronobergs län", "Norrbottens län", "Skåne län",
"Stockholms län", "Södermanlands län", "Uppsala län", "Värmlands län",
"Västerbottens län", "Västernorrlands län", "Västmanlands län",
"Västra Götalands län", "Örebro län", "Östergötlands län"), class =
"factor")), .Names = c("LANKOD",
"Län"), class = "data.frame", row.names = c("0", "1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
#
Hi,

On my system (see below), it works fine (inputting the code below at
the R prompt).  Make sure that the input file is encoded as UTF-8.

Rgds,

Ivan

 > sessionInfo()
R version 2.8.1 Patched (2009-01-14 r47602)
i386-apple-darwin9.6.0

locale:
en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
 > structure(list(LANKOD = structure(c(11L, 19L, 10L, 13L, 21L,7L, 9L,
18L, 8L, 3L, 16L, 6L, 5L, 4L, 15L, 2L, 20L, 17L, 1L,14L, 12L), .Label
= c("AB", "AC", "BD", "C", "D", "E", "F", "G","H", "I", "K", "M", "N",
"O", "S", "T", "U", "W", "X", "Y", "Z"), class = "factor"), Län =
structure(c(1L, 4L, 3L, 5L, 6L, 7L,8L, 2L, 9L, 10L, 20L, 21L, 13L,
14L, 15L, 16L, 17L, 18L, 12L,19L, 11L), .Label = c("Blekinge län",
"Dalarnas län", "Gotlands län","Gävleborgs län","Hallands län",
"Jämtlands län", "Jönköpings län","Kalmar län", "Kronobergs län",
"Norrbottens län", "Skåne län","Stockholms län", "Södermanlands län",
"Uppsala län", "Värmlands län","Västerbottens län",
"Västernorrlands län", "Västmanlands län","Västra Götalands län",
"Örebro län", "Östergötlands län"), class ="factor")), .Names =
c("LANKOD","Län"), class = "data.frame", row.names = c("0", "1", "2",
"3","4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14",
"15","16", "17", "18", "19", "20"))
    LANKOD                  Län
0       K         Blekinge län
1       X       Gävleborgs län
2       I         Gotlands län
3       N         Hallands län
4       Z        Jämtlands län
5       F       Jönköpings län
6       H           Kalmar län
7       W         Dalarnas län
8       G       Kronobergs län
9      BD      Norrbottens län
10      T           Örebro län
11      E    Östergötlands län
12      D    Södermanlands län
13      C          Uppsala län
14      S        Värmlands län
15     AC    Västerbottens län
16      Y  Västernorrlands län
17      U     Västmanlands län
18     AB       Stockholms län
19      O Västra Götalands län
20      M            Skåne län
 > Länkarta <- structure(list(LANKOD = structure(c(11L, 19L, 10L, 13L,
21L,7L, 9L, 18L, 8L, 3L, 16L, 6L, 5L, 4L, 15L, 2L, 20L, 17L, 1L,14L,
12L), .Label = c("AB", "AC", "BD", "C", "D", "E", "F", "G","H", "I",
"K", "M", "N", "O", "S", "T", "U", "W", "X", "Y", "Z"), class =
"factor"), Län = structure(c(1L, 4L, 3L, 5L, 6L, 7L,8L, 2L, 9L, 10L,
20L, 21L, 13L, 14L, 15L, 16L, 17L, 18L, 12L,19L, 11L), .Label =
c("Blekinge län", "Dalarnas län", "Gotlands län","Gävleborgs län",
"Hallands län", "Jämtlands län", "Jönköpings län","Kalmar län",
"Kronobergs län", "Norrbottens län", "Skåne län","Stockholms län",
"Södermanlands län", "Uppsala län", "Värmlands län",
"Västerbottens län", "Västernorrlands län", "Västmanlands län",
"Västra Götalands län", "Örebro län", "Östergötlands län"), class
="factor")), .Names = c("LANKOD","Län"), class = "data.frame",
row.names = c("0", "1", "2", "3","4", "5", "6", "7", "8", "9", "10",
"11", "12", "13", "14", "15","16", "17", "18", "19", "20"))
 > ls()
[1] "Länkarta"
 >
On 16 Jan 2009, at 14:13, Gustaf Rydevik wrote:

            
#
It displays sensibly (at least I think so, not being a reader of any  
Scandinavian language)  on my Mac (10.5.6).

 > Länkarta <-
+ structure(list(LANKOD = structure(c(11L, 19L, 10L, 13L, 21L,
+ 7L, 9L, 18L, 8L, 3L, 16L, 6L, 5L, 4L, 15L, 2L, 20L, 17L, 1L,
+ 14L, 12L), .Label = c("AB", "AC", "BD", "C", "D", "E", "F", "G",
+ "H", "I", "K", "M", "N", "O", "S", "T", "U", "W", "X", "Y", "Z"
+ ), class = "factor"), Län = structure(c(1L, 4L, 3L, 5L, 6L, 7L,
+ 8L, 2L, 9L, 10L, 20L, 21L, 13L, 14L, 15L, 16L, 17L, 18L, 12L,
+ 19L, 11L), .Label = c("Blekinge län", "Dalarnas län", "Gotlands län",
+ "Gävleborgs län", "Hallands län", "Jämtlands län", "Jönköpings län",
+ "Kalmar län", "Kronobergs län", "Norrbottens län", "Skåne län",
+ "Stockholms län", "Södermanlands län", "Uppsala län", "Värmlands län",
+ "Västerbottens län", "Västernorrlands län", "Västmanlands län",
+ "Västra Götalands län", "Örebro län", "Östergötlands län"), class =
+ "factor")), .Names = c("LANKOD",
+ "Län"), class = "data.frame", row.names = c("0", "1", "2", "3",
+ "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
+ "16", "17", "18", "19", "20"))
 > Länkarta
    LANKOD                  Län
0       K         Blekinge län
1       X       Gävleborgs län
2       I         Gotlands län
3       N         Hallands län
4       Z        Jämtlands län
5       F       Jönköpings län
6       H           Kalmar län
7       W         Dalarnas län
8       G       Kronobergs län
9      BD      Norrbottens län
10      T           Örebro län
11      E    Östergötlands län
12      D    Södermanlands län
13      C          Uppsala län
14      S        Värmlands län
15     AC    Västerbottens län
16      Y  Västernorrlands län
17      U     Västmanlands län
18     AB       Stockholms län
19      O Västra Götalands län
20      M            Skåne län
 >
 > sessionInfo()
R version 2.8.1 Patched (2009-01-07 r47515)
i386-apple-darwin9.6.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

(An etiquette note: It is considered impolite to cross post to both  
the r-help and r-sig-mac lists.)
#
You need to use CP1252 not UTF-8 to read the data.  It tells you how 
to do so on the help page ... under 'encoding'. So something like

   A <- read.table(con <- file("myfile", encoding="CP1252"));close(con)
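The connection-based approach above can be tried end to end with a throwaway file. The following is only a sketch: the file contents and column names are made up, and it assumes the session runs in a UTF-8 locale (as the Mac sessions in this thread do), since the connection re-encodes into the session's native encoding.

```r
# Write a tiny semicolon-separated file as raw CP1252 bytes
# (0xe4 is "ä" in CP1252); file name and contents are invented for the demo.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("LANKOD;LAN\nK;Blekinge l"),
           as.raw(0xe4),
           charToRaw("n\n")),
         tmp)

# Let the connection re-encode from CP1252 to the native encoding.
# read.table() opens the unopened connection itself and closes it when done.
# The guard is an assumption of this sketch: it only runs in a UTF-8 session.
if (isTRUE(l10n_info()[["UTF-8"]])) {
  A <- read.table(file(tmp, encoding = "CP1252"), header = TRUE, sep = ";",
                  stringsAsFactors = FALSE)
  print(A)
}
unlink(tmp)
```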

Please don't cross-post ... I am being brief because you did.
On Fri, 16 Jan 2009, Gustaf Rydevik wrote:

            

  
    
#
On Fri, 16 Jan 2009, David Winsemius wrote:

            
I think that is because your email client re-encoded it (as did mine), 
always a hazard of email.  It was marked as iso-8859-1.  Email, unlike 
text files, can have the encoding marked.
Not just impolite, inconsiderate of the time and resources of others: 
you are asked not to do so on the mailing lists' top page.

  
    
#
Reading the help page for Sys.get/set/locale:

"Attempts to change the character set (by Sys.setlocale("LC_CTYPE", ),  
if that implies a different character set) during a session may not  
work and are likely to lead to some confusion.
Value
A character string of length one describing the locale in use (after  
setting for Sys.setlocale), or an empty character string if the  
current locale settings are invalid or NULL if locale information is  
unavailable.
For category = "LC_ALL" the details of the string are system-specific:  
it might be a single locale name or a set of locale names separated by  
"/"(Solaris, Mac OS X) or ";" (Windows, Linux). For portability, it is  
best to query categories individually: it is not necessarily the case  
that the result of foo <- Sys.getlocale() can be used in  
Sys.setlocale("LC_ALL", locale = foo)."

I interpret that as saying that if you use "LC_ALL", then you need to  
pass a character string to Sys.setlocale() that is constructed  
properly for a Mac and that it might have "/"'s. And you need to do it  
at the beginning of a session. And that it will be ignored, as you say  
"not do anything" if not precisely correct. This is what Sys.getlocale  
returns on mine:

"en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8"
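The help text's advice to query categories individually can be sketched as follows. This is an illustration only: the variable names are made up, and the values returned depend entirely on the machine it runs on.

```r
# Query single categories rather than the system-specific "LC_ALL" string.
ctype   <- Sys.getlocale("LC_CTYPE")    # character handling and encoding
collate <- Sys.getlocale("LC_COLLATE")  # sort order

# l10n_info() reports whether the session is multibyte, UTF-8 or Latin-1.
info <- l10n_info()

# Re-applying one category's own value is the round trip the help page
# treats as safe; doing the same with the full "LC_ALL" string may not be.
res <- Sys.setlocale("LC_CTYPE", ctype)

cat("LC_CTYPE:  ", ctype, "\n")
cat("LC_COLLATE:", collate, "\n")
cat("UTF-8 session:", info[["UTF-8"]], "\n")
```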



Hope this helps;

David Winsemius
On Jan 16, 2009, at 8:44 AM, David Winsemius wrote:

            
#
On Fri, Jan 16, 2009 at 2:53 PM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
Thank you for your help, and I apologise for crossposting previously.
I've previously figured out how to solve this issue when using read.table(),
but in this case I was using save() and load() on the data frame,
embedding it in a workspace. Is there a way to tell load() that the
workspace to be loaded was created with a specific encoding?

Regards,

Gustaf
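load() itself has no encoding argument, so one possible workaround is to re-encode the strings after loading. The sketch below is not from the thread: reencode_df is a hypothetical helper, the sample object stands in for a loaded data frame, and it assumes the labels were stored as unmarked CP1252 bytes.

```r
# Hypothetical helper (not from the thread): re-encode the column names,
# factor levels and character columns of a data frame.
reencode_df <- function(df, from = "CP1252", to = "UTF-8") {
  conv <- function(x) iconv(x, from = from, to = to)
  names(df) <- conv(names(df))
  for (j in seq_along(df)) {
    if (is.factor(df[[j]])) {
      levels(df[[j]]) <- conv(levels(df[[j]]))
    } else if (is.character(df[[j]])) {
      df[[j]] <- conv(df[[j]])
    }
  }
  df
}

# Stand-in for a loaded object: labels stored as raw CP1252 bytes
# (0xe4 is "ä", 0xe5 is "å"), assembled the way dput() output builds a data frame.
lan   <- rawToChar(as.raw(c(0x6c, 0xe4, 0x6e)))              # "län"
skane <- rawToChar(as.raw(c(0x53, 0x6b, 0xe5, 0x6e, 0x65)))  # "Skåne"
loaded <- structure(
  list(LAN = structure(1:2, .Label = c(lan, skane), class = "factor")),
  class = "data.frame", row.names = c("0", "1")
)

fixed <- reencode_df(loaded)
Encoding(levels(fixed$LAN))  # the labels are now marked as UTF-8
```

The name of the loaded object itself can be scrambled too; in that case it may be easier to fetch it with get(ls()[1]) or similar and assign the result to a new, ASCII-only name before converting.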
#
On Jan 16, 2009, at 8:48 AM, Prof Brian Ripley wrote:

            
Realizing that it might not work in all situations, would it give
(possibly) useful results to set the encoding string to "iso-8859-1",
the technically incorrect encoding that Gustaf's email was marked with
but which was nonetheless interpreted sensibly?
#
On Fri, 16 Jan 2009, David Winsemius wrote:

            
Actually, it says the opposite: the output you get is not necessarily 
valid input.
However, to set it, just en_US works (Mac locales are by default in 
UTF-8).  In Swedish, you can have:

tystie% locale -a | grep SE
sv_SE
sv_SE.ISO8859-1
sv_SE.ISO8859-15
sv_SE.UTF-8

and setting one of the middle two would have worked.

Annoyingly, Mac OS does not tell you which category is which in the 
"/"-separated locale settings string, so it is basically useless.  I 
believe the entries are in alphabetic (C locale) order of category 
name, since the Mac has only 6 categories.
#
On Fri, 16 Jan 2009, David Winsemius wrote:

            
Yes: CP1252 is a superset of ISO-8859-1.  I knew because only one 
encoding can be used for Swedish on Windows (unlike most other OSes 
where there are three -- see my second posting).