Coding systems.

3 messages · gerald.jean at dgag.ca, Jan van der Laan

#
Hello,

I am using R 2.15.2 on a 64-bit Linux box.  I run R through Emacs' ESS.

R runs in a French (Canadian French) locale, and lately I have been getting
surprising results from functions that build factor variables from character
variables.  Many of the variables in the input data.frames are character
variables and contain Latin accents, for example the "é" in "Montréal".  I
wasted several days playing with coding systems, trying to understand why
some code gives the expected result when run one command at a time from the
command line, but does not when cut and pasted into a function.

For example the following code:

==============================================================================
ttt.rmr <- sima.31122012$rmrnom
ttt.rmr.2 <- ifelse(ttt.rmr %in% c("Edmonton", "Edmundston",
                                   "Charlottetown", "Calgary", "Winnipeg",
                                   "Victoria", "Vancouver", "Toronto",
                                   "St. John's", "Saskatoon", "Regina",
                                   "Québec", "Ottawa - Gatineau (Ontario",
                                   "Ottawa - Gatineau (partie",
                                   "Montréal", "Halifax", "Fredericton"),
                    "Grandes villes",
                    ifelse(ttt.rmr == "", "Manquant", "Autres"))
unique(ttt.rmr.2)
ttt.rmr.2 <- factor(ttt.rmr.2,
                    levels = c("Grandes villes", "Autres", "Manquant"),
                    labels = c("Grandes villes", "Autres", "Manquant"))
==============================================================================

will put "Montréal" and "Québec" in the "Grandes villes" level of the factor
variable, while running the same code inside a function puts them in
"Autres".  The variable "rmr.Merged" in the data.frame
"test2.sima.31122012.DataPrep" is the output of the function, which, of
course, does a lot of other stuff.
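
One plausible reading of this symptom, offered as an illustration rather than a diagnosis: the same visible name can be stored as different bytes depending on the encoding, so a literal typed at the prompt and one parsed from a file need not be byte-identical. The sketch below builds "Montréal" from explicit bytes so that it does not depend on how this message is encoded; all variable names are made up for the example.

```r
# "Montréal" as raw bytes in two encodings: the é is one byte (e9) in
# Latin-1 and two bytes (c3 a9) in UTF-8.
latin1_bytes <- as.raw(c(0x4d, 0x6f, 0x6e, 0x74, 0x72, 0xe9, 0x61, 0x6c))
utf8_bytes   <- as.raw(c(0x4d, 0x6f, 0x6e, 0x74, 0x72, 0xc3, 0xa9, 0x61, 0x6c))

# Byte for byte, the two spellings differ.
identical(latin1_bytes, utf8_bytes)

# Turn the bytes into strings and declare which encoding each one uses.
x_latin1 <- rawToChar(latin1_bytes)
Encoding(x_latin1) <- "latin1"
x_utf8 <- rawToChar(utf8_bytes)
Encoding(x_utf8) <- "UTF-8"

# With correct declarations R translates before comparing, so these match.
x_latin1 == x_utf8
x_latin1 %in% c("Toronto", x_utf8)

# Converting explicitly yields the same bytes as the UTF-8 spelling.
converted <- iconv(x_latin1, from = "latin1", to = "UTF-8")
identical(charToRaw(converted), utf8_bytes)
```

When a string carries no declaration, or a file is parsed under the wrong encoding, R has nothing to base that translation on, which would be consistent with only the accented names landing in "Autres".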

==============================================================================
ttt.w <- which(ttt.rmr.2 != test2.sima.31122012.DataPrep$rmr.Merged)
frequence(test2.sima.31122012.DataPrep$rmrnom[ttt.w])
         Frequency  Percent Cum.Freq Cum.Percent
Montréal   1301254 79.57173  1301254    79.57173
Québec      334068 20.42827  1635322   100.00000
==============================================================================

All other city names, which have no accents, were correctly classified; only
"Montréal" and "Québec" were not.  Together they represent over 1.5M
records, which is not negligible!

Following is my ".Renviron" file, where I set up environment variables for R.

==============================================================================
R_PROFILE_USER="/home/jeg002/MyRwork/StartUp/profile.R"
# export R_PROFILE_USER
R_HISTFILE="/home/jeg002/MyRwork/.Rhistory"
## Default editor
EDITOR=${EDITOR-${VISUAL-'/usr/local/bin/emacsclient'}}
## Default pager
PAGER=${PAGER-'/usr/local/bin/emacsclient'}

## Setting locale, hoping it will be OK "all" the time!!!
LANG=fr_CA
LANGUAGE=fr_CA
LC_ADDRESS=fr_CA
LC_COLLATE=fr_CA
LC_TYPE=fr_CA
LC_IDENTIFICATION=fr_CA
LC_MEASUREMENT=fr_CA
LC_MESSAGES=fr_CA
LC_NAME=fr_CA
LC_PAPER=en_US
LC_NUMERIC=en_US
LC_TELEPHONE=fr_CA
LC_MONETARY=fr_CA
LC_TIME=fr_CA
R_PAPERSIZE='letter'
==============================================================================

and:
[1] "LC_CTYPE=fr_CA;LC_NUMERIC=C;LC_TIME=fr_CA;LC_COLLATE=fr_CA;LC_MONETARY=fr_CA;LC_MESSAGES=fr_CA;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_CA;LC_IDENTIFICATION=C"
LANGUAGE     LANG
 "fr_CA"  "fr_CA"

I must be missing something!  Maybe someone can make sense of this.
Thanks for your support,

Gérald Jean
                                                                                   
Gerald Jean, M.Sc. in Statistics
Senior Statistical Advisor
Corporate Actuarial, Modelling and Research
Property and Casualty Insurance, Mouvement Desjardins
Lévis (head office)
418 835-4900, ext. 7639
1 877 835-4900, ext. 7639
Fax: 418 835-6657


                                                                                  
#
Could it be that your R script is saved in a different encoding than
the one used by R (which will probably be UTF-8 since you're working on
Linux)?
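
A rough way to test that hypothesis is to ask R which encoding the session uses and then check whether the script's bytes are valid UTF-8. This is only a sketch: `file_is_valid_utf8()` is a hypothetical helper, the two temporary scripts stand in for the real file, and `validUTF8()` assumes R 3.3.0 or later (so not the R 2.15.2 of the original post).

```r
# Does the current session use UTF-8 as its native encoding?
session_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])
Sys.getlocale("LC_CTYPE")

# Hypothetical helper: TRUE if every line of a file is valid UTF-8.
# Pure-ASCII files also pass, since ASCII is a subset of UTF-8.
file_is_valid_utf8 <- function(path) {
  lines <- readLines(path, warn = FALSE)
  all(validUTF8(lines))
}

# Two throwaway scripts holding the same line, x <- "Montréal", written
# byte by byte so the example does not depend on this file's own encoding.
prefix <- charToRaw("x <- \"Montr")
suffix <- charToRaw("al\"\n")
utf8_script   <- tempfile(fileext = ".R")
latin1_script <- tempfile(fileext = ".R")
writeBin(c(prefix, as.raw(c(0xc3, 0xa9)), suffix), utf8_script)    # é in UTF-8
writeBin(c(prefix, as.raw(0xe9), suffix), latin1_script)           # é in Latin-1

utf8_ok   <- file_is_valid_utf8(utf8_script)
latin1_ok <- file_is_valid_utf8(latin1_script)
```

If the session reports a non-UTF-8 encoding while the script passes the UTF-8 check (or the reverse), the two disagree, which is the situation described above.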
#
Hello,

As Jan pointed out, the problem is with the encoding in which R saves the
function.  If I set this encoding to "UTF-8" in source(), everything is fine.

If I go into either my .bash_profile or my .Renviron file and set all the
locale variables to "fr_CA.UTF8", it should do the job, and up to a point it
does: I can source functions containing multibyte characters, save them in
my personal library, and they run as expected.

BUT with these settings, R throws the following error at startup:

Erreur : caractères multioctets incorrects dans l'analyse de code (parser)
à la ligne 28

which translates to something like:

Error: incorrect multi-byte characters in the code analysis (parser) at
line 28

Furthermore, I can't install any package: install.packages() returns the
same error and stops execution.

I know the workaround is not to specify a UTF-8 locale in my profiles and
to explicitly pass the argument "encoding = 'UTF-8'" to source().  But to
me, this is somewhat of an inconsistency!
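
For reference, the workaround with an explicit encoding can be sketched as follows. The function `is_big_city()` and the temporary script are invented for the illustration and stand in for the real data-preparation code; the script is written as UTF-8 bytes so the example behaves the same whatever the session locale.

```r
# Write a one-function script as UTF-8 bytes, whatever the session locale.
# is_big_city() is a made-up stand-in for the real function.
code <- "is_big_city <- function(x) x %in% c(\"Montr\u00e9al\", \"Qu\u00e9bec\")\n"
script <- tempfile(fileext = ".R")
writeBin(charToRaw(enc2utf8(code)), script)

# Tell source() how the file is encoded instead of relying on the locale.
env <- new.env()
source(script, local = env, encoding = "UTF-8")

# A value as it might arrive from the data, here converted to Latin-1.
city <- iconv(enc2utf8("Montr\u00e9al"), from = "UTF-8", to = "latin1")
matched <- env$is_big_city(city)
```

Because the file's encoding is declared when it is read, the accented literals inside the function should compare correctly against data held in another declared encoding.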

Thanks to Jan for his insights,

Gérald
                                                                                   
On 2013/11/27 02:26, Jan van der Laan <rhelp at eoos.dds.nl> wrote
(to r-help at r-project.org, cc gerald.jean at dgag.ca, subject
"Re: [R] Coding systems."):

> Could it be that your r-script is saved in a different encoding than
> the one used by R (which will probably be UTF8 since you're working on
> linux)?
>
> --
> Jan