Russian language in R

4 messages · lyolya, Lyolya, Duncan Murdoch

#
Hello, 

I am experiencing a problem in reading a database in Russian. The problem
appears when it comes to char variables. I have already tried changing the
encoding, i.e.

options(encoding="UTF-8")

and

options(encoding="KOI8-R")

but every time something unreadable appears in the data frame,
like \x82\xa2\xae\xef etc. 

Could you please tell me whether it is possible to work with Russian
strings in R and, if so, how to do it? Thank you in advance. 

Olga.   

--
View this message in context: http://r.789695.n4.nabble.com/Russian-language-in-R-tp3521206p3521206.html
Sent from the R help mailing list archive at Nabble.com.
#
On 13/05/2011 4:57 PM, lyolya wrote:
Yes, it is possible.  You can test it using a text editor that supports 
Russian.  Just put

x <- " some Russian text "

into a file, then use source() to read that file.  Two outcomes are 
likely:

x will be defined to be a string holding Russian text, and it will 
display properly.

OR

it will be defined to be a string with lots of escapes or mis-displayed 
characters in it.  In the latter case, the problem is that R is assuming 
a different encoding than your text editor.  The l10n_info() function 
will display information about what R is expecting.

If none of the above helps you to get your code working, then you'll 
have to give details on exactly what you're doing to read the file, and 
exactly what is in the file.

Duncan Murdoch
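
The source() test described in the message above can be sketched roughly
as follows. This is only an illustration under stated assumptions: the
temporary file stands in for a file saved from a text editor, the
variable names path, expected and info are made up for the example, and
the file is assumed to be UTF-8. Passing encoding= to source() is one
way to tell R which encoding the file uses instead of relying on the
session default.

```r
# Russian word for "text", written with \u escapes so that this script is
# plain ASCII and cannot be damaged by an editor's encoding settings.
expected <- "\u0442\u0435\u043a\u0441\u0442"

# Write a one-line script to a temporary file as UTF-8 bytes.
# (Stand-in for a file saved from a text editor; the path is hypothetical.)
path <- tempfile(fileext = ".R")
writeLines(paste0('x <- "', expected, '"'), path, useBytes = TRUE)

# Ask R what it considers the native encoding of this session.
info <- l10n_info()
print(info)

# Source the file, stating its encoding explicitly. The guard is there
# because the round trip is only meaningful in a UTF-8 session.
if (isTRUE(info[["UTF-8"]])) {
  source(path, encoding = "UTF-8")
  print(x)              # should display as Russian text
  print(x == expected)  # compares the sourced value with the original
}
```

If x displays as escapes or stray symbols, the encoding declared to
source() probably does not match the one the file was saved in.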
2 days later
#
On 16/05/2011 8:33 AM, Lyolya wrote:
I'm not familiar with Russian encodings.  If you know which encoding the 
file uses, you may be able to use iconv() to convert it to UTF-8, 
which the l10n_info function says is native to your system.  To 
simplify things, use

read.dbf( "MSL_1010.dbf", as.is = TRUE)

so that you don't have to worry about factors and factor names.  Then try

iconv(x, from="KOI8-R", to="UTF-8")

where x is one of the character vectors with bad characters.  If that 
doesn't work, try a different possible encoding (e.g. cp1251).

Duncan Murdoch
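
The "try one encoding, then another" advice above could be sketched like
this. It is a hedged example, not a definitive recipe: the input is the
four bytes quoted in the original question rather than the real
MSL_1010.dbf, convert_columns() is a hypothetical helper invented for
the example, and CP866 is an extra candidate added here on the
assumption that the file may have been written with a DOS code page. It
also assumes the iconv implementation behind R knows all three encoding
names.

```r
# The four bytes quoted in the original question, built from raw values so
# the example does not depend on the encoding of this script.
x <- rawToChar(as.raw(c(0x82, 0xa2, 0xae, 0xef)))

# Candidate source encodings. KOI8-R and CP1251 come from the thread;
# CP866 is an additional guess made for this example.
candidates <- c("KOI8-R", "CP1251", "CP866")

# Convert once per candidate. iconv() returns NA when the bytes are not
# valid in the assumed source encoding; otherwise inspect the result by eye.
results <- vapply(candidates,
                  function(enc) iconv(x, from = enc, to = "UTF-8"),
                  character(1))
print(results)

# Hypothetical helper: apply the chosen conversion to every character
# column of a data frame, such as one returned by read.dbf(..., as.is = TRUE).
convert_columns <- function(df, from, to = "UTF-8") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = to)
  df
}

# Small stand-in data frame instead of the real DBF file.
df <- data.frame(id = 1L, name = x, stringsAsFactors = FALSE)
fixed <- convert_columns(df, from = "CP866")
print(fixed)
```

A conversion that does not return NA is not necessarily the right one,
since single-byte encodings assign a character to almost every byte. The
correct candidate is the one whose output reads as sensible Russian. For
these particular bytes, CP866 should be the only one of the three that
decodes all four to Cyrillic letters.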