UTF-16 input and read.delim/scan

2 messages · Patrick Callier, Peter Dalgaard

On May 18, 2012, at 20:19, Patrick Callier wrote:

            
This stuff is highly locale-dependent (and locales are OS-dependent). As I understand things, the encoding= argument to scan() or read.table() says that the file is in a foreign encoding and that you want its strings kept and handled in that encoding, whereas fileEncoding= means that you want the input converted to your current encoding and then want to work with the converted data. In the first case you need to get the encoding right; in the second you additionally need to be in a locale that allows the conversion.
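As a rough illustration of the difference between the two arguments, here is a minimal sketch. The file names and the two-column table are made up for the example (they are not the minimal.txt from this thread), and what you see printed will depend on your locale.

```r
# Hypothetical sample data: a tiny tab-separated table with two Chinese
# characters, written with \u escapes so the script itself stays ASCII.
txt <- "HITId\tTitle\nA1\t\u770b\u770b\n"

# Case 1: fileEncoding= asks R to convert the input to the current locale.
path16 <- tempfile(fileext = ".txt")
raw16 <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(raw16, path16)   # file now holds UTF-16LE bytes, no BOM
dat16 <- read.delim(path16, fileEncoding = "UTF-16LE",
                    stringsAsFactors = FALSE)
# In a UTF-8 locale this should print the characters; in a locale that
# cannot represent them, expect warnings and NA or escaped output instead.
print(dat16)

# Case 2: encoding= does no conversion; it only declares how the bytes
# already in the file should be interpreted (Latin-1 or UTF-8).
path8 <- tempfile(fileext = ".txt")
writeBin(charToRaw(txt), path8)   # file holds UTF-8 bytes
dat8 <- read.delim(path8, encoding = "UTF-8", stringsAsFactors = FALSE)
print(Encoding(dat8$Title))       # the declared encoding of the string
print(charToRaw(dat8$Title))      # the bytes, unchanged from the file
```

Note that encoding= only understands Latin-1 and UTF-8, so for a UTF-16 file the conversion route (fileEncoding=) is the one that applies.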

For file(), requesting an encoding means asking for conversion, so if that doesn't work, you are out of luck (and you're just confusing the issue anyway). Here are a couple of examples run in a Latin-1 locale; notice that if the Chinese characters cannot be converted to your current locale, then <U+1234>-style output is the best you can hope for.

Peter-Dalgaards-MacBook-Air:minimal pd$ LC_ALL="da_DK.ISO8859-1" R --vanilla < minimal2.R

R version 2.14.2 (2012-02-29)
....
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'minimal.txt'
                           HITId                      HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
                                                                     Title
1 <U+770B><U+770B><U+53E5><U+5B50><U+FF0C><U+5199><U+5199><U+60F3><U+6CD5>
                                                                                          Question
1 <U+8BF7><U+770B><U+4EE5><U+4E0B><U+7684><U+53E5><U+5B50><U+FF0C><U+518D><U+56DE><U+7B54><U+95EE>
                           HITId                      HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
                                                                                                         Title
1 ?\234\213?\234\213?\217??\220?\214?\206\231?\206\231?\203??\225
                                                                                                                                          Question
1 ??\234\213??\213?\232\204?\217??\220?\214?\206\215?\233\236?\224?\227?
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'minimal.txt'
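
The <U+1234>-style output mentioned above can be reproduced without any file, by asking iconv() directly to convert to an encoding that has no Chinese characters. This is only a sketch of the conversion step, not the script used for the transcript; the sample string is an assumption, and the behaviour without sub= can differ between iconv implementations.

```r
# Two Chinese characters as a UTF-8 string (hypothetical sample text).
s <- "\u770b\u770b"

# Latin-1 cannot represent these characters. On most platforms the plain
# conversion fails and returns NA.
plain <- iconv(s, from = "UTF-8", to = "latin1")
print(plain)

# With sub = "Unicode", each unconvertible character is replaced by its
# code point in <U+xxxx> form, which is the escaped style seen above.
escaped <- iconv(s, from = "UTF-8", to = "latin1", sub = "Unicode")
print(escaped)
```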