UTF-16 input and read.delim/scan
2 messages · Patrick Callier, Peter Dalgaard
On May 18, 2012, at 20:19 , Patrick Callier wrote:
Hi all,
I am running 64-bit R 2.15.0 on Windows 7. I am trying to use read.delim
to read from a file that has 2-byte Unicode (CJK) characters.
Here is an example of the data (it is tab-delimited if that gets messed up):
HITId HITTypeId Title
2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z ?????????
????????????
So read.delim (code below) doesn't read in correctly. It reads up until it
hits the CJK characters and then terminates with a warning:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'minimal.txt'
The "Title" field gets filled with an NA. I played around with scan() a
little bit and it can read the file correctly if I send it an open file
with the correct encoding given for the "encoding" parameter. It barfs with
the same warnings if I just send it the filename and set the fileEncoding
parameter.
Here is some test code with the above text in a file "minimal.txt"
# works
scan(file("minimal.txt",encoding="UTF-16LE"),what=character(),nlines=2)
# don't work
scan("minimal.txt",what=character(),nlines=2) # output is in wrong encoding
scan("minimal.txt",what=character(),nlines=2,fileEncoding="UTF-16LE") # "invalid input found on input connection"
read.delim(file("minimal.txt",encoding="UTF-16LE"), sep = "\t", header=TRUE) # ditto
Is this a bug? Or am I just doing something wrong? Thanks for any help you
can provide.
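
Whether a file really is UTF-16LE can often be checked from its first two bytes. The following is a minimal sketch, not part of the original post: it writes a tiny stand-in file (the temporary path and its contents are hypothetical) and tests for the UTF-16LE byte order mark FF FE. A file without a BOM can still be UTF-16LE, so a negative result is not conclusive.

```r
# Hypothetical stand-in file: the UTF-16LE BOM (FF FE) followed by "H" (48 00).
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xff, 0xfe, 0x48, 0x00)), path)

# Read only the first two bytes, without any encoding conversion.
head_bytes <- readBin(path, what = "raw", n = 2)

# TRUE when the file starts with the UTF-16LE byte order mark.
has_utf16le_bom <- identical(head_bytes, as.raw(c(0xff, 0xfe)))
```

For a real file, replace `path` with the file name, e.g. "minimal.txt".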
This stuff is highly locale dependent (and locales are OS dependent). As I understand things, the encoding= argument to scan() or read.table() says that the file is in a foreign encoding and you want to treat strings in that encoding, whereas fileEncoding= means that you want to convert to your current encoding and then treat the converted data. In the first case, you need to get the encoding right; in the other, in addition, you need to be in a locale that allows the conversion. For file(), requesting an encoding means asking for conversion, so if that doesn't work, you are out of luck (and you're just confusing the issue anyway).

Here are a couple of examples in Latin1; notice that if you can't convert Chinese characters to your current locale, then the <U+1234> style output is the best you can hope for.

Peter-Dalgaards-MacBook-Air:minimal pd$ LC_ALL="da_DK.ISO8859-1" R --vanilla < minimal2.R
R version 2.14.2 (2012-02-29)
....
read.delim(file("minimal.txt",encoding="UTF-8"), sep = "\t", header=TRUE,encoding="UTF-8")
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'minimal.txt'
read.delim(file="minimal.txt", encoding="UTF-8")
HITId HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
Title
1 <U+770B><U+770B><U+53E5><U+5B50><U+FF0C><U+5199><U+5199><U+60F3><U+6CD5>
Question
1 <U+8BF7><U+770B><U+4EE5><U+4E0B><U+7684><U+53E5><U+5B50><U+FF0C><U+518D><U+56DE><U+7B54><U+95EE>
read.delim(file="minimal.txt")
HITId HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
Title
1 ?\234\213?\234\213?\217??\220?\214?\206\231?\206\231?\203??\225
Question
1 ??\234\213??\213?\232\204?\217??\220?\214?\206\215?\233\236?\224?\227?
read.delim(file="minimal.txt", fileEncoding="UTF-8")
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'minimal.txt'
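
The transcript above fails whenever the data must pass through the current locale's encoding. One possible way around that, sketched here under stated assumptions rather than taken from the thread, is to read the raw bytes, convert them to UTF-8 explicitly with iconv(), and hand the result to read.delim() through its text= argument, so the strings stay in UTF-8 instead of being converted to the locale. The sample data, the temporary file, and the assumption that the input is UTF-16LE without a byte order mark are all illustrative. In a locale that cannot display the characters they may still print in <U+1234> form, but the data themselves are intact.

```r
# Hypothetical sample data; "\u" escapes yield UTF-8 strings in any locale.
title <- "\u770b\u770b\u53e5\u5b50"
txt <- paste0("HITId\tTitle\n", "A1\t", title, "\n")

# Write the text as raw UTF-16LE bytes (no BOM) to a temporary file,
# standing in for a file such as "minimal.txt".
path <- tempfile(fileext = ".txt")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Read the bytes back and convert them to UTF-8 explicitly,
# instead of to the encoding of the current locale.
bytes <- readBin(path, what = "raw", n = file.size(path))
utf8 <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")

# Parse the converted text as tab-delimited data.
dat <- read.delim(text = utf8, stringsAsFactors = FALSE)
```

The trade-off is that the whole file is held in memory as one string, which is fine for small files like the example but may not suit very large ones.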
--Pat
--
Patrick Callier
Georgetown University
http://www12.georgetown.edu/students/prc23/
Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com