Can't read table encoded in Unicode (R-2.8.1)
On 18/04/2009 1:18 PM, Hilmar Berger wrote:
Hi all,
I have problems reading Unicode (UTF-16) coded tables in R 2.8.1 under
Windows Vista.
Imagine the following table:
a b c d
X 1,2 1,3 1,4
Y 2,2 2,3 2,4
Z 3,2 3,3 3,4
Usually I would use the following code to read the table:
t = read.table("test.txt", header=T, sep="\t",dec=",")
This works well if I create the table using Notepad (the text will be in
UTF-8 or ASCII, then).
I haven't tried 2.8.1 (which is obsolete, since yesterday :-), but in 2.9.0 it works fine if I use the fileEncoding argument to read.table.

Duncan Murdoch
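A minimal sketch of the fileEncoding approach, assuming R 2.9.0 or later. The temporary file name is hypothetical, and "UTF-16LE" is an assumption based on the little-endian, BOM-prefixed UTF-16 that the hexdump in the original message shows; plain "UTF-16" may also work, depending on the iconv implementation in use.

```r
# Recreate the table from the report as a UTF-16LE file with a byte order mark.
# The path is a hypothetical temporary file, not the poster's test.csv.
path <- file.path(tempdir(), "test_utf16.csv")
rows <- c("a\tb\tc\td",
          "X\t1,2\t1,3\t1,4",
          "Y\t2,2\t2,3\t2,4",
          "Z\t3,2\t3,3\t3,4")
txt   <- paste0(paste(rows, collapse = "\r\n"), "\r\n")
bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), bytes), path)  # FF FE is the UTF-16LE BOM

# fileEncoding tells read.table to re-encode the file while reading it.
tab <- read.table(path, header = TRUE, sep = "\t", dec = ",",
                  fileEncoding = "UTF-16LE")
print(tab)
```

Note that fileEncoding is not the same as the encoding argument: fileEncoding converts the bytes of the file as it is read, whereas encoding only marks how strings that have already been read should be interpreted.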
However, if I use, e.g., OpenOffice scalc to create a spreadsheet holding the same data and save it as text (using tabs as separators, no quotes, and Unicode encoding), the command above gives this:
> t = read.table("test.csv", header=T, sep="\t",dec=",")
> t
  ??a
1  NA
2  NA
3  NA

I tried to play with the "encoding" parameter, but that did not change anything. The file from OpenOffice is in UTF-16, as shown by hexdump:

$ hexdump test.csv
0000000 feff 0061 0009 0062 0009 0063 0009 0064
0000010 000d 000a 0058 0009 0031 002c 0032 0009
0000020 0031 002c 0033 0009 0031 002c 0034 000d
0000030 000a 0059 0009 0032 002c 0032 0009 0032
0000040 002c 0033 0009 0032 002c 0034 000d 000a
0000050 005a 0009 0033 002c 0032 0009 0033 002c
0000060 0033 0009 0033 002c 0034 000d 000a
000006e

I tried to read the file using file/readLines, which seemed to work after specifying the encoding:
> a = file("test.csv",open="r", encoding="UTF-16")
> b = readLines(a)
> b
[1] "a\tb\tc\td" "X\t1,2\t1,3\t1,4" "Y\t2,2\t2,3\t2,4" "Z\t3,2\t3,3\t3,4"

Looking at the code of readtable.R in R-2.8.1 and R-2.9.0, it seems that the encoding does not get passed through in the second call to scan() appearing in the code. I'm not sure if this is a bug or if I'm doing something wrong here.

Regards,
Hilmar

------------------
My system and R settings are:
> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.8.1
> Sys.info()
  sysname   release                        version nodename
"Windows"   "Vista"   "build 6001, Service Pack 1"     "PC"
  machine     login                           user
    "x86"
> options("encoding")
$encoding
[1] "native.enc"
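Since readLines on a connection opened with an explicit encoding returned the correct lines above, one possible workaround is to parse those lines through a text connection instead of handing the file name to read.table. This is only a sketch: the temporary file stands in for test.csv, "UTF-16LE" is assumed from the byte order in the hexdump, and it has not been checked on R 2.8.1 itself.

```r
# Hypothetical stand-in for the OpenOffice export: UTF-16LE with a BOM.
path <- file.path(tempdir(), "test_utf16_lines.csv")
rows <- c("a\tb\tc\td",
          "X\t1,2\t1,3\t1,4",
          "Y\t2,2\t2,3\t2,4",
          "Z\t3,2\t3,3\t3,4")
txt   <- paste0(paste(rows, collapse = "\r\n"), "\r\n")
bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), bytes), path)

# Step 1: decode the file into ordinary character strings, as in the post.
con <- file(path, open = "r", encoding = "UTF-16LE")
b <- readLines(con)
close(con)

# Step 2: let read.table parse the already-decoded lines.
tc <- textConnection(b)
tab2 <- read.table(tc, header = TRUE, sep = "\t", dec = ",")
close(tc)
print(tab2)
```

The decoding happens entirely in the file() connection, so read.table never sees the raw UTF-16 bytes.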
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.