iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 23.02.2016 11:37, Martin Maechler wrote:
nospam at altfeld-im de <nospam at altfeld-im.de>
on Mon, 22 Feb 2016 18:45:59 +0100 writes:
> Dear R developers
> I think I have found a bug that can be reproduced with two lines of code
> and I am very thankful to get your first assessment or feed-back on my
> report.
> If this is the wrong mailing list or I did something wrong
> (e. g. semi "anonymous" email address to protect my privacy and defend
> against unwanted spam) please let me know since I am new here.
> Thank you very much :-)
> J. Altfeld
Dear J.,
(yes, a bit less anonymity would be very welcome here!)
You are right, this is a bug, at least in the documentation, but probably a real one in the code as well. Read on.
> On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>
>>
>> If I execute the code from the "?write.table" examples section
>>
>> x <- data.frame(a = I("a \" quote"), b = pi)
>> # (omitted code)
>> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>
>> the resulting CSV file has a size of 6 bytes which is too short
>> (truncated):
>>
>> """,3
Reproducibly, yes.
If you look at what write.csv does and then simplify, you can get a
similar wrong result by

   write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

   """ 3

and if you debug write.table() you see that its building blocks here are

   file <- file(........, encoding = fileEncoding)

a writeLines(*, file=file) for the column headers, and then "deeper down"
C code which I did not investigate.
I took a look at connections.c. There is a call to strlen() that gets
confused by null characters. I think the obvious fix is to avoid the
call to strlen() as the size is already known:
Index: src/main/connections.c
===================================================================
--- src/main/connections.c (revision 70213)
+++ src/main/connections.c (working copy)
@@ -369,7 +369,7 @@
                 /* is this safe? */
                 warning(_("invalid char string in output conversion"));
             *ob = '\0';
-            con->write(outbuf, 1, strlen(outbuf), con);
+            con->write(outbuf, 1, ob - outbuf, con);
         } while(again && inb > 0);  /* it seems some iconv signal -1 on
                                        zero-length input */
     } else
But just looking a bit at such a file() object with writeLines() seems slightly revealing, as e.g., 'eol' does not seem to "work" for this encoding:
> fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
> writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
> close(ff)
> file.show(fn)
CBA|>
> file.size(fn)
[1] 5
>
With the patch applied:
> readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
[1] "C" "B" "A" "|" ">a"
> file.size(fn)
[1] 22
- Mikko Korpela
>> The problem seems to be the iconv function:
>>
>> iconv("foo", to="UTF-16")
>>
>> produces
>>
>> Error in iconv("foo", to = "UTF-16"):
>> embedded nul in string: '\xff\xfef\0o\0o\0'
but this works
> iconv("foo", to="UTF-16", toRaw=TRUE)
[[1]]
[1] ff fe 66 00 6f 00 6f 00
(indeed showing the embedded '\0's)
>> In 2010 a (partial) patch for this problem was submitted:
>> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html
That patch only addressed iconv() not accepting a 'raw' (instead of character) argument x.
It is also more than 5.5 years old, written for an iconv() version that was less featureful than today's: current iconv(x) already allows x to be a list of raw entries.
>> Are there chances to fix this problem since it prevents writing Windows
>> UTF-16LE text files?
>>
>> PS: This problem can be reproduced on Windows and Linux.
Indeed, also on "R devel of today".
I agree it should be fixed, but, as I said, not by the patch you mentioned.
Tested patches to fix this are welcome.