iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

15 messages · nospam at altfeld-im.de, Martin Maechler, Duncan Murdoch +2 more

#
If I execute the code from the "?write.table" examples section

  x <- data.frame(a = I("a \" quote"), b = pi)
  # (omitted code)
  write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")

the resulting CSV file has a size of 6 bytes which is too short
(truncated):

  """,3

The problem seems to be the iconv function:

  iconv("foo", to="UTF-16")

produces

  Error in iconv("foo", to = "UTF-16"):
  embedded nul in string: '\xff\xfef\0o\0o\0'

In 2010 a (partial) patch for this problem was submitted:

http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

Are there chances to fix this problem since it prevents writing Windows
UTF-16LE text files?



PS: This problem can be reproduced on Windows and Linux.

---------------
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.2.3
6 days later
#
Dear R developers

I think I have found a bug that can be reproduced with two lines of code,
and I would be very thankful for a first assessment or feedback on my
report.

If this is the wrong mailing list or I did something wrong
(e.g. using a semi-"anonymous" email address to protect my privacy and
guard against unwanted spam), please let me know since I am new here.

Thank you very much :-)

J. Altfeld
On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
#
    > Dear R developers
    > I think I have found a bug that can be reproduced with two lines of code,
    > and I would be very thankful for a first assessment or feedback on my
    > report.

    > If this is the wrong mailing list or I did something wrong
    > (e.g. using a semi-"anonymous" email address to protect my privacy and
    > guard against unwanted spam), please let me know since I am new here.

    > Thank you very much :-)

    > J. Altfeld

Dear J.,
(yes, a bit less anonymity would be very welcome here!)

You are right, this is a bug, at least in the documentation, but
probably a real bug in the code as well.

But read on.
    > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
    >>
    >> If I execute the code from the "?write.table" examples section
    >> 
    >> x <- data.frame(a = I("a \" quote"), b = pi)
    >> # (omitted code)
    >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
    >> 
    >> the resulting CSV file has a size of 6 bytes which is too short
    >> (truncated):
    >> 
    >> """,3

reproducibly, yes.
If you look at what write.csv does
and then simplify, you can get a similar wrong result by

  write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

""" 3

and if you debug write.table() you see that its building blocks
here are

	 file <- file(........, encoding = fileEncoding)

a

	 writeLines(*, con = file)

for the column headers, and then "deeper down" C code which I did not
investigate.

But just looking a bit at such a file() object with writeLines()
seems slightly revealing, as e.g., 'eol' does not seem to
"work" for this encoding:

    > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
    > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
    > close(ff)
    > file.show(fn)
    CBA|>
    > file.size(fn)
    [1] 5
    > 

    >> The problem seems to be the iconv function:
    >> 
    >> iconv("foo", to="UTF-16")
    >> 
    >> produces
    >> 
    >> Error in iconv("foo", to = "UTF-16"):
    >> embedded nul in string: '\xff\xfef\0o\0o\0'

but this works

    > iconv("foo", to="UTF-16", toRaw=TRUE)
    [[1]]
    [1] ff fe 66 00 6f 00 6f 00

(indeed showing the embedded '\0's)

    >> In 2010 a (partial) patch for this problem was submitted:
    >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

The patch only addressed iconv() not accepting a 'raw' (instead of
character) argument x.

... and it is more than 5.5 years old, written for an iconv() version
that was less featureful than today's. Current iconv(x) already allows x
to be a list of raw entries.
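
As a small illustration of that last point, here is a minimal sketch (the
variable names are made up for the example, and "UTF-8" is assumed as the
source encoding). It converts a string to UTF-16LE as raw bytes and then
passes the resulting list back to iconv():

```r
# Encode "foo" as UTF-16LE. toRaw = TRUE returns a list of raw vectors,
# so the embedded 0x00 bytes never have to live in an R character string.
raw_list <- iconv("foo", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)
print(raw_list[[1L]])

# A list of raw vectors is accepted as input again, which allows a round trip.
back <- iconv(raw_list, from = "UTF-16LE", to = "UTF-8")
print(back)
```

The character result only exists once the bytes are back in an encoding
without embedded nuls, which is why the round trip succeeds where
iconv("foo", to = "UTF-16") fails.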


    >> Are there chances to fix this problem since it prevents writing Windows
    >> UTF-16LE text files?

    >> 
    >> PS: This problem can be reproduced on Windows and Linux.

indeed.... also on "R devel of today".

I agree it should be fixed... but as I said not by the patch you
mentioned.

Tested patches to fix this are welcome, indeed.

Martin Maechler



    >> ---------------
    >> 
    >> > sessionInfo()
    >> R version 3.2.3 (2015-12-10)
    >> Platform: x86_64-pc-linux-gnu (64-bit)
    >> Running under: Ubuntu 14.04.3 LTS
    >> 
    >> locale:
    >> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    >> LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
    >> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    >> LC_PAPER=en_US.UTF-8       LC_NAME=C                 
    >> [9] LC_ADDRESS=C               LC_TELEPHONE=C
    >> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    >> 
    >> attached base packages:
    >> [1] stats     graphics  grDevices utils     datasets  methods
    >> base     
    >> 
    >> loaded via a namespace (and not attached):
    >> [1] tools_3.2.3
    >> >
    >> 
    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel

#
On 23.02.2016 11:37, Martin Maechler wrote:
I took a look at connections.c. There is a call to strlen() that gets
confused by null characters. I think the obvious fix is to avoid the
call to strlen() as the size is already known:

Index: src/main/connections.c
===================================================================
--- src/main/connections.c	(revision 70213)
+++ src/main/connections.c	(working copy)
@@ -369,7 +369,7 @@
 		/* is this safe? */
 		warning(_("invalid char string in output conversion"));
 	    *ob = '\0';
-	    con->write(outbuf, 1, strlen(outbuf), con);
+	    con->write(outbuf, 1, ob - outbuf, con);
 	} while(again && inb > 0);  /* it seems some iconv signal -1 on
 				       zero-length input */
     } else
With the patch applied:

    > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
    [1] "C"  "B"  "A"  "|"  ">a"
    > file.size(fn)
    [1] 22
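
The two file sizes seen in this thread (5 bytes before, 22 bytes after)
can be reproduced in plain R by mimicking what strlen() sees. This is only
a sketch: c_strlen() is a hypothetical helper written for this
illustration, and it assumes that each line plus its "\n" is converted and
written separately.

```r
lines <- c("C", "B", "A", "|", ">a")
encoded <- iconv(paste0(lines, "\n"), from = "UTF-8", to = "UTF-16LE",
                 toRaw = TRUE)

# Hypothetical helper: number of bytes before the first 0x00,
# which is how C's strlen() measures a buffer.
c_strlen <- function(bytes) {
    nul <- which(bytes == as.raw(0L))
    if (length(nul)) nul[1L] - 1L else length(bytes)
}

bytes_old <- sum(vapply(encoded, c_strlen, integer(1L)))  # like strlen(outbuf)
bytes_new <- sum(lengths(encoded))                        # like ob - outbuf
print(c(old = bytes_old, new = bytes_new))
```

Every ASCII character becomes one byte followed by 0x00 in UTF-16LE, so
strlen() stops after the first byte of each line.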

- Mikko Korpela
#
Excellent analysis, thank you both for the quick reply!

Is there anything I can do to get the bug fixed in the next version of R
(e. g. filing a bug report at https://bugs.r-project.org/bugzilla3/)?
On Tue, 2016-02-23 at 14:06 +0200, Mikko Korpela wrote:
#
On 23/02/2016 4:53 PM, nospam at altfeld-im.de wrote:
Wait a few days, and file a bug report if nothing has happened.

Duncan Murdoch
#
On 23/02/2016 7:06 AM, Mikko Korpela wrote:
That may be okay on Unix, but it's not enough on Windows.  There the \n 
that writeLines adds at the end of each line isn't translated to 
UTF-16LE properly, so things get messed up.  (I think the \n is 
translated, but the \r that Windows wants is not, so you get a mix of
8-bit and 16-bit characters.)

Duncan Murdoch
#
On 24.02.2016 15:47, Duncan Murdoch wrote:
That's unfortunate. I tested my tiny patch on Linux. I don't know what
kind of additional changes would be needed to make this work on Windows.
#
On 24/02/2016 9:55 AM, Mikko Korpela wrote:
It looks like a big change is needed for a perfect solution:

  - Windows does the translation of \n to \r\n.  In the R code, Windows
is never told that the output is UTF-16LE, so it does an 8-bit translation.

  - Telling Windows that output is UTF-16LE looks hard:  we'd need to
convert the string to wide chars in R, then write it in wide chars. 
This seems like a lot of work for a rare case.

  - It might be easier to do a hack:  if the user asks for "UTF-16LE", 
then treat it internally as a text file but tell Windows it's a binary 
file.  This means no \n to \r\n translation will be done by Windows.  If 
the desired output file needs Windows line endings, the user would have 
to specify sep="\r\n" in writeLines.
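
Independently of which of these options ends up in R, a user-level sketch
that sidesteps both the strlen() problem and the Windows newline
translation is to build the UTF-16LE bytes explicitly and write them in
binary mode. This is an assumed workaround, not one of the changes
discussed above, and write_utf16le() is a hypothetical helper name.

```r
# Hypothetical helper: write lines as UTF-16LE with explicit line endings.
write_utf16le <- function(lines, path, eol = "\r\n") {
    txt <- paste0(lines, eol, collapse = "")
    bytes <- iconv(enc2utf8(txt), from = "UTF-8", to = "UTF-16LE",
                   toRaw = TRUE)[[1L]]
    writeBin(bytes, path)  # binary write, so no \n to \r\n translation happens
    invisible(length(bytes))
}

fn <- tempfile(fileext = ".txt")
n <- write_utf16le(c("C", "B", "A"), fn)
print(c(bytes = n, on_disk = file.size(fn)))
```

Passing eol = "\n" instead would give a Unix-style file on any platform,
since the line ending is chosen entirely in R.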

Duncan Murdoch
#
On 24/02/2016 11:16 AM, Duncan Murdoch wrote:
A third possibility is to handle the insertion of the \r completely 
within R.  This will have the advantage of making it optional, so it 
would be a lot easier to write a Unix-style file on Windows.

I think either the first or third possibilities will take too much time 
for me to attempt them before 3.3.0.  I'm not sure about the second one yet.

Duncan Murdoch
#
On 23.02.2016 14:06, Mikko Korpela wrote:
I just realized that I was misusing the encoding argument of
readLines(). The code above works by accident, but the following would
be more appropriate:

    > ff <- file(fn, open="r", encoding="UTF-16LE")
    > readLines(ff)
    [1] "C"  "B"  "A"  "|"  ">a"
    > close(ff)

Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
the patch is incomplete on Windows.)

- Mikko
#
Aim for 3.3.1 then? It's not like we have hordes of people demanding to have this fixed right here and now, or do we? 

(A practical problem is that the version control dynamics dictate that at this stage, commits to r-devel _will_ end up in 3.3.0 on April 14, unless backed out and then inserted in the new r-devel branch to be created on March 17.) 

- Peter
On 24 Feb 2016, at 21:49 , Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

            
[...]
#
On 25.02.2016 11:31, Mikko Korpela wrote:
Before inspecting the file with readLines() I tried file.show() but it
did not work as expected. On Linux using a UTF-8 locale, the result of
trying to show the truly UTF-16LE encoded file with

    > file.show(fn, encoding="UTF-16LE")

was a pager showing "<43>" (quotes not included) followed by several
empty lines.

With the following patch, the command works correctly (in this case, on
this platform, not tested comprehensively). The idea is to read the
input file "raw" in order to avoid problems with null characters. The
input then needs to be split into lines after iconv(), or it could be
written to the output file with cat() if the style of line termination
characters does not matter. The 'perl = TRUE' is for assumed performance
advantage only. It can be removed, or one might want to test if there is
a significant difference one way or the other.

- Mikko

Index: src/library/base/R/files.R
===================================================================
--- src/library/base/R/files.R	(revision 70217)
+++ src/library/base/R/files.R	(working copy)
@@ -50,10 +50,13 @@
         for(i in seq_along(files)) {
             f <- files[i]
             tf <- tempfile()
-            tmp <- readLines(f, warn = FALSE)
+            tmp <- list(readBin(f, "raw", file.size(f)))
             tmp2 <- try(iconv(tmp, encoding, "", "byte"))
             if(inherits(tmp2, "try-error")) file.copy(f, tf)
-            else writeLines(tmp2, tf)
+            else {
+                tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
+                writeLines(tmp2, tf)
+            }
             files[i] <- tf
             if(delete.file) unlink(f)
         }
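
The idea behind this patch can also be tried outside file.show(). The
sketch below is a standalone illustration under the same assumptions (a
UTF-16LE input file, converted to UTF-8 here rather than to the native
encoding). The test file is created with writeBin() only so that the
example is self-contained.

```r
fn <- tempfile()
writeBin(iconv("C\nB\r\nA\n", from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)[[1L]], fn)

# Read the whole file as raw bytes so that embedded nuls are harmless.
raw_bytes <- readBin(fn, "raw", file.size(fn))

# Convert in one piece, then split into lines on \r\n, \r or \n.
txt <- iconv(list(raw_bytes), from = "UTF-16LE", to = "UTF-8")
shown <- strsplit(txt, "\r\n?|\n", perl = TRUE)[[1L]]
print(shown)
```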
4 days later
#
I have just committed your first patch (the strlen() replacement) to 
R-devel, and will soon put it in R-patched as well.  I won't have time to
look at this again before the 3.2.4 release, so your file.show() patch 
isn't going to make it unless someone else gets to it.

There's still a faint chance that I'll do more in R-devel before 3.3.0, 
but I think it's best if there were bug reports about both of these 
problems so they don't get forgotten.  Since the first one is mainly a 
Windows problem, I'll write that one up; I'd appreciate it if you could 
write up the file.show() issue, after checking against R-devel rev 70247 
or higher.

Duncan Murdoch
On 25/02/2016 5:54 AM, Mikko Korpela wrote:
#
The file.show() issue is now in the bug tracker. I used a slightly
different example to demonstrate the problem.

https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=16738

- Mikko
On 29.02.2016 20:30, Duncan Murdoch wrote: