cat cannot write more than 10000 characters? [R 2.8.1]

7 messages · Daniel Sabanés Bové, Brian Ripley

#
Hi,

While investigating a Sweave hang inside an odfWeave call (OOo XML files
have very long lines), I discovered that cat() cannot write more than
10000 characters to a text file. With longer input, the internal C code
hangs, and the only way out is a quit signal that terminates the R
session. Is this behavior normal?

Code to reproduce this:

testChunk <- paste(rep("a", 10000 + 1), ## delete "+ 1" to be successful
                   collapse="")
output <- tempfile()
cat(testChunk, sep = "\n", file = output, append = TRUE)

My sessionInfo:

R version 2.8.1 (2008-12-22)
i686-pc-linux-gnu (actually the latest openSuse 11.1)
locale:
LC_CTYPE=de_DE.UTF-8;LC_NUMERIC=C;LC_TIME=de_DE.UTF-8;LC_COLLATE=de_DE.UTF-8;LC_MONETARY=C;LC_MESSAGES=de_DE.UTF-8;LC_PAPER=de_DE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_DE.UTF-8;LC_IDENTIFICATION=C

Thanks in advance,
Daniel
#
On Sun, 4 Jan 2009, Daniel Sabanés Bové wrote:

You mean on a single line?
No, works for me on Mac OS X and x86_64 Fedora 8 (as does 10x larger).
Can you run this under a debugger and find where it is going wrong for 
you?
We have writeLines() for that and it is more efficient, especially if you 
keep a connection open.
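
A minimal sketch of the writeLines() approach mentioned above, assuming the goal is to append several long lines to one file. The variable names (`con`, `lens`) are illustrative and not from the thread; the connection is opened once and reused for every write.

```r
# Build the same over-long test string used in the original report.
testChunk <- paste(rep("a", 10000 + 1), collapse = "")
output <- tempfile()

# Open the file once in append mode and keep the connection open.
con <- file(output, open = "a")
writeLines(testChunk, con)   # writes the string followed by a newline
writeLines(testChunk, con)   # second write reuses the open connection
close(con)

# Read the file back and check the length of each line.
lens <- nchar(readLines(output))
```

Opening the connection explicitly avoids reopening the file for every call, which is what happens when `file = output, append = TRUE` is passed to cat() repeatedly.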

  
    
#
Dear Prof. Ripley,
Yes. OOo tries to save space...
Oh, then this might be distribution- or gcc-version-specific:
gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]

glibc is version 2.9-2.3.

Using ddd I found the (relevant part of the) backtrace when interrupting
the infinite loop:

(gdb) backtrace
#0  __gconv (cd=0x846cde0, inbuf=0xbfff7738, inbufend=0x84ca589 "",
outbuf=0xbfff773c, outbufend=0xbfff9e57 "", irreversible=0xbfff76a8) at
gconv.c:80

The program comes here more than 100 000 times... with outbuf and inbuf
always being "\0".

#1  0xb7b581e7 in iconv (cd=0x846cde0, inbuf=0xbfff7738,
inbytesleft=0xbfff7734, outbuf=0xbfff773c, outbytesleft=0xbfff7730) at
iconv.c:53
[this is   result = __gconv (gcd, (const unsigned char **) inbuf,
                        (const unsigned char *)  (*inbuf + *inbytesleft),
                          (unsigned char **) outbuf,
                           (unsigned char *) (*outbuf + *outbytesleft),
                       &irreversible);]

#2  0xb7e44d29 in Riconv (cd=0x846cde0, inbuf=0xbfff7738,
inbytesleft=0xbfff7734, outbuf=0xbfff773c, outbytesleft=0xbfff7730) at
sysutils.c:692
[ this is the only line of Riconv,  return iconv((iconv_t) cd,
(ICONV_CONST char **) inbuf, inbytesleft, outbuf, outbytesleft);]

#3  0xb7d2c337 in dummy_vfprintf (con=0x8400bb0, format=0xb7ee0c48 "%s",
ap=0xbfffc604 "\230?L\b??\005\b?h\a\b?h\a\b??\005\b??\005\b\001") at
connections.c:316
[this is      ires = Riconv(con->outconv, &ib, &inb, &ob, &onb);]

The infinite loop seems to be inside dummy_vfprintf, as this position is
the "highest" inside the backtrace which is reached again and again. And
at line 249 appears the magic number 10000 as BUFSIZE, which is indeed
selected by the preprocessor in my environment!

#4  0xb7d2c4fa in file_vfprintf (con=0x8400bb0, format=0xb7ee0c48 "%s",
ap=0xbfffc604 "\230?L\b??\005\b?h\a\b?h\a\b??\005\b??\005\b\001") at
connections.c:579
[this is  if(con->outconv) return dummy_vfprintf(con, format, ap);]

This and everything above is only reached once, so this might be OK.

#5  0xb7dfe069 in Rvprintf (format=0xb7ee0c48 "%s", arg=0xbfffc604
"\230?L\b??\005\b?h\a\b?h\a\b??\005\b??\005\b\001") at printutils.c:785   
[this is   (con->vfprintf)(con, format, argcopy);]

#6  0xb7dfe244 in Rprintf (format=0xb7ee0c48 "%s") at printutils.c:679    
[this is   Rvprintf(format, ap);]

#7  0xb7d0446c in do_cat (call=0x83032a8, op=0x806b7d4, args=<value
optimized out>, rho=0x830359c) at builtin.c:597   
[this is   Rprintf("%s", p);]

Unfortunately, I'm not experienced in R/C code internals, but if you
have detailed instructions for me (like "show me the value of this
variable after 10000 stops") I can provide more debugging info.
OK, maybe Prof. Leisch wants to improve the Sweave code...?

Thank you very much for your help,
best regards,
Daniel Sabanes
6 days later
#
Looks like a bug in your iconv.  However, that section of code is 
conditionalized by

     if(con->outconv) { /* translate the buffer */

and I don't see that as non-NULL on my systems.  It should only be 
called when you specify an encoding on the output connection, so have 
you set an option (e.g. "encoding")  without telling us?

I was able to reproduce a similar problem by

cat(testChunk, sep = "\n", file = file("output", encoding="latin1"),
     append = TRUE)

in a UTF-8 locale, and I'll add a workaround to the R sources.

Please do run your tests with R --vanilla and make sure they are 
complete -- see the posting guide.
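
A quick way to check whether a non-default encoding option is in effect before reporting, sketched under the assumption that the default value of the option is "native.enc" (as documented for R). The names `enc` and `is_default` are illustrative.

```r
# Inspect the global default encoding used for new connections.
enc <- getOption("encoding")

# "native.enc" means no re-encoding is requested by default.
is_default <- identical(enc, "native.enc")

if (!is_default) {
  message("options(encoding = ", dQuote(enc), ") is set; ",
          "output connections will be re-encoded through iconv.")
}
```

Running this in both a normal session and an R --vanilla session shows whether a startup file such as .Rprofile changes the option.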
On Mon, 5 Jan 2009, Daniel Sabanés Bové wrote:

I think you meant *bytes*, BTW.

  
    
#
Yes, I set the encoding to UTF-8 in my .Rprofile. Sorry that I didn't
mention it earlier. So the complete stand-alone test code, which fails in
R --vanilla, is the following:

### code begin
options (encoding = "utf-8")
testChunk <- paste(rep("a", 10000 + 1), ## delete "+ 1" to be successful
                   collapse="")
output <- tempfile()
cat(testChunk, sep = "\n", file = output, append = TRUE)
### code end

And the version and locale of my system are

R version 2.8.1 (2008-12-22)
i686-pc-linux-gnu
locale:
LC_CTYPE=de_DE.UTF-8;LC_NUMERIC=C;LC_TIME=de_DE.UTF-8;LC_COLLATE=de_DE.UTF-8;LC_MONETARY=C;LC_MESSAGES=de_DE.UTF-8;LC_PAPER=de_DE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_DE.UTF-8;LC_IDENTIFICATION=C
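
For an installation that still shows the hang, one possible stopgap is to pass cat() pieces that each stay well below the 10000-byte buffer. This is only a sketch: the helper `split_chunks` and the piece size of 5000 are hypothetical, and whether it avoids the hang on an affected R 2.8.1 is an untested assumption. The code itself only demonstrates that the split output is identical to the original string.

```r
# Hypothetical helper: split a single string into pieces of at most
# `size` characters (equal to bytes here, since the test data is ASCII).
split_chunks <- function(x, size = 5000L) {
  starts <- seq.int(1L, nchar(x), by = size)
  substring(x, starts, pmin(starts + size - 1L, nchar(x)))
}

testChunk <- paste(rep("a", 10000 + 1), collapse = "")
pieces <- split_chunks(testChunk)

output <- tempfile()
# sep = "" joins the pieces back into one line; "\n" terminates it.
cat(pieces, "\n", sep = "", file = output, append = TRUE)

roundTrip <- readLines(output)
```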


Prof Brian Ripley wrote:
#
On Sun, 11 Jan 2009, Daniel Sabanés Bové wrote:

You really don't want to do that: it adds a considerable overhead and 
relies on a bug-free iconv ....

The latest R-patched should work around this.
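
If re-encoding is needed for only some files, one alternative to the global option is to request it on the individual connection. A sketch, assuming a version of R that contains the workaround (on an affected build this goes through the same iconv path and may still hang); `con` and `written` are illustrative names.

```r
testChunk <- paste(rep("a", 10000 + 1), collapse = "")
output <- tempfile()

# Request the encoding for this one connection instead of via
# options(encoding = ...), so other connections are not re-encoded.
con <- file(output, open = "a", encoding = "UTF-8")
writeLines(testChunk, con)
close(con)

written <- readLines(output)
```

This keeps the conversion overhead limited to the connections that actually need it.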

  
    
#
Thank you very much for your help and advice!

Prof Brian Ripley wrote: