
seek() and gzfile() on 32-bit R2.12.0 in linux

4 messages · Brandon Whitcher, Matt Shotwell, Peter Dalgaard

#
You used file() to open "ex.gz", which ought to work, but that relies on
do_url to detect automatically that the file is gzip-compressed. It's a
long shot, but you could verify that the file is a valid gzip file (R
checks that the first two bytes == "\x1f\x8b"), then try the gzfile()
function on the 32-bit machine and see what happens. Also, it would be
nice to see the output of your sessionInfo(), in order to reproduce your
finding.

This might be a bug in the R source:
(1 - unlikely) The C function do_url (src/main/connections.c) fails to
detect the gzip file on the 32 bit machine. Unfortunately, even if
do_url does detect a gzip file, the class of the returned connection
object is still marked c("file", "connection") rather than c("gzfile",
"connection"), so there's no easy check for this. Even so, this doesn't
explain why you get 7.80707e+17.

(2 - more likely) The zlib function gztell (declared in
src/extra/zlib/zlib.h, defined in src/extra/zlib/gzlib.c) returns z_off_t.
The bug may relate to the size of z_off_t on the two different machines
and/or to casting z_off_t to double (which is done just before the value
is returned by gzfile_seek, defined in src/main/connections.c). What a
headache. I need to reproduce the bug to investigate this further.

I have been wondering why double was used in the prototype for the seek
member of (struct Rconn), rather than an integer type. Presumably to
solve problems such as this. I'll be very interested to see what the
core team has to say here.

-Matt
On Tue, 2010-06-22 at 13:04 -0400, Brandon Whitcher wrote:
#
I was able to reproduce this bug. After some investigating, it's clearly
localized to gztell (a zlib function), and the z_off_t type. However,
there may be a broader cross-compiling problem. I don't know what
procedure Brandon used to compile the 32 bit version (I used the gcc
-m32 flag), but we should be sure that we're doing this correctly (and
document it!) before going on a wild goose chase. The real issue may not
lie in zlib at all and may only be manifesting there. A discussion of my
findings is below.

-Matt

I checked that R's file function was recognizing the gzip file as such,
and it was, so that's not the problem. I next modified some code in
gzfile_seek, just above and below the call to gztell (line 1230 of
connections.c), and defined a small function, z_off_t_print, to print the
bits of the z_off_t offset with the least significant bit first (so the
leftmost digit printed is bit 0):

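static void z_off_t_print(z_off_t u)
{
    /* Walk a one-bit mask upward from bit 0, printing each bit of u,
       least significant first. The loop ends once the mask reaches the
       sign bit and is no longer positive, so 63 bits of a 64-bit
       z_off_t are printed. Shifting into the sign bit is formally
       undefined in C; tolerable here only as a throwaway debugging aid. */
    z_off_t mask = 1;
    while( mask > 0 ) {
        printf("%d", (mask & u) > 0 );
        mask <<= 1;
    }
    printf("\n");
}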

static double gzfile_seek(Rconnection con, double where, int origin, int rw)
{
    gzFile  fp = ((Rgzfileconn)(con->private))->fp;

    /** begin modified code **/
    z_off_t pos = 0; /* initialized so the "before" dump is well defined */
    printf("sizeof(z_off_t): %zu\n", sizeof(z_off_t)); /* %zu matches size_t */
    printf("sizeof(double): %zu\n", sizeof(double));
    printf("before gztell():\n");
    z_off_t_print(pos);
    pos = gztell(fp);
    printf("after gztell():\n");
    z_off_t_print(pos);
    printf("(double) pos: %f\n", (double) pos);

    /** end modified code **/
    ...

Here's what happens running code similar to yours in the 32 bit build:
+     "", "11 13 17", file = zz, sep = "\n")
sizeof(z_off_t): 8
sizeof(double): 8
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
000000000000000000000000000000000000110000111011110111001001000
(double) pos: 665367468683821056.000000
[1] 6.653675e+17
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
101000000000000000000000000000000000110000111011110111001001000
(double) pos: 665367468683821056.000000
[1] 6.653675e+17

Hence, in the 32-bit build gztell is doing what we expect in the least
significant 32 bits (the leading "101", read least significant bit first,
is decimal 5), but returns junk in the most significant 32 bits. Here are
the results for the 64 bit build:
sizeof(z_off_t): 8
sizeof(double): 8
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
000000000000000000000000000000000000000000000000000000000000000
(double) pos: 0.000000
[1] 0
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
101000000000000000000000000000000000000000000000000000000000000
(double) pos: 5.000000
[1] 5

No problems with the 64 bit build.
#
Brandon Whitcher wrote:
Please notice that there is NO release of R 2.12.0 until some time
around October. You are using a build from the UNSTABLE development
branch. The stable branch is 2.11.1 with a release date of May 31. If
Ubuntu is claiming that there is such a thing as an R 2.12.0 release, I'd
say that they have a problem.

Not that we don't welcome reports on problems in the development branch,
but do notice that it is by definition UNSTABLE, and that bugs can come
and go without notice.

-pd

 I observe the following behavior