getting corrupted data when using readBin() after seek() on a gzfile connection

2 messages · Hervé Pagès, Henrik Bengtsson

#
Hi,

I'm running into more issues when reading data from a gzfile connection.
If I read the data sequentially with successive calls to readBin(), the
data I get looks ok. But if I call seek() between the successive calls
to readBin(), I get corrupted data.

Here is a (hopefully) reproducible example. See my sessionInfo() at the
end (I'm not on Windows, where, according to the man page, seek() is
broken).

   ## Generate data with a repeated easy-to-recognize byte pattern
   ## of length 26:
   mydata <- rep(charToRaw(paste(letters, collapse="")), 400)

   ## Write the data to test.gz file:
   con <- gzfile("test.gz", open="wb")
   writeBin(mydata, con)
   close(con)

   ## Read the data back from test.gz. We'll read blocks of 26 bytes
   ## located at various offsets that are multiples of 26, so we expect
   ## to see our original pattern ("abc...xyz") every time.
   con <- gzfile("test.gz", open="rb")

   ## Offset 0: ok
   > rawToChar(readBin(con, "raw", n=26))
   [1] "abcdefghijklmnopqrstuvwxyz"

   ## Offset 78: still ok
   > seek(con, where=78)
   [1] 26
   > seek(con)
   [1] 78
   > rawToChar(readBin(con, "raw", n=26))
   [1] "abcdefghijklmnopqrstuvwxyz"

   ## Offset 520: data is messed up
   > seek(con, where=520)
   [1] 104
   > seek(con)
   [1] 520
   > rawToChar(readBin(con, "raw", n=26))
   [1] "zabcdefghijklmnopqrstuvvuv"


   ## Offset 2600: very messed up
   > seek(con, where=2600)
   [1] 546
   > seek(con)
   [1] 2600
   > rawToChar(readBin(con, "raw", n=26))
   [1] "xxxxxmpxxxxxxesxxxxxxxxxxp"

   ## Offset 10400: see previous email (subject: "error when calling
   ## seek() twice on a gzfile connection")
   > seek(con, where=10400)
   [1] 2626
   Warning message:
   In seek.connection(con, where = 10400) :
     seek on a gzfile connection returned an internal error

   close(con)
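To test more offsets without comparing strings by eye, the check can be automated roughly as follows. This is only a sketch: the temporary file and the names `offsets` and `ok` are made up for illustration, and the result depends on the R version, so no expected output is shown.

```r
## Same data as in the example above, written to a temporary file
## (hypothetical path, chosen so that test.gz is left untouched).
pattern <- charToRaw(paste(letters, collapse = ""))
mydata <- rep(pattern, 400)
path <- tempfile(fileext = ".gz")
con <- gzfile(path, open = "wb")
writeBin(mydata, con)
close(con)

## Seek to each offset, read 26 bytes, and compare them with the
## corresponding slice of the original data.
con <- gzfile(path, open = "rb")
offsets <- c(0, 78, 520, 2600)
ok <- vapply(offsets, function(off) {
  seek(con, where = off)
  got <- readBin(con, "raw", n = 26)
  identical(got, mydata[off + seq_len(26)])
}, logical(1))
close(con)

print(data.frame(offset = offsets, matches = ok))
```

A FALSE in the `matches` column marks an offset where the bytes read after seek() differ from what was written.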

Reading the data sequentially with no calls to seek() returns the
expected pattern 400 times:

   con <- gzfile("test.gz", open="rb")
   blocks <- sapply(1:400, function(i) rawToChar(readBin(con, "raw", n=26)))

   ## Check the result:

   > readBin(con, "raw", n=26)  # no more data
   raw(0)

   > seek(con)
   [1] 10400

   > table(blocks)
   blocks
   abcdefghijklmnopqrstuvwxyz
                          400

Thanks,
H.

 > sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
#
I can reproduce this (exactly the same output) on Windows:
R version 3.0.0 Patched (2013-04-29 r62694)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.0.0

/Henrik
On Wed, May 8, 2013 at 1:51 AM, Hervé Pagès <hpages at fhcrc.org> wrote: