Decompressing raw vectors in memory

7 messages · Brian Ripley, Duncan Temple Lang, Hadley Wickham

#
Hi all,

I'm struggling to decompress a gzip'd raw vector in memory:

content <- readBin("http://httpbin.org/gzip", "raw", 1000)

memDecompress(content, type = "gzip")
# Error in memDecompress(content, type = "gzip") :
#  internal error -3 in memDecompress(2)

I'm reasonably certain that the file is correctly compressed, because
if I save it out to a file, I can read the uncompressed data:

tmp <- tempfile()
writeBin(content, tmp)
readLines(tmp)

So that suggests I'm using memDecompress incorrectly.  Any hints?

Thanks!

Hadley
#
On 02/05/2012 14:24, Hadley Wickham wrote:
Headers.

#
Looking at http://tools.ietf.org/html/rfc1952:

* the first two bytes are id1 and id2, which are 1f 8b as expected

* the third byte is the compression method: as.integer(content[3]) gives 8, i.e. deflate

* the fourth byte is the flag

  rawToBits(content[4])
  [1] 00 00 00 00 00 00 00 00

  which indicates no extra header fields are present

So the header looks ok to me (with my limited knowledge of gzip)
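
For anyone following along, those header checks can be reproduced offline against a gzip stream that R writes itself (a sketch using only base R; the expected byte values come straight from RFC 1952):

```r
# Write a small gzip file and inspect its header per RFC 1952.
tmp <- tempfile(fileext = ".gz")
writeLines("hello", gzfile(tmp))

content <- readBin(tmp, "raw", file.info(tmp)$size)
stopifnot(
  content[1] == as.raw(0x1f),      # ID1
  content[2] == as.raw(0x8b),      # ID2
  as.integer(content[3]) == 8      # CM = 8: deflate
)
rawToBits(content[4])              # FLG: all zeros, no optional fields
```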

Stripping off the header doesn't seem to help either:

memDecompress(content[-(1:10)], type = "gzip")
# Error in memDecompress(content[-(1:10)], type = "gzip") :
#  internal error -3 in memDecompress(2)

I've read the help for memDecompress but I don't see anything there to help me.

Any more hints?

Thanks!

Hadley
#
On 02/05/2012 16:43, Hadley Wickham wrote:
Well, it seems what you get there depends on the client, but I did

tystie% curl -o foo "http://httpbin.org/gzip"
tystie% file foo
foo: gzip compressed data, last modified: Wed May  2 17:06:24 2012, max compression

and the final part worried me: I do not know if memDecompress() knows 
about that format.  The help page does not claim it can do anything 
other than de-compress the results of memCompress() (although past 
experience has shown that it can in some cases).  gzfile() supports a 
much wider range of formats.
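
A possible middle ground between memDecompress() and a temp file is gzcon(), which layers gzip decompression over any connection, including a rawConnection() serving the bytes straight from memory. A sketch (the sample stream is created with gzfile() purely for illustration):

```r
# Make a gzip'd raw vector to experiment with.
tmp <- tempfile(fileext = ".gz")
writeLines("hello from a gzip stream", gzfile(tmp))
content <- readBin(tmp, "raw", file.info(tmp)$size)

# Decompress it entirely in memory: no second temp file needed.
con <- gzcon(rawConnection(content))
readLines(con)
#> [1] "hello from a gzip stream"
close(con)
```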
#
Ah, ok.  Thanks.  Then in that case it's probably just as easy to save
it to a temp file and read that.

  con <- file(tmp) # R automatically detects compression
  open(con, "rb")
  on.exit(close(con), TRUE)

  readBin(con, raw(), file.info(tmp)$size * 10)

The only challenge is figuring out what n to give readBin. Is there a
good general strategy for this?  Guess based on the file size and then
iterate until result of readBin has length less than n?

  n <- file.info(tmp)$size * 2
  content <- readBin(con, raw(), n)
  n_read <- length(content)
  while (n_read == n) {
    more <- readBin(con, raw(), n)
    content <- c(content, more)
    n_read <- length(more)
  }

Which is not great style, but there shouldn't be many reads.
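
One general-purpose shape for that loop (a sketch; read_all_raw is a made-up helper name) collects fixed-size chunks in a list and concatenates once at the end, so the accumulated vector isn't copied on every iteration:

```r
# Read every byte from a connection of unknown length.
read_all_raw <- function(con, chunk_size = 64 * 1024) {
  chunks <- list()
  repeat {
    chunk <- readBin(con, raw(), chunk_size)
    if (length(chunk) == 0) break      # short or empty read: stream exhausted
    chunks[[length(chunks) + 1]] <- chunk
  }
  if (length(chunks) == 0) raw(0) else do.call(c, chunks)
}

tmp <- tempfile()
writeBin(as.raw(1:200), tmp)
con <- file(tmp, "rb")
identical(read_all_raw(con, chunk_size = 64), as.raw(1:200))
#> [1] TRUE
close(con)
```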

Hadley
#
I understand the desire not to have any dependency on additional
packages, and I have no desire to engage in any "mine's better" exchanges.
So I write this just for the record.
The gzunzip() function handles this:
{
  "origin": "24.5.119.171",
  "headers": {
    "Content-Length": "",
    "Host": "httpbin.org",
    "Content-Type": "",
    "Connection": "keep-alive",
    "Accept": "*/*"
  },
  "gzipped": true,
  "method": "GET"
}


Just FWIW, as I really don't like writing to temporary files,
mostly so that we might move towards better security in R.

   D.
#
Funnily enough I just discovered that RCurl already handles this: you
just need to set encoding = "gzip".  No extra dependencies, and yours
is better ;)

Hadley