[Bioc-devel] Rsamtools Reading TabixFile URL

On 12/22/2011 03:00 PM, Dario Strbenac wrote:
Specifically, I get

 > tbx = TabixFile("http://savantbrowser.com/data/hg18/hg18.refGene.gz")
 > open(tbx)
Error in open.TabixFile(tbx) : failed to open file
In addition: Warning message:
In open.TabixFile(tbx) :
   [khttp_connect_file] fail to open file (HTTP code: 301).

HTTP code 301 means 'Moved Permanently': the server is redirecting to 
a new location, and the redirect is apparently not followed here. If I 
open the URL in a browser I'm redirected to

   url = "http://genomesavant.com/savant//data/hg18/hg18.refGene.gz"

and things work out

   tbx = open(TabixFile(url))
   res <- yieldTabix(tbx)

The error message isn't very helpful, but R is trying to allocate an 
infinite amount of space for the result. This causes an integer 
overflow, reported as 'negative length vectors are not allowed'.

yieldSize is the maximum number of records to read in on each call to 
yieldTabix; if the file contains fewer records than yieldSize, the 
whole file is read.

yieldSize is the number of lines parsed, so it is equivalent to an 
allocation of character(yieldSize). The maximum size allowed by R is 
.Machine$integer.max.
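
As a rough illustration of that limit, a check along the following lines would reject an infinite or too-large yieldSize before it is used as a vector length. checkYieldSize is a hypothetical helper written for this example, not a function in Rsamtools, and it uses base R only.

```r
## Hypothetical helper (not part of Rsamtools): validate a requested
## yieldSize before it is used as the length of a character vector.
checkYieldSize <- function(yieldSize) {
    if (length(yieldSize) != 1L || !is.numeric(yieldSize) || is.na(yieldSize))
        stop("yieldSize must be a single non-missing number")
    if (!is.finite(yieldSize) || yieldSize > .Machine$integer.max)
        stop("yieldSize must be finite and at most .Machine$integer.max")
    if (yieldSize < 1)
        stop("yieldSize must be at least 1")
    as.integer(yieldSize)
}

checkYieldSize(1e6)    # 1000000L

## Inf is rejected with a readable message instead of overflowing
msg <- tryCatch(checkYieldSize(Inf), error = function(e) conditionMessage(e))
```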

I'm not sure what a good rule of thumb is for VCF files; each record 
could easily be 1000 characters, and you'd need memory to manipulate 
the result, so I'd say a yieldSize of at most mem.size / 1000 / 10.
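
To make the arithmetic concrete, here is a small sketch of that rule of thumb. suggestYieldSize and its argument names are made up for this example, and it assumes mem.size is measured in bytes; the factor of 10 is the allowance for working memory, and the result is capped at .Machine$integer.max.

```r
## Hypothetical helper: rule-of-thumb yieldSize from available memory.
## memBytes       available memory, assumed to be in bytes
## bytesPerRecord rough size of one record (about 1000 characters)
## overhead       allowance for memory needed to manipulate the result
suggestYieldSize <- function(memBytes, bytesPerRecord = 1000, overhead = 10) {
    n <- floor(memBytes / bytesPerRecord / overhead)
    ## never exceed the largest integer R can represent
    as.integer(min(n, .Machine$integer.max))
}

suggestYieldSize(8e9)    # 8e9 bytes -> 800000L records per chunk
```

This is only an upper bound; a much smaller value is usually fine.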

But I'm not sure you gain a lot by having very large input chunks. The 
paradigm for processing the whole file is

   tbx = open(TabixFile(url))
   while (length(res <- yieldTabix(tbx))) {
       ## work on res
   }
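
The same loop shape can be tried without Rsamtools or a network connection. The sketch below is only an analogy: it uses base R's readLines() on an open connection in place of yieldTabix(), with a small temporary file standing in for the real data. The loop ends when a read returns zero lines.

```r
## Write a small example file; a stand-in for a real data file.
path <- tempfile(fileext = ".txt")
writeLines(sprintf("record%d", 1:25), path)

yieldSize <- 10L                 # maximum number of lines per chunk
con <- file(path, open = "r")    # open once; reads continue where the last stopped
chunkSizes <- integer()
while (length(res <- readLines(con, n = yieldSize))) {
    ## work on res; here we only record how many lines arrived
    chunkSizes <- c(chunkSizes, length(res))
}
close(con)
unlink(path)

chunkSizes    # 10 10 5
```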

Martin