[Bioc-devel] Running Time of readBamGappedAlignmentPairs

5 messages · Martin Morgan, Michael Lawrence, Dario Strbenac

#
On 07/31/2013 04:05 PM, Hervé Pagès wrote:
I haven't looked at Dario's data directly, but a surprising bottleneck is the
creation of large character vectors:

$ R --vanilla
[...]
 >  i = seq_len(55780092/2)
 > system.time(as.character(i))
    user  system elapsed
  74.880   0.408  75.451

This can be alleviated by telling R up front that you're likely to need lots of
memory; I alias R to

$ R --min-vsize=2048M --min-nsize=20M --vanilla
[...]
 > i = seq_len(55780092/2)
 > system.time(as.character(i))
    user  system elapsed
  25.269   0.472  25.796

or

$ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla
[...]
 > i = seq_len(55780092/2)
 > system.time(as.character(i))
    user  system elapsed
  12.196   0.464  12.687

These options are documented on ?Memory. The timings above are with

 > R.version.string
[1] "R version 3.0.1 Patched (2013-06-05 r62876)"
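
To check from inside a session whether these settings are in effect, and to see how much of the conversion time goes to the garbage collector, something like the following rough sketch may help. The variable names and the scaled-down n = 1e6 are illustrative assumptions, not part of the measurements above; it relies only on base R (commandArgs, Sys.getenv, system.time, gc.time).

```r
# Sketch: report how this session was started, then time a scaled-down
# version of the as.character() benchmark. n is an assumed, smaller size.
startup_args <- commandArgs()
mem_flags <- grep("^--min-[vn]size=", startup_args, value = TRUE)
gc_grow <- Sys.getenv("R_GC_MEM_GROW", unset = NA)

cat("memory flags:", if (length(mem_flags)) mem_flags else "(none)", "\n")
cat("R_GC_MEM_GROW:", if (is.na(gc_grow)) "(unset)" else gc_grow, "\n")

n <- 1e6
i <- seq_len(n)
gc_before <- gc.time()                       # cumulative GC time so far
timing <- system.time(s <- as.character(i))  # total time for the conversion
gc_spent <- gc.time() - gc_before            # GC time during the conversion

print(timing)
print(gc_spent[1:3])  # user, system, elapsed seconds spent in GC
```

Comparing the GC time with the total, with and without the startup options, gives a rough idea of how much of the cost is collection rather than string creation.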


Martin

  
    
#
On 08/02/2013 07:33 AM, Michael Lawrence wrote:
Currently, R allocates space for a STRSXP containing n = 55780092/2 elements, 
but for all R knows these could all be identical and no CHARSXPs are allocated;
I don't think there's a way to pre-allocate n CHARSXPs. As the elements are 
filled, R sometimes needs to do garbage collection, and it has the increasingly 
onerous task of checking each CHARSXP to see whether it is in use (I think -- it 
seems like an in-use STRSXP might somehow be used to short-circuit this?). Maybe 
it would be helpful to be able to pre-allocate space for n unique CHARSXPs.
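
The difference between a STRSXP whose elements all share one CHARSXP and one with n distinct CHARSXPs can be seen, at least roughly, with object.size(), which takes sharing within a character vector into account. This is only a small illustration under assumed names and an assumed n; it does not show the garbage collector's behaviour itself.

```r
# Sketch: identical elements share one CHARSXP; unique elements need one each.
n <- 1e5
same <- rep("1", n)               # n pointers to a single shared CHARSXP
uniq <- as.character(seq_len(n))  # n distinct strings, one CHARSXP each

size_same <- object.size(same)
size_uniq <- object.size(uniq)
print(size_same, units = "Kb")
print(size_uniq, units = "Kb")
```

The second vector is several times larger, and each of its CHARSXPs is a separate object the collector has to account for.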

It seems like R_GC_MEM_GROW could be made accessible via C calls, but presumably 
this is not a good thing for user code to be entrusted with.

Martin

  
    
4 days later
#
Thanks. There is an improvement with those options. Not as much as in the character creation example, but it does help.

   user  system elapsed 
1048.42   71.16 1120.75
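
For comparison, the improvement in the character creation example can be worked out from the elapsed times reported earlier in the thread: about 2.9x with the larger minimum sizes, and about 5.9x with R_GC_MEM_GROW=3 added. A minimal sketch of that arithmetic (the vector names are mine):

```r
# Elapsed seconds copied from the as.character() timings earlier in the thread.
elapsed <- c(default = 75.451, min_sizes = 25.796, min_sizes_gc_grow = 12.687)

# Speedup of each configuration relative to plain `R --vanilla`.
speedup <- elapsed[["default"]] / elapsed
print(round(speedup, 2))
```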