Skip to content

[Bioc-devel] IRanges should support long vectors

4 messages · Pariksheet Nanda, Hervé Pagès

#
Hello,

R 3.0 added support for long vectors, but it's not yet possible to use them
with IRanges.  Without long vector support it's not possible to construct
an IRanges object with more than 2^31 elements:
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
:
  long vectors not supported yet: memory.c:3715
In addition: Warning message:
In .normargSEW0(start, "start") :
  NAs introduced by coercion to integer range
This is true when using the latest version from GitHub
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

Matrix products: default
BLAS:
/home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
LAPACK:
/home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IRanges_2.19.5      S4Vectors_0.22.0    BiocGenerics_0.30.0

loaded via a namespace (and not attached):
 [1] ps_1.3.0           prettyunits_1.0.2  withr_2.1.2        crayon_1.3.4

 [5] rprojroot_1.3-2    assertthat_0.2.1   R6_2.4.0
backports_1.1.4
 [9] magrittr_1.5       cli_1.1.0          curl_3.3           remotes_2.0.4

[13] callr_3.2.0        tools_3.6.0        compiler_3.6.0
processx_3.3.1
[17] pkgbuild_1.0.3     BiocManager_1.30.4
Pariksheet
2 days later
#
Hi Pariksheet,
On 5/25/19 12:49, Pariksheet Nanda wrote:
Hello,

R 3.0 added support for long vectors, but it's not yet possible to use them
with IRanges.  Without long vector support it's not possible to construct
an IRanges object with more than 2^31 elements:




ir <- IRanges(start = 1:(2^31 - 1), width = 1)
ir <- IRanges(start = 1:2^31, width = 1)


Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
:
  long vectors not supported yet: memory.c:3715
In addition: Warning message:
In .normargSEW0(start, "start") :
  NAs introduced by coercion to integer range

Right. This is a known limitation of IRanges objects and Vector derivatives in general.

I wonder what's your use case?

FWIW supporting long Vector derivatives (including long IRanges) has been on the TODO list for a while. Unfortunately it seems that we keep getting distracted by other things.

Note that even when long IRanges objects are supported, computing on them will not be very efficient because the memory footprint of these objects will be very big (> 16Gb). It is much more interesting (and fun) to use long Vector derivatives that have a **small** memory footprint like long Rle's or long StitchedIPos/StitchedGPos objects:

  library(S4Vectors)


  x <- Rle(1:15, 1e9)
  x
  # integer-Rle of length 15000000000 with 15 runs
  #   Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
  #   Values :          1          2          3 ...         14         15


  object.size(x)
  # 1288 bytes


  library(IRanges)

  ipos <- IPos(IRanges(1, 2e9))
  ipos
  # StitchedIPos object with 2000000000 positions and 0 metadata columns:
  #                       pos
  #                 <integer>
  #            [1]          1
  #            [2]          2
  #            [3]          3
  #            [4]          4
  #            [5]          5
  #            ...        ...
  #   [1999999996] 1999999996
  #   [1999999997] 1999999997
  #   [1999999998] 1999999998
  #   [1999999999] 1999999999
  #   [2000000000] 2000000000

  object.size(ipos)
  # 2736 bytes


  library(GenomicRanges)

  gpos <- GPos("chr1:1-5e8")  # not a real organism ;-)
  gpos
  # StitchedGPos object with 500000000 positions and 0 metadata columns:
  #             seqnames       pos strand
  #                <Rle> <integer>  <Rle>
  #         [1]     chr1         1      *
  #         [2]     chr1         2      *
  #         [3]     chr1         3      *
  #         [4]     chr1         4      *
  #         [5]     chr1         5      *
  #         ...      ...       ...    ...
  # [499999996]     chr1 499999996      *
  # [499999997]     chr1 499999997      *
  # [499999998]     chr1 499999998      *
  # [499999999]     chr1 499999999      *
  # [500000000]     chr1 500000000      *
  # -------
  # seqinfo: 1 sequence from an unspecified genome; no seqlengths

  object.size(gpos)
  # 10552 bytes



We're not here yet but the goal would be to have light-weight objects that can represent all the genomic positions in the Human genome.

H.










This is true when using the latest version from GitHub




BiocManager::install("Bioconductor/IRanges")
sessionInfo()


R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

Matrix products: default
BLAS:
/home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
LAPACK:
/home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IRanges_2.19.5      S4Vectors_0.22.0    BiocGenerics_0.30.0

loaded via a namespace (and not attached):
 [1] ps_1.3.0           prettyunits_1.0.2  withr_2.1.2        crayon_1.3.4

 [5] rprojroot_1.3-2    assertthat_0.2.1   R6_2.4.0
backports_1.1.4
 [9] magrittr_1.5       cli_1.1.0          curl_3.3           remotes_2.0.4

[13] callr_3.2.0        tools_3.6.0        compiler_3.6.0
processx_3.3.1
[17] pkgbuild_1.0.3     BiocManager_1.30.4







Pariksheet


_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=n-ClvxxGJJ0dHFwPMExjAYre_kqKvi-YPrVMP5Oyhqw&s=pkNJuBKcSYIy8xLk4Sao82m4w_GhgjEsoffdW0jgzIc&e=


--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org<mailto:hpages at fredhutch.org>
Phone:  (206) 667-5791
Fax:    (206) 667-1319
#
Hi Herv?,

Indeed, an IRanges with 2^31 elements is 17.1 GB.
The reason I was interested in IRanges, was GRanges are needed to create
the BSgenome::BSgenomeViews.
More broadly, my use case is chopping up a large genome into a fixed kmer
size so that repetitive "unmappable" regions can be removed.
https://github.com/coregenomics/kmap
My interest in long vectors is to address issue #8
https://github.com/coregenomics/kmap/issues/8

The workaround I've imagined so far is to have my kmap::kmerize function
return an iterator that creates GRanges less than length 2^31.
Using iterators doesn't even need any additional packages: they're
implemented in the BiocParallel bpiterator unit tests as returning a
function that keeps returning objects until it returns NULL.

But looking at how much more efficient your GPos, etc functions are,
perhaps maybe BSgenomeViews requiring a GRanges is not as reasonable?
I don't even know of a sane way to mock a BSgenome object for writing
tests.  It's irritating to have to use actual small genomes for tests.

Pariksheet
On Tue, May 28, 2019 at 3:35 AM Pages, Herve <hpages at fredhutch.org> wrote:

            

  
  
#
On 5/28/19 07:57, Pariksheet Nanda wrote:
Good point. I opened an issue on GitHub for this:
   https://github.com/Bioconductor/BSgenome/issues/2
Yeah, no user-level BSgenome constructor at the moment. Here is one you
could use:

BSgenome <- function(dna, circ_seqs=NA, genome=NA,
                      organism=NA, common_name=NA, provider=NA,
                      release_date=NA, release_name=NA, source_url=NA)
{
     ## Some sanity checks.
     if (!is(dna, "DNAStringSet"))
         dna <- as(dna, "DNAStringSet")
     seqnames <- names(dna)
     if (is.null(seqnames))
         stop("'dna' must have names")
     if (!is.character(circ_seqs))
         circ_seqs <- as.character(circ_seqs)
     if (!identical(circ_seqs, NA_character_)) {
         if (anyNA(circ_seqs) ||
             anyDuplicated(circ_seqs) ||
             !all(circ_seqs %in% seqnames))
             stop(wmsg("when not set to NA, 'circ_seqs' must ",
                       "contain unique sequence names that are ",
                       "present in 'names(dna)'"))
     }
     stopifnot(isSingleStringOrNA(genome),
               isSingleStringOrNA(organism),
               isSingleStringOrNA(common_name),
               isSingleStringOrNA(provider),
               isSingleStringOrNA(release_date),
               isSingleStringOrNA(release_name),
               isSingleStringOrNA(source_url))

     ## Write the sequences to disk.
     seqs_dirpath <- tempfile(pattern="BSgenome_seqs_dir")
     dir.create(seqs_dirpath)
     twobit_filepath <- file.path(seqs_dirpath, "single_sequences.2bit")
     rtracklayer::export(dna, twobit_filepath)

     ## Create the BSgenome object.
     BSgenome:::BSgenome(organism=as.character(organism),
                         common_name=as.character(organism),
                         provider=as.character(provider),
                         provider_version=as.character(genome),
                         release_date=as.character(release_date),
                         release_name=as.character(release_name),
                         source_url=as.character(source_url),
                         seqnames=seqnames,
                         circ_seqs=circ_seqs,
                         mseqnames=NULL,
                         seqs_pkgname=NA_character_,
                         seqs_dirpath=seqs_dirpath)
}

Then:

genome <- BSgenome(c(chr1="TCCTTGAAAAC", chrM="GGAATAT"),
                    circ_seqs="chrM", genome="hg00")
genome
# organism: NA (NA)
# provider: NA
# provider version: hg00
# release date: NA
# release name: NA
# 2 sequences:
#   chr1 chrM 

# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' 
operator
# to access a given sequence)

seqinfo(genome)
# Seqinfo object with 2 sequences (1 circular) from hg00 genome:
#   seqnames seqlengths isCircular genome
#   chr1             11      FALSE   hg00
#   chrM              7       TRUE   hg00

genome$chr1
#   11-letter "DNAString" instance
# seq: TCCTTGAAAAC

We should probably have something like that in the BSgenome package. 
Also opened an issue for that one:

   https://github.com/Bioconductor/BSgenome/issues/3

Best,
H.