[Bioc-devel] IRanges should support long vectors
Hi Hervé,

Indeed, an IRanges with 2^31 elements is 17.1 GB. I was interested in IRanges because GRanges are needed to create BSgenome::BSgenomeViews objects.

More broadly, my use case is chopping up a large genome into k-mers of a fixed size so that repetitive "unmappable" regions can be removed.

https://github.com/coregenomics/kmap

My interest in long vectors is to address issue #8:

https://github.com/coregenomics/kmap/issues/8

The workaround I've imagined so far is to have my kmap::kmerize function return an iterator that creates GRanges objects of length less than 2^31. Using iterators doesn't even need any additional packages: they're implemented in the BiocParallel bpiterate unit tests as a function that keeps returning objects until it returns NULL.

But looking at how much more efficient your GPos, etc. functions are, perhaps BSgenomeViews requiring a GRanges is not that reasonable?

I also don't know of a sane way to mock a BSgenome object for writing tests. It's irritating to have to use actual small genomes for tests.

Pariksheet
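For readers following along, the iterator workaround described above could look roughly like the sketch below. It is only an illustration under stated assumptions: kmer_chunk_iterator is a hypothetical name (not part of kmap or BiocParallel), and it returns plain data frames rather than GRanges so that it runs with base R alone.

```r
# Hypothetical helper: returns a closure that yields successive chunks of
# k-mer start positions and then NULL once all k-mers have been emitted.
kmer_chunk_iterator <- function(seq_length, k, chunk_size) {
    n_kmers <- seq_length - k + 1      # number of k-mers in the sequence
    next_start <- 1
    function() {
        if (next_start > n_kmers)
            return(NULL)               # NULL signals that we are done
        last <- min(next_start + chunk_size - 1, n_kmers)
        starts <- seq(next_start, last)
        next_start <<- last + 1        # remember where the next chunk begins
        # A real implementation could build a GRanges here instead; a
        # data.frame keeps this sketch free of package dependencies.
        data.frame(start = starts, width = k)
    }
}

# Toy usage: a 25 bp "genome", 4-mers, at most 10 ranges per chunk.
it <- kmer_chunk_iterator(seq_length = 25, k = 4, chunk_size = 10)
chunks <- list()
while (!is.null(chunk <- it()))
    chunks[[length(chunks) + 1]] <- chunk
chunk_sizes <- vapply(chunks, nrow, integer(1))
```

With a chunk_size below 2^31, every chunk stays within the current length limit, at the cost of the caller having to loop over chunks.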
On Tue, May 28, 2019 at 3:35 AM Pages, Herve <hpages at fredhutch.org> wrote:
Hi Pariksheet,
On 5/25/19 12:49, Pariksheet Nanda wrote:
Hello,
R 3.0 added support for long vectors, but it's not yet possible to use them
with IRanges. Without long vector support it's not possible to construct
an IRanges object with 2^31 or more elements:
ir <- IRanges(start = 1:(2^31 - 1), width = 1)
ir <- IRanges(start = 1:2^31, width = 1)
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
  long vectors not supported yet: memory.c:3715
In addition: Warning message:
In .normargSEW0(start, "start") :
NAs introduced by coercion to integer range
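The warning in the transcript above is consistent with the range of R's 32-bit integer type. The following base R sketch (it does not call IRanges, and the variable names are purely illustrative) shows where the boundary sits:

```r
# The largest representable R integer is 2^31 - 1.
largest_int <- .Machine$integer.max
largest_int == 2^31 - 1

# One past the limit cannot be represented as an integer; as.integer()
# returns NA (normally with a "coercion to integer range" warning).
ok       <- as.integer(2^31 - 1)
too_big  <- suppressWarnings(as.integer(2^31))
is.na(too_big)
```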
Right. This is a known limitation of IRanges objects and Vector
derivatives in general.
I wonder what's your use case?
FWIW supporting long Vector derivatives (including long IRanges) has been
on the TODO list for a while. Unfortunately it seems that we keep getting
distracted by other things.
Note that even when long IRanges objects are supported, computing on them
will not be very efficient because the memory footprint of these objects
will be very big (> 16 GB). It is much more interesting (and fun) to use
long Vector derivatives that have a **small** memory footprint like long
Rle's or long StitchedIPos/StitchedGPos objects:
library(S4Vectors)
x <- Rle(1:15, 1e9)
x
# integer-Rle of length 15000000000 with 15 runs
# Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
# Values : 1 2 3 ... 14 15
object.size(x)
# 1288 bytes
library(IRanges)
ipos <- IPos(IRanges(1, 2e9))
ipos
# StitchedIPos object with 2000000000 positions and 0 metadata columns:
# pos
# <integer>
# [1] 1
# [2] 2
# [3] 3
# [4] 4
# [5] 5
# ... ...
# [1999999996] 1999999996
# [1999999997] 1999999997
# [1999999998] 1999999998
# [1999999999] 1999999999
# [2000000000] 2000000000
object.size(ipos)
# 2736 bytes
library(GenomicRanges)
gpos <- GPos("chr1:1-5e8") # not a real organism ;-)
gpos
# StitchedGPos object with 500000000 positions and 0 metadata columns:
# seqnames pos strand
# <Rle> <integer> <Rle>
# [1] chr1 1 *
# [2] chr1 2 *
# [3] chr1 3 *
# [4] chr1 4 *
# [5] chr1 5 *
# ... ... ... ...
# [499999996] chr1 499999996 *
# [499999997] chr1 499999997 *
# [499999998] chr1 499999998 *
# [499999999] chr1 499999999 *
# [500000000] chr1 500000000 *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
object.size(gpos)
# 10552 bytes
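To put the object sizes above in perspective, here is a rough back-of-the-envelope calculation for a dense IRanges of 2^31 ranges. It assumes each range costs one 4-byte integer start plus one 4-byte integer width and ignores names, metadata columns, and object overhead, so treat it as an estimate only:

```r
# Assumption: 2 integers (start and width) of 4 bytes each per range.
bytes_per_range <- 2 * 4
n_ranges <- 2^31

dense_bytes <- n_ranges * bytes_per_range
dense_gb  <- dense_bytes / 1e9     # decimal gigabytes
dense_gib <- dense_bytes / 2^30    # binary gibibytes
```

Under these assumptions dense_gb is about 17.18 and dense_gib is exactly 16, in line with the figures quoted in this thread, whereas the stitched objects above occupy only a few kilobytes.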
We're not there yet, but the goal would be to have light-weight objects that
can represent all the genomic positions in the Human genome.
H.
This is true when using the latest version from GitHub
BiocManager::install("Bioconductor/IRanges")
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
Matrix products: default
BLAS: /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
LAPACK: /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] IRanges_2.19.5 S4Vectors_0.22.0 BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] ps_1.3.0 prettyunits_1.0.2 withr_2.1.2 crayon_1.3.4
[5] rprojroot_1.3-2 assertthat_0.2.1 R6_2.4.0 backports_1.1.4
[9] magrittr_1.5 cli_1.1.0 curl_3.3 remotes_2.0.4
[13] callr_3.2.0 tools_3.6.0 compiler_3.6.0 processx_3.3.1
[17] pkgbuild_1.0.3 BiocManager_1.30.4
Pariksheet
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319