Skip to content

[Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews

5 messages · Hervé Pagès, Michael Lawrence, Pariksheet Nanda

#
Hi bioconductor devs,

The BSgenomeViews class has been very useful in efficiently propagating
metadata for running Biostring operations.  I noticed something unexpected
when iterating over views - it seems to return the Biostrings object
instead of a single length Views object, and thus loses the associated view
metadata.  Is this intentional?  Below is some example code, the output and
sessionInfo().  Yes, I also confirmed this happens in the development
version of R / bioconductor 3.5.

On a side note, for unit testing it's been difficult to mock a BSgenome
object due to the link to physical files, and as a workaround I use a
small, arbitrary BSgenome package.  Can one construct a BSgenome from its
package bundled extdata?  The man page examples use packaged genomes.

Code to reproduce the issue:

----------------------------------------------------------------------
library(BSgenome)
genome <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19")
gr <- GRanges(c("chr1:25001-28000", "chr2:30001-31000"))
views <- Views(genome, gr)
views
lapply(views, class)
----------------------------------------------------------------------

Result:

----------------------------------------------------------------------
BSgenomeViews object with 2 views and 0 metadata columns:
      seqnames         ranges strand                       dna
         <Rle>      <IRanges>  <Rle>            <DNAStringSet>
  [1]     chr1 [25001, 28000]      * [GCTTCAGCCT...TTATTTATTG]
  [2]     chr2 [30001, 31000]      * [GACCCTCCTG...AGCAGGTGGT]
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
[[1]]
[1] "DNAString"
attr(,"package")
[1] "Biostrings"

[[2]]
[1] "DNAString"
attr(,"package")
[1] "Biostrings"
----------------------------------------------------------------------

Tested against these configurations:
1) R 3.3.2 + BSgenome 1.42.0 (stable bioconductor 3.4)
2) R 2017-04-05 installed via llnl/spack + BSgenome 1.43.7 (devel
bioconductor 3.5)

sessionInfo for configuration #2 above:
----------------------------------------------------------------------
R Under development (unstable) (2017-04-05 r72488)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS:
/share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRblas.so
LAPACK:
/share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.43.7
 [3] rtracklayer_1.35.10               Biostrings_2.43.7
 [5] XVector_0.15.2                    GenomicRanges_1.27.23
 [7] GenomeInfoDb_1.11.10              IRanges_2.9.19
 [9] S4Vectors_0.13.15                 BiocGenerics_0.21.3

loaded via a namespace (and not attached):
 [1] zlibbioc_1.21.0            GenomicAlignments_1.11.12
 [3] BiocParallel_1.9.5         lattice_0.20-35
 [5] tools_3.5.0                SummarizedExperiment_1.5.7
 [7] grid_3.5.0                 Biobase_2.35.1
 [9] matrixStats_0.52.1         Matrix_1.2-9
[11] GenomeInfoDbData_0.99.0    bitops_1.0-6
[13] RCurl_1.95-4.8             DelayedArray_0.1.7
[15] compiler_3.5.0             Rsamtools_1.27.15
[17] XML_3.98-1.6
[1] TRUE
---
Pariksheet Nanda
PhD Candidate in Genetics and Genomics
System Administrator, Storrs HPC Cluster
University of Connecticut
#
Hi Pariksheet,

This is the expected behavior.

Some background: BSgenomeViews are list-like objects where the *list
elements* (i.e. the elements one extracts with [[) are the DNA
sequences from the views:

   > views
   BSgenomeViews object with 2 views and 0 metadata columns:
         seqnames         ranges strand                       dna
            <Rle>      <IRanges>  <Rle>            <DNAStringSet>
     [1]     chr1 [25001, 28000]      * [GCTTCAGCCT...TTATTTATTG]
     [2]     chr2 [30001, 31000]      * [GACCCTCCTG...AGCAGGTGGT]
     -------
     seqinfo: 93 sequences (1 circular) from hg19 genome

   > views[[1]]
     3000-letter "DNAString" instance
   seq: 
GCTTCAGCCTGCACAGATAGGGGAGTAGGGGACAGA...CTGTCGTCTGAATTCCAAGCTTTTTGTTATTTATTG
   > views[[2]]
     1000-letter "DNAString" instance
   seq: 
GACCCTCCTGTTGGCTTTTTTACAAGACTGCTATGT...TGCTGGGACAGTCATGGGCAAACCAGAGCAGGTGGT

When subsetting with [ I get:

   > views[1]
   BSgenomeViews object with 1 view and 0 metadata columns:
         seqnames         ranges strand                       dna
            <Rle>      <IRanges>  <Rle>            <DNAStringSet>
     [1]     chr1 [25001, 28000]      * [GCTTCAGCCT...TTATTTATTG]
     -------
     seqinfo: 93 sequences (1 circular) from hg19 genome

   > views[2]
   BSgenomeViews object with 1 view and 0 metadata columns:
         seqnames         ranges strand                       dna
            <Rle>      <IRanges>  <Rle>            <DNAStringSet>
     [1]     chr2 [30001, 31000]      * [GACCCTCCTG...AGCAGGTGGT]
     -------
     seqinfo: 93 sequences (1 circular) from hg19 genome

The important difference is that with [[ I get a DNAString object
(the content of the view) and with [ I get a BSgenomeViews object
of length 1.

The semantic of lapply() is to always use [[ internally to extract
elements. This is why 'lapply(views, class)' returns "DNAString".
If you want to operate on the length-one Views objects, you need to
do:

   lapply(seq_along(views), function(i) { do something on views[i] })

Hope this helps,
H.
On 04/05/2017 08:13 PM, Pariksheet Nanda wrote:

  
    
#
I'm curious as to why you are looping over the views in the first
place. Maybe we could arrive at a vectorized solution, which is often
but not always simpler and faster.

Michael

On Wed, Apr 5, 2017 at 8:13 PM, Pariksheet Nanda
<pariksheet.nanda at uconn.edu> wrote:
6 days later
#
On Fri, Apr 7, 2017 at 1:13 AM, Herv? Pag?s <hpages at fredhutch.org> wrote:
--snip--
Thank you, Herv?!

I was failing to make the connection with the `[[` accessor.


On Fri, Apr 7, 2017 at 1:16 AM, Michael Lawrence <lawrence.michael at gene.com>
wrote:
Hi Michael!

Broad background is I'm acculturating an undergraduate student to writing a
bioconductor package and applying software engineering practices of version
control, unit testing, documenting, dependency setup and validation in a
different environment on our university HPC cluster, etc.  The student also
came along to LibrePlanet to better understand the culture of software
freedom :o)  The package goal is to use Biostrings to look for repeating
DNA sequences of a fixed kmer size and subset to portions of the genome
without repeats (an aligner can do this ofc, but the goal is to teach R and
engineering practices).

I appreciate your thoughtfulness for vectorizing the code to best use
BSgenomeViews, but please don't spend more than 10 minutes as I have to
balance changes to the code with the student's learning and coding "voice"
and may not do proper justice for more of your effort.  My slowness to
reply was getting the project further along to be more understandable.
Here was the line which I've updating as Herv? suggested to use seq_along():
https://github.com/coregenomics/kmap/blob/4adaed6b8007e8ea39f39ff57a42a821445d3d46/R/BiostringsProjectNEW.R#L185
(I'm having a hard time thinking of how to summarizing a small example out
of context).
Although in that line ranges_hits() is only operating on single indices,
ranges_hits() was written to process groups of indices to reduce
multi-processor communication.  Generating such sets of indices would
involve applying width() to the views inside mappable() to break in into
chunks of, say, a million bases for matchPDict().  Again, I'm linking to
the code for anything that stands out at you, but I will feel bad if you
spend a lot of time on it.
Pariksheet
#
You could eventually point your student to MaskedXString and
oligonucleotideFrequency(). You can mask the repeats and then just run
the latter to count the N-mers. Comparing their original code to the
code based on existing high-level utilities might be a useful
exercise.

Michael


On Wed, Apr 12, 2017 at 8:24 PM, Pariksheet Nanda
<pariksheet.nanda at uconn.edu> wrote: