Skip to content

[Bioc-devel] BamTallyParam argument 'which'

16 messages · Thomas Sandmann, Michael Lawrence, Leonard Goldstein +3 more

#
Hi Michael,

I noticed that when the tallyVariants function receives a 'which' arguments
(via BamTallyParam), that contains overlapping or duplicated regions,
duplicated rows are returned.

(See below for an example.)

It took me a little while to understand where I was picking duplicates.

Would it be useful to 'reduce' the 'which' GRanges/RangesList object by
default, e.g. before tallying variants, to make sure each base is only
tallied once ?

Best,
Thomas

library(VariantTools)

## 'which' is a set of non-overlapping regions
tally.param <- TallyVariantsParam(gmapR::TP53Genome(),
                                  high_base_quality = 23L,
                                  which = gmapR::TP53Which())
bams <- LungCancerLines::LungCancerBamFiles()
raw.variants <- tallyVariants(bams$H1993, tally.param)
any(duplicated( raw.variants ))  ## FALSE

## 'which' is a set of duplicated regions
tally.param <- TallyVariantsParam(
  gmapR::TP53Genome(),
  high_base_quality = 23L,
  which = c(
    gmapR::TP53Which(),
    gmapR::TP53Which()
    )
)
raw.variants <- tallyVariants(bams$H1993, tally.param)
any(duplicated( raw.variants )) ## TRUE

sort(raw.variants)[1:4]


### SessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices
[6] utils     datasets  methods   base

other attached packages:
 [1] VariantTools_1.8.0       VariantAnnotation_1.12.9
 [3] Rsamtools_1.18.2         Biostrings_2.34.1
 [5] XVector_0.6.0            GenomicRanges_1.18.4
 [7] GenomeInfoDb_1.2.4       IRanges_2.0.1
 [9] S4Vectors_0.4.0          BiocGenerics_0.12.1
[11] BiocInstaller_1.16.1     roxygen2_4.1.0
[13] devtools_1.7.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.1    base64enc_0.1-2
 [3] BatchJobs_1.5           BBmisc_1.9
 [5] Biobase_2.26.0          BiocParallel_1.0.3
 [7] biomaRt_2.22.0          bitops_1.0-6
 [9] brew_1.0-6              BSgenome_1.34.1
[11] checkmate_1.5.1         codetools_0.2-10
[13] DBI_0.3.1               digest_0.6.8
[15] fail_1.2                foreach_1.4.2
[17] GenomicAlignments_1.2.1 GenomicFeatures_1.18.3
[19] gmapR_1.8.0             grid_3.1.2
[21] iterators_1.0.7         lattice_0.20-29
[23] LungCancerLines_0.3.1   Matrix_1.1-5
[25] Rcpp_0.11.4             RCurl_1.95-4.5
[27] RSQLite_1.0.0           rtracklayer_1.26.2
[29] sendmailR_1.2-1         stringr_0.6.2
[31] tools_3.1.2             XML_3.98-1.1
[33] zlibbioc_1.12.0
#
Hmm. I guess we could do at least two things:

1) Return the ID of the which range for each variant, like readVcf does
with its paramRangeID column in the rowData.

2) Do as you suggest, and reduce() the which.

Obviously, these address different use cases. The user can always achieve
#2 by just reduce()ing the which range first. #1 would be more difficult
for the user to implement.

If tallyVariants() did #2, then #1 would be painful to achieve, so if #1
sounds useful at all, then we might want to hold off on #2.

Thoughts?

Michael




On Fri, Feb 20, 2015 at 2:00 PM, Thomas Sandmann <sandmann.thomas at gene.com>
wrote:

  
  
1 day later
#
Hi Michael,

ah, I see. I hadn't realized that returning the pileups separately for each
region could be a desired feature, but that makes sense. I agree, as it is
easy for the user to 'reduce' the ranges beforehand your first option (e.g.
returning the ID of the range) would be more flexible.

Perhaps you would consider adding a sentence to the documentation of
'which' on BamTallyParam's help page explaining that users might want to
'reduce' their ranges beforehand if they are only interested in a single
tally for each base ?

Thanks a lot !
Thomas
#
We just have to decide which is the more useful interpretation of which --
as a simple restriction, or as a vector of meaningful locii, which will be
analyzed individually? I would actually favor the first one (the same as
yours), just because it's simpler. To keep track of the query ranges, we
would need to add a new column to the returned object, which will more
often than not just be clutter. I guess we could introduce a new parameter,
"reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it
instead adds the column mapping back to the original which ranges.


On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <sandmann.thomas at gene.com>
wrote:

  
  
#
Personally, I don't have a use case with "meaningful loci" worth tracking,
so keeping it simple would work for me.

Incidentally, would it be good to deal with the 'which' parameter in a
consistent way across different methods ? I just saw this recent post on
the mailing list in which a used got confused by duplicate counts returned
after passing 'which' to scanBamParam:

https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html


---

Thomas Sandmann, PhD
Computational biologist

Genentech, Inc.
1 DNA Way
South San Francisco, CA 94080
USA

Phone: +1 650 225 6273
Fax: +1 650 225 5389
Email: sandmann.thomas at gene.com

"If a man will begin with certainties, he shall end in doubts; but if he
will be content to begin with doubts he shall end in certainties." -- Sir
Francis Bacon


On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
lawrence.michael at gene.com> wrote:

            

  
  
#
We should at leaast try to avoid surprising the user. Seems like most
people expect "which" to be a simple restriction, so I think for now I will
just reduce the which, and if someone has a use case for separate queries,
we can address it in the future.

On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com>
wrote:

  
  
#
Hi Michael and Thomas,

I ran into the same problem in the past (i.e. when I started working
with functions like scanBam I expected them not to return the same
alignment multiple times)

One thing to consider might be that returning alignments multiple
times is consistent with the behavior of the samtools view command.
Quoting from the samtools manual:

?Important note: when multiple regions are given, some alignments may
be output multiple times if they overlap more than one of the
specified regions.?

Maybe there is an argument for keeping things consistent with
samtools? As you said, if documented properly, the user can decide
whether to reduce regions specified in which or not.

Leonard


On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
#
Maybe Rsamtools would want to follow this precedent. I think there might be
a difference between fishing out alignments from a SAM/BAM, and deriving a
summary (tallyVariants) from a BAM. It seems like an argument could be made
for a tally set to not contain duplicates.

On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <
goldstein.leonard at gene.com> wrote:

            

  
  
#
On 02/23/2015 11:05 AM, Leonard Goldstein wrote:
Thanks Leonard for pointing this out. This is indeed the reason why all
the functions in Rsamtools and GenomicAlignments that take a 'which'
argument (via a ScanBamParam object) treat it "the samtools way", that
is, as a vector of meaningful loci.

Most of them track the index of the individual loci via a "which_label"
metadata column. See for example Rsamtools::pileup() and all the
readGAlignment*() functions in the GenomicAlignments package.
FWIW the man page for the readGAlignment*() functions contains the
following note:

      Note that a given record is loaded one time for each region it
      belongs to (this is a scanBam() feature, readGAlignments()
      is based on scanBam()).

but maybe this should be emphasized a little bit more.

Cheers,
H.

  
    
#
Sounds very sensible not to double count in the context of tallying
variants. I was more concerned with reducing which as the default
behavior for scanBam and other functions.

I wanted to bring up the samtools behavior as - for me at least -
inconsistencies between Rsamtools and samtools have been another
source of confusion in the past (e.g. different naming conventions for
fields like isize vs TLEN etc.)

Leonard


On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
1 day later
#
Related to my post on a separate thread
(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.

For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
    seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)


length(gal4)

## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)








Here's the output:
+                        mustWork=TRUE)
+                     seq2=IRanges(c(100, 1000), c(1000, 2000)))
GAlignments object with 2404 alignments and 0 metadata columns:
         seqnames strand       cigar    qwidth     start       end
width     njunc
            <Rle>  <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
     [1]     seq1      +         35M        35       970      1004
   35         0
     [2]     seq1      +         35M        35       971      1005
   35         0
     [3]     seq1      +         35M        35       972      1006
   35         0
     [4]     seq1      +         35M        35       973      1007
   35         0
     [5]     seq1      +         35M        35       974      1008
   35         0
     ...      ...    ...         ...       ...       ...       ...
  ...       ...
  [2400]     seq2      +         35M        35      1524      1558
   35         0
  [2401]     seq2      +         35M        35      1524      1558
   35         0
  [2402]     seq2      -         35M        35      1528      1562
   35         0
  [2403]     seq2      -         35M        35      1532      1566
   35         0
  [2404]     seq2      -         35M        35      1533      1567
   35         0
  -------
  seqinfo: 2 sequences from an unspecified genome
+     seq2=IRanges(c(100, 1000), c(1000, 2000)))
[1] 2404
[1] 3014
[1] 2343
Session info-----------------------------------------------------------------------------------------------------------
 setting  value
 version  R Under development (unstable) (2014-11-01 r66923)
 system   x86_64, darwin10.8.0
 ui       AQUA
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York

Packages---------------------------------------------------------------------------------------------------------------
 package           * version date       source
 base64enc           0.1.2   2014-06-26 CRAN (R 3.2.0)
 BatchJobs           1.4     2014-09-24 CRAN (R 3.2.0)
 BBmisc              1.7     2014-06-21 CRAN (R 3.2.0)
 BiocGenerics      * 0.13.4  2014-12-31 Bioconductor
 BiocParallel        1.1.13  2015-01-27 Bioconductor
 Biostrings        * 2.35.8  2015-02-14 Bioconductor
 bitops              1.0.6   2013-08-17 CRAN (R 3.2.0)
 brew                1.0.6   2011-04-13 CRAN (R 3.2.0)
 checkmate           1.5.0   2014-10-19 CRAN (R 3.2.0)
 codetools           0.2.9   2014-08-21 CRAN (R 3.2.0)
 DBI                 0.3.1   2014-09-24 CRAN (R 3.2.0)
 devtools            1.6.1   2014-10-07 CRAN (R 3.2.0)
 digest              0.6.4   2013-12-03 CRAN (R 3.2.0)
 fail                1.2     2013-09-19 CRAN (R 3.2.0)
 foreach             1.4.2   2014-04-11 CRAN (R 3.2.0)
 GenomeInfoDb      * 1.3.13  2015-02-13 Bioconductor
 GenomicAlignments * 1.3.27  2015-01-26 Bioconductor
 GenomicRanges     * 1.19.37 2015-02-13 Bioconductor
 IRanges           * 2.1.38  2015-02-08 Bioconductor
 iterators           1.0.7   2014-04-11 CRAN (R 3.2.0)
 Rsamtools         * 1.19.27 2015-02-07 Bioconductor
 RSQLite             1.0.0   2014-10-25 CRAN (R 3.2.0)
 rstudioapi          0.1     2014-03-27 CRAN (R 3.2.0)
 S4Vectors         * 0.5.20  2015-02-19 Bioconductor
 sendmailR           1.2.1   2014-09-21 CRAN (R 3.2.0)
 stringr             0.6.2   2012-12-06 CRAN (R 3.2.0)
 XVector           * 0.7.4   2015-02-08 Bioconductor
 zlibbioc            1.13.1  2015-02-11 Bioconductor
On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
<goldstein.leonard at gene.com> wrote:
#
I think we need to be a little careful of trying to know the users
intentions better than they do here.  Reduce is a (very) easy operation to
do on a GRanges, so if the user didn't, are we really safe assuming they
"meant to" when passing the GRanges as a which?

I would argue for "the samtools way", not because samtools does it (though
consistency is good) but because it allows the user to do more things,
while not making it that painful to do the thing that they might want most
often.

I agree with Michael that an additional argument might be a good middle
ground.

~G

On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres <lcollado at jhu.edu>
wrote:

  
    
#
I'm not convinced the samtools way lets the user do anything useful with
tallyVariants. At least, I can't think of one. The arguments for
consistency with samtools do not really hold at the summary level.
Ultimately, you want a summary of each position. Having the summaries
duplicated in the output does not really help, unless there is an easy way
to map back to the original locii. But I'm not going to add that complexity
until there is an actual use case. Right now, users are universally
surprised by the repetition of the results. A parameter could be added when
the need arises.
On Wed, Feb 25, 2015 at 6:37 AM, Gabe Becker <becker.gabe at gene.com> wrote:

            

  
  
#
Hi,
On 02/25/2015 06:37 AM, Gabe Becker wrote:
Maybe another good middle ground is to issue a warning when 'which'
contains overlapping ranges. The warning could suggest to the user
to reduce() the ranges first. Maybe the warning should also point
out that reducing doesn't completely eliminate the risk of selecting
the same record more than once (at least that's the case for the
readGAlignment* functions, but I don't know if it holds for
BamTallyParam). The risk is actually higher with paired-end reads
where the same pair is selected once for each range in 'which' that
overlaps with at least one of the 2 mates in the pair.

But as already discussed here (or maybe it was on the old bioconductor
list, don't remember), better solutions to the "duplicated selection"
problem could be achieved via one of the following:

   (1) Expose some sort of unique id for the records in a BAM file.
       I agree with Michael that "duplicated selection" is incompatible
       with summarization tools like BamTallyParam or pileup(). Having
       access to a unique id would completely solve the problem.

   (2) Introduce an argument similar to 'which' but with a slightly
       different semantic e.g. select records that *start* in one of
       the specified ranges (for single-end reads), or select pairs of
       records for which the *first* mate starts in one of the specified
       ranges.
       Advantages of this semantic:

         (a) If the ranges don't overlap, then no duplicates.

         (b) It can be used in the context of tiling the genome and
             processing the reads by tile. This plays well with
             parallelization (the semantic of 'which' does not).

       Inconvenient: the arbitrary nature of the selection criteria
       and its lack of symmetry is incompatible with summarization
       tools like BamTallyParam or pileup().

Note that (1) and (2) are not exclusive.

Cheers,
H.

  
    
#
Thanks Leonardo. I'll add this to the man page. This will definitely
help clarify the "duplicated selection" issue.

H.
On 02/24/2015 07:40 PM, Leonardo Collado Torres wrote:

  
    
#
FWIW, and after checking how pileup() handles this, I thought I should
report:

   library(GenomicAlignments)
   bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")

Then:

   > stackStringsFromBam(bamfile, param="seq1:1-6")
     A DNAStringSet instance of length 4
       width seq
   [1]     6 CACTAG
   [2]     6 ++CTAG
   [3]     6 ++++AG
   [4]     6 +++++G

   > which <- GRanges("seq1", IRanges(c(1, 5, 1), c(2, 6, 3)))
   > pileup(bamfile, scanBamParam=ScanBamParam(which=which))
     seqnames pos strand nucleotide count which_label
   1     seq1   1      +          C     1    seq1:1-2
   2     seq1   2      +          A     1    seq1:1-2
   3     seq1   5      +          A     3    seq1:5-6
   4     seq1   6      +          G     4    seq1:5-6
   5     seq1   1      +          C     1    seq1:1-3
   6     seq1   2      +          A     1    seq1:1-3
   7     seq1   3      +          C     2    seq1:1-3

This output looks clean to me: (a) Nucleotide positions that belong
to more than 1 range in 'which' are treated as distinct positions,
and (b) reads that overlap with 2 non-overlapping ranges (e.g.
read #1 and ranges #1 and #2) only contribute at most once at each
position covered by these ranges.

H.
On 02/25/2015 01:21 PM, Herv? Pag?s wrote: