[Bioc-devel] BamTallyParam argument 'which'
I'm not convinced the samtools way lets the user do anything useful with tallyVariants. At least, I can't think of one. The arguments for consistency with samtools do not really hold at the summary level. Ultimately, you want a summary of each position. Having the summaries duplicated in the output does not really help, unless there is an easy way to map back to the original locii. But I'm not going to add that complexity until there is an actual use case. Right now, users are universally surprised by the repetition of the results. A parameter could be added when the need arises.
On Wed, Feb 25, 2015 at 6:37 AM, Gabe Becker <becker.gabe at gene.com> wrote:
I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they "meant to" when passing the GRanges as a which? I would argue for "the samtools way", not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground. ~G On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres <lcollado at jhu.edu
wrote:
Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4) ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced) Here's the output:
library('GenomicAlignments')
## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+ mustWork=TRUE)
which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, param=param) gal4
GAlignments object with 2404 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end
width njunc
<Rle> <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] seq1 + 35M 35 970 1004
35 0
[2] seq1 + 35M 35 971 1005
35 0
[3] seq1 + 35M 35 972 1006
35 0
[4] seq1 + 35M 35 973 1007
35 0
[5] seq1 + 35M 35 974 1008
35 0
... ... ... ... ... ... ...
... ...
[2400] seq2 + 35M 35 1524 1558
35 0
[2401] seq2 + 35M 35 1524 1558
35 0
[2402] seq2 - 35M 35 1528 1562
35 0
[2403] seq2 - 35M 35 1532 1566
35 0
[2404] seq2 - 35M 35 1533 1567
35 0
-------
seqinfo: 2 sequences from an unspecified genome
## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4)
[1] 2404
## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups)
[1] 3014
## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced)
[1] 2343
options(width = 120) devtools::session_info()
Session info----------------------------------------------------------------------------------------------------------- setting value version R Under development (unstable) (2014-11-01 r66923) system x86_64, darwin10.8.0 ui AQUA language (EN) collate en_US.UTF-8 tz America/New_York Packages--------------------------------------------------------------------------------------------------------------- package * version date source base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0) BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0) BBmisc 1.7 2014-06-21 CRAN (R 3.2.0) BiocGenerics * 0.13.4 2014-12-31 Bioconductor BiocParallel 1.1.13 2015-01-27 Bioconductor Biostrings * 2.35.8 2015-02-14 Bioconductor bitops 1.0.6 2013-08-17 CRAN (R 3.2.0) brew 1.0.6 2011-04-13 CRAN (R 3.2.0) checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0) codetools 0.2.9 2014-08-21 CRAN (R 3.2.0) DBI 0.3.1 2014-09-24 CRAN (R 3.2.0) devtools 1.6.1 2014-10-07 CRAN (R 3.2.0) digest 0.6.4 2013-12-03 CRAN (R 3.2.0) fail 1.2 2013-09-19 CRAN (R 3.2.0) foreach 1.4.2 2014-04-11 CRAN (R 3.2.0) GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor GenomicAlignments * 1.3.27 2015-01-26 Bioconductor GenomicRanges * 1.19.37 2015-02-13 Bioconductor IRanges * 2.1.38 2015-02-08 Bioconductor iterators 1.0.7 2014-04-11 CRAN (R 3.2.0) Rsamtools * 1.19.27 2015-02-07 Bioconductor RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0) rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0) S4Vectors * 0.5.20 2015-02-19 Bioconductor sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0) stringr 0.6.2 2012-12-06 CRAN (R 3.2.0) XVector * 0.7.4 2015-02-08 Bioconductor zlibbioc 1.13.1 2015-02-11 Bioconductor
On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Sounds very sensible not to double count in the context of tallying variants. I was more concerned with reducing which as the default behavior for scanBam and other functions. I wanted to bring up the samtools behavior as - for me at least - inconsistencies between Rsamtools and samtools have been another source of confusion in the past (e.g. different naming conventions for fields like isize vs TLEN etc.) Leonard On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Maybe Rsamtools would want to follow this precedent. I think there
might be
a difference between fishing out alignments from a SAM/BAM, and
deriving a
summary (tallyVariants) from a BAM. It seems like an argument could be
made
for a tally set to not contain duplicates. On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.? Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like
most
people expect "which" to be a simple restriction, so I think for
now I
will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter
in a
consistent way across different methods ? I just saw this recent
post
on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam:
--- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but
if
he will be content to begin with doubts he shall end in certainties."
--
Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii,
which
will be analyzed individually? I would actually favor the first one
(the
same as yours), just because it's simpler. To keep track of the query
ranges,
we would need to add a new column to the returned object, which will
more
often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If
FALSE,
it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups
separately
for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your
first
option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the
documentation of
'which' on BamTallyParam's help page explaining that users might
want
to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Gabriel Becker, Ph.D Computational Biologist Genentech Research