Hi Michael,
I noticed that when the tallyVariants function receives a 'which' arguments
(via BamTallyParam), that contains overlapping or duplicated regions,
duplicated rows are returned.
(See below for an example.)
It took me a little while to understand where I was picking duplicates.
Would it be useful to 'reduce' the 'which' GRanges/RangesList object by
default, e.g. before tallying variants, to make sure each base is only
tallied once ?
Best,
Thomas
library(VariantTools)
## 'which' is a set of non-overlapping regions
tally.param <- TallyVariantsParam(gmapR::TP53Genome(),
high_base_quality = 23L,
which = gmapR::TP53Which())
bams <- LungCancerLines::LungCancerBamFiles()
raw.variants <- tallyVariants(bams$H1993, tally.param)
any(duplicated( raw.variants )) ## FALSE
## 'which' is a set of duplicated regions
tally.param <- TallyVariantsParam(
gmapR::TP53Genome(),
high_base_quality = 23L,
which = c(
gmapR::TP53Which(),
gmapR::TP53Which()
)
)
raw.variants <- tallyVariants(bams$H1993, tally.param)
any(duplicated( raw.variants )) ## TRUE
sort(raw.variants)[1:4]
### SessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices
[6] utils datasets methods base
other attached packages:
[1] VariantTools_1.8.0 VariantAnnotation_1.12.9
[3] Rsamtools_1.18.2 Biostrings_2.34.1
[5] XVector_0.6.0 GenomicRanges_1.18.4
[7] GenomeInfoDb_1.2.4 IRanges_2.0.1
[9] S4Vectors_0.4.0 BiocGenerics_0.12.1
[11] BiocInstaller_1.16.1 roxygen2_4.1.0
[13] devtools_1.7.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 base64enc_0.1-2
[3] BatchJobs_1.5 BBmisc_1.9
[5] Biobase_2.26.0 BiocParallel_1.0.3
[7] biomaRt_2.22.0 bitops_1.0-6
[9] brew_1.0-6 BSgenome_1.34.1
[11] checkmate_1.5.1 codetools_0.2-10
[13] DBI_0.3.1 digest_0.6.8
[15] fail_1.2 foreach_1.4.2
[17] GenomicAlignments_1.2.1 GenomicFeatures_1.18.3
[19] gmapR_1.8.0 grid_3.1.2
[21] iterators_1.0.7 lattice_0.20-29
[23] LungCancerLines_0.3.1 Matrix_1.1-5
[25] Rcpp_0.11.4 RCurl_1.95-4.5
[27] RSQLite_1.0.0 rtracklayer_1.26.2
[29] sendmailR_1.2-1 stringr_0.6.2
[31] tools_3.1.2 XML_3.98-1.1
[33] zlibbioc_1.12.0
[Bioc-devel] BamTallyParam argument 'which'
16 messages · Thomas Sandmann, Michael Lawrence, Leonard Goldstein +3 more
Hmm. I guess we could do at least two things: 1) Return the ID of the which range for each variant, like readVcf does with its paramRangeID column in the rowData. 2) Do as you suggest, and reduce() the which. Obviously, these address different use cases. The user can always achieve #2 by just reduce()ing the which range first. #1 would be more difficult for the user to implement. If tallyVariants() did #2, then #1 would be painful to achieve, so if #1 sounds useful at all, then we might want to hold off on #2. Thoughts? Michael On Fri, Feb 20, 2015 at 2:00 PM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Hi Michael,
I noticed that when the tallyVariants function receives a 'which'
arguments (via BamTallyParam), that contains overlapping or duplicated
regions, duplicated rows are returned.
(See below for an example.)
It took me a little while to understand where I was picking duplicates.
Would it be useful to 'reduce' the 'which' GRanges/RangesList object by
default, e.g. before tallying variants, to make sure each base is only
tallied once ?
Best,
Thomas
library(VariantTools)
## 'which' is a set of non-overlapping regions
tally.param <- TallyVariantsParam(gmapR::TP53Genome(),
high_base_quality = 23L,
which = gmapR::TP53Which())
bams <- LungCancerLines::LungCancerBamFiles()
raw.variants <- tallyVariants(bams$H1993, tally.param)
any(duplicated( raw.variants )) ## FALSE
## 'which' is a set of duplicated regions
tally.param <- TallyVariantsParam(
gmapR::TP53Genome(),
high_base_quality = 23L,
which = c(
gmapR::TP53Which(),
gmapR::TP53Which()
)
)
raw.variants <- tallyVariants(bams$H1993, tally.param)
any(duplicated( raw.variants )) ## TRUE
sort(raw.variants)[1:4]
### SessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices
[6] utils datasets methods base
other attached packages:
[1] VariantTools_1.8.0 VariantAnnotation_1.12.9
[3] Rsamtools_1.18.2 Biostrings_2.34.1
[5] XVector_0.6.0 GenomicRanges_1.18.4
[7] GenomeInfoDb_1.2.4 IRanges_2.0.1
[9] S4Vectors_0.4.0 BiocGenerics_0.12.1
[11] BiocInstaller_1.16.1 roxygen2_4.1.0
[13] devtools_1.7.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.1 base64enc_0.1-2
[3] BatchJobs_1.5 BBmisc_1.9
[5] Biobase_2.26.0 BiocParallel_1.0.3
[7] biomaRt_2.22.0 bitops_1.0-6
[9] brew_1.0-6 BSgenome_1.34.1
[11] checkmate_1.5.1 codetools_0.2-10
[13] DBI_0.3.1 digest_0.6.8
[15] fail_1.2 foreach_1.4.2
[17] GenomicAlignments_1.2.1 GenomicFeatures_1.18.3
[19] gmapR_1.8.0 grid_3.1.2
[21] iterators_1.0.7 lattice_0.20-29
[23] LungCancerLines_0.3.1 Matrix_1.1-5
[25] Rcpp_0.11.4 RCurl_1.95-4.5
[27] RSQLite_1.0.0 rtracklayer_1.26.2
[29] sendmailR_1.2-1 stringr_0.6.2
[31] tools_3.1.2 XML_3.98-1.1
[33] zlibbioc_1.12.0
1 day later
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties." -- Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <sandmann.thomas at gene.com
wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
We should at leaast try to avoid surprising the user. Seems like most people expect "which" to be a simple restriction, so I think for now I will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties." -- Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.? Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like most people expect "which" to be a simple restriction, so I think for now I will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties." -- Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Maybe Rsamtools would want to follow this precedent. I think there might be a difference between fishing out alignments from a SAM/BAM, and deriving a summary (tallyVariants) from a BAM. It seems like an argument could be made for a tally set to not contain duplicates. On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <
goldstein.leonard at gene.com> wrote:
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.? Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like most people expect "which" to be a simple restriction, so I think for now I
will
just reduce the which, and if someone has a use case for separate
queries,
we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <
sandmann.thomas at gene.com>
wrote:
Personally, I don't have a use case with "meaningful loci" worth
tracking,
so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts
returned
after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties." --
Sir
Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which
will
be analyzed individually? I would actually favor the first one (the
same as
yours), just because it's simpler. To keep track of the query ranges,
we
would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new
parameter,
"reduceWhich" which defaults to TRUE and reduces the which. If FALSE,
it
instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I
agree, as
it is easy for the user to 'reduce' the ranges beforehand your first
option
(e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want
to
'reduce' their ranges beforehand if they are only interested in a
single
tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
On 02/23/2015 11:05 AM, Leonard Goldstein wrote:
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.?
Thanks Leonard for pointing this out. This is indeed the reason why all
the functions in Rsamtools and GenomicAlignments that take a 'which'
argument (via a ScanBamParam object) treat it "the samtools way", that
is, as a vector of meaningful loci.
Most of them track the index of the individual loci via a "which_label"
metadata column. See for example Rsamtools::pileup() and all the
readGAlignment*() functions in the GenomicAlignments package.
FWIW the man page for the readGAlignment*() functions contains the
following note:
Note that a given record is loaded one time for each region it
belongs to (this is a scanBam() feature, readGAlignments()
is based on scanBam()).
but maybe this should be emphasized a little bit more.
Cheers,
H.
Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like most people expect "which" to be a simple restriction, so I think for now I will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties." -- Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319
Sounds very sensible not to double count in the context of tallying variants. I was more concerned with reducing which as the default behavior for scanBam and other functions. I wanted to bring up the samtools behavior as - for me at least - inconsistencies between Rsamtools and samtools have been another source of confusion in the past (e.g. different naming conventions for fields like isize vs TLEN etc.) Leonard On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
Maybe Rsamtools would want to follow this precedent. I think there might be a difference between fishing out alignments from a SAM/BAM, and deriving a summary (tallyVariants) from a BAM. It seems like an argument could be made for a tally set to not contain duplicates. On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.? Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like most people expect "which" to be a simple restriction, so I think for now I will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties." -- Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
1 day later
Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4) ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced) Here's the output:
library('GenomicAlignments')
## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+ mustWork=TRUE)
which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, param=param) gal4
GAlignments object with 2404 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end
width njunc
<Rle> <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] seq1 + 35M 35 970 1004
35 0
[2] seq1 + 35M 35 971 1005
35 0
[3] seq1 + 35M 35 972 1006
35 0
[4] seq1 + 35M 35 973 1007
35 0
[5] seq1 + 35M 35 974 1008
35 0
... ... ... ... ... ... ...
... ...
[2400] seq2 + 35M 35 1524 1558
35 0
[2401] seq2 + 35M 35 1524 1558
35 0
[2402] seq2 - 35M 35 1528 1562
35 0
[2403] seq2 - 35M 35 1532 1566
35 0
[2404] seq2 - 35M 35 1533 1567
35 0
-------
seqinfo: 2 sequences from an unspecified genome
## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4)
[1] 2404
## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups)
[1] 3014
## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced)
[1] 2343
options(width = 120) devtools::session_info()
Session info----------------------------------------------------------------------------------------------------------- setting value version R Under development (unstable) (2014-11-01 r66923) system x86_64, darwin10.8.0 ui AQUA language (EN) collate en_US.UTF-8 tz America/New_York Packages--------------------------------------------------------------------------------------------------------------- package * version date source base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0) BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0) BBmisc 1.7 2014-06-21 CRAN (R 3.2.0) BiocGenerics * 0.13.4 2014-12-31 Bioconductor BiocParallel 1.1.13 2015-01-27 Bioconductor Biostrings * 2.35.8 2015-02-14 Bioconductor bitops 1.0.6 2013-08-17 CRAN (R 3.2.0) brew 1.0.6 2011-04-13 CRAN (R 3.2.0) checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0) codetools 0.2.9 2014-08-21 CRAN (R 3.2.0) DBI 0.3.1 2014-09-24 CRAN (R 3.2.0) devtools 1.6.1 2014-10-07 CRAN (R 3.2.0) digest 0.6.4 2013-12-03 CRAN (R 3.2.0) fail 1.2 2013-09-19 CRAN (R 3.2.0) foreach 1.4.2 2014-04-11 CRAN (R 3.2.0) GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor GenomicAlignments * 1.3.27 2015-01-26 Bioconductor GenomicRanges * 1.19.37 2015-02-13 Bioconductor IRanges * 2.1.38 2015-02-08 Bioconductor iterators 1.0.7 2014-04-11 CRAN (R 3.2.0) Rsamtools * 1.19.27 2015-02-07 Bioconductor RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0) rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0) S4Vectors * 0.5.20 2015-02-19 Bioconductor sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0) stringr 0.6.2 2012-12-06 CRAN (R 3.2.0) XVector * 0.7.4 2015-02-08 Bioconductor zlibbioc 1.13.1 2015-02-11 Bioconductor
On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
<goldstein.leonard at gene.com> wrote:
Sounds very sensible not to double count in the context of tallying variants. I was more concerned with reducing which as the default behavior for scanBam and other functions. I wanted to bring up the samtools behavior as - for me at least - inconsistencies between Rsamtools and samtools have been another source of confusion in the past (e.g. different naming conventions for fields like isize vs TLEN etc.) Leonard On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Maybe Rsamtools would want to follow this precedent. I think there might be a difference between fishing out alignments from a SAM/BAM, and deriving a summary (tallyVariants) from a BAM. It seems like an argument could be made for a tally set to not contain duplicates. On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.? Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like most people expect "which" to be a simple restriction, so I think for now I will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties." -- Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they "meant to" when passing the GRanges as a which? I would argue for "the samtools way", not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground. ~G On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres <lcollado at jhu.edu> wrote:
Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4) ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced) Here's the output:
library('GenomicAlignments')
## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+ mustWork=TRUE)
which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, param=param) gal4
GAlignments object with 2404 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end
width njunc
<Rle> <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] seq1 + 35M 35 970 1004
35 0
[2] seq1 + 35M 35 971 1005
35 0
[3] seq1 + 35M 35 972 1006
35 0
[4] seq1 + 35M 35 973 1007
35 0
[5] seq1 + 35M 35 974 1008
35 0
... ... ... ... ... ... ...
... ...
[2400] seq2 + 35M 35 1524 1558
35 0
[2401] seq2 + 35M 35 1524 1558
35 0
[2402] seq2 - 35M 35 1528 1562
35 0
[2403] seq2 - 35M 35 1532 1566
35 0
[2404] seq2 - 35M 35 1533 1567
35 0
-------
seqinfo: 2 sequences from an unspecified genome
## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4)
[1] 2404
## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups)
[1] 3014
## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced)
[1] 2343
options(width = 120) devtools::session_info()
Session info----------------------------------------------------------------------------------------------------------- setting value version R Under development (unstable) (2014-11-01 r66923) system x86_64, darwin10.8.0 ui AQUA language (EN) collate en_US.UTF-8 tz America/New_York Packages--------------------------------------------------------------------------------------------------------------- package * version date source base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0) BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0) BBmisc 1.7 2014-06-21 CRAN (R 3.2.0) BiocGenerics * 0.13.4 2014-12-31 Bioconductor BiocParallel 1.1.13 2015-01-27 Bioconductor Biostrings * 2.35.8 2015-02-14 Bioconductor bitops 1.0.6 2013-08-17 CRAN (R 3.2.0) brew 1.0.6 2011-04-13 CRAN (R 3.2.0) checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0) codetools 0.2.9 2014-08-21 CRAN (R 3.2.0) DBI 0.3.1 2014-09-24 CRAN (R 3.2.0) devtools 1.6.1 2014-10-07 CRAN (R 3.2.0) digest 0.6.4 2013-12-03 CRAN (R 3.2.0) fail 1.2 2013-09-19 CRAN (R 3.2.0) foreach 1.4.2 2014-04-11 CRAN (R 3.2.0) GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor GenomicAlignments * 1.3.27 2015-01-26 Bioconductor GenomicRanges * 1.19.37 2015-02-13 Bioconductor IRanges * 2.1.38 2015-02-08 Bioconductor iterators 1.0.7 2014-04-11 CRAN (R 3.2.0) Rsamtools * 1.19.27 2015-02-07 Bioconductor RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0) rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0) S4Vectors * 0.5.20 2015-02-19 Bioconductor sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0) stringr 0.6.2 2012-12-06 CRAN (R 3.2.0) XVector * 0.7.4 2015-02-08 Bioconductor zlibbioc 1.13.1 2015-02-11 Bioconductor
On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Sounds very sensible not to double count in the context of tallying variants. I was more concerned with reducing which as the default behavior for scanBam and other functions. I wanted to bring up the samtools behavior as - for me at least - inconsistencies between Rsamtools and samtools have been another source of confusion in the past (e.g. different naming conventions for fields like isize vs TLEN etc.) Leonard On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Maybe Rsamtools would want to follow this precedent. I think there
might be
a difference between fishing out alignments from a SAM/BAM, and
deriving a
summary (tallyVariants) from a BAM. It seems like an argument could be
made
for a tally set to not contain duplicates. On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.? Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like most people expect "which" to be a simple restriction, so I think for now
I
will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter
in a
consistent way across different methods ? I just saw this recent
post
on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but
if
he will be content to begin with doubts he shall end in certainties."
--
Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii,
which
will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query
ranges,
we would need to add a new column to the returned object, which will
more
often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If
FALSE,
it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your
first
option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation
of
'which' on BamTallyParam's help page explaining that users might
want
to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Gabriel Becker, Ph.D Computational Biologist Genentech Research [[alternative HTML version deleted]]
I'm not convinced the samtools way lets the user do anything useful with tallyVariants. At least, I can't think of one. The arguments for consistency with samtools do not really hold at the summary level. Ultimately, you want a summary of each position. Having the summaries duplicated in the output does not really help, unless there is an easy way to map back to the original locii. But I'm not going to add that complexity until there is an actual use case. Right now, users are universally surprised by the repetition of the results. A parameter could be added when the need arises.
On Wed, Feb 25, 2015 at 6:37 AM, Gabe Becker <becker.gabe at gene.com> wrote:
I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they "meant to" when passing the GRanges as a which? I would argue for "the samtools way", not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground. ~G On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres <lcollado at jhu.edu
wrote:
Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4) ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced) Here's the output:
library('GenomicAlignments')
## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+ mustWork=TRUE)
which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, param=param) gal4
GAlignments object with 2404 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end
width njunc
<Rle> <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] seq1 + 35M 35 970 1004
35 0
[2] seq1 + 35M 35 971 1005
35 0
[3] seq1 + 35M 35 972 1006
35 0
[4] seq1 + 35M 35 973 1007
35 0
[5] seq1 + 35M 35 974 1008
35 0
... ... ... ... ... ... ...
... ...
[2400] seq2 + 35M 35 1524 1558
35 0
[2401] seq2 + 35M 35 1524 1558
35 0
[2402] seq2 - 35M 35 1528 1562
35 0
[2403] seq2 - 35M 35 1532 1566
35 0
[2404] seq2 - 35M 35 1533 1567
35 0
-------
seqinfo: 2 sequences from an unspecified genome
## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4)
[1] 2404
## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups)
[1] 3014
## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced)
[1] 2343
options(width = 120) devtools::session_info()
Session info----------------------------------------------------------------------------------------------------------- setting value version R Under development (unstable) (2014-11-01 r66923) system x86_64, darwin10.8.0 ui AQUA language (EN) collate en_US.UTF-8 tz America/New_York Packages--------------------------------------------------------------------------------------------------------------- package * version date source base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0) BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0) BBmisc 1.7 2014-06-21 CRAN (R 3.2.0) BiocGenerics * 0.13.4 2014-12-31 Bioconductor BiocParallel 1.1.13 2015-01-27 Bioconductor Biostrings * 2.35.8 2015-02-14 Bioconductor bitops 1.0.6 2013-08-17 CRAN (R 3.2.0) brew 1.0.6 2011-04-13 CRAN (R 3.2.0) checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0) codetools 0.2.9 2014-08-21 CRAN (R 3.2.0) DBI 0.3.1 2014-09-24 CRAN (R 3.2.0) devtools 1.6.1 2014-10-07 CRAN (R 3.2.0) digest 0.6.4 2013-12-03 CRAN (R 3.2.0) fail 1.2 2013-09-19 CRAN (R 3.2.0) foreach 1.4.2 2014-04-11 CRAN (R 3.2.0) GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor GenomicAlignments * 1.3.27 2015-01-26 Bioconductor GenomicRanges * 1.19.37 2015-02-13 Bioconductor IRanges * 2.1.38 2015-02-08 Bioconductor iterators 1.0.7 2014-04-11 CRAN (R 3.2.0) Rsamtools * 1.19.27 2015-02-07 Bioconductor RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0) rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0) S4Vectors * 0.5.20 2015-02-19 Bioconductor sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0) stringr 0.6.2 2012-12-06 CRAN (R 3.2.0) XVector * 0.7.4 2015-02-08 Bioconductor zlibbioc 1.13.1 2015-02-11 Bioconductor
On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Sounds very sensible not to double count in the context of tallying variants. I was more concerned with reducing which as the default behavior for scanBam and other functions. I wanted to bring up the samtools behavior as - for me at least - inconsistencies between Rsamtools and samtools have been another source of confusion in the past (e.g. different naming conventions for fields like isize vs TLEN etc.) Leonard On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Maybe Rsamtools would want to follow this precedent. I think there
might be
a difference between fishing out alignments from a SAM/BAM, and
deriving a
summary (tallyVariants) from a BAM. It seems like an argument could be
made
for a tally set to not contain duplicates. On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.? Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like
most
people expect "which" to be a simple restriction, so I think for
now I
will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter
in a
consistent way across different methods ? I just saw this recent
post
on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam:
--- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but
if
he will be content to begin with doubts he shall end in certainties."
--
Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii,
which
will be analyzed individually? I would actually favor the first one
(the
same as yours), just because it's simpler. To keep track of the query
ranges,
we would need to add a new column to the returned object, which will
more
often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If
FALSE,
it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups
separately
for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your
first
option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the
documentation of
'which' on BamTallyParam's help page explaining that users might
want
to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Gabriel Becker, Ph.D Computational Biologist Genentech Research
Hi,
On 02/25/2015 06:37 AM, Gabe Becker wrote:
I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they "meant to" when passing the GRanges as a which? I would argue for "the samtools way", not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground.
Maybe another good middle ground is to issue a warning when 'which'
contains overlapping ranges. The warning could suggest to the user
to reduce() the ranges first. Maybe the warning should also point
out that reducing doesn't completely eliminate the risk of selecting
the same record more than once (at least that's the case for the
readGAlignment* functions, but I don't know if it holds for
BamTallyParam). The risk is actually higher with paired-end reads
where the same pair is selected once for each range in 'which' that
overlaps with at least one of the 2 mates in the pair.
But as already discussed here (or maybe it was on the old bioconductor
list, don't remember), better solutions to the "duplicated selection"
problem could be achieved via one of the following:
(1) Expose some sort of unique id for the records in a BAM file.
I agree with Michael that "duplicated selection" is incompatible
with summarization tools like BamTallyParam or pileup(). Having
access to a unique id would completely solve the problem.
(2) Introduce an argument similar to 'which' but with a slightly
different semantic e.g. select records that *start* in one of
the specified ranges (for single-end reads), or select pairs of
records for which the *first* mate starts in one of the specified
ranges.
Advantages of this semantic:
(a) If the ranges don't overlap, then no duplicates.
(b) It can be used in the context of tiling the genome and
processing the reads by tile. This plays well with
parallelization (the semantic of 'which' does not).
Inconvenient: the arbitrary nature of the selection criteria
and its lack of symmetry is incompatible with summarization
tools like BamTallyParam or pileup().
Note that (1) and (2) are not exclusive.
Cheers,
H.
~G
On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
<lcollado at jhu.edu <mailto:lcollado at jhu.edu>> wrote:
Related to my post on a separate thread
(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.
For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:
## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.
which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
length(gal4)
## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)
## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)
Here's the output:
> library('GenomicAlignments')
>
> ## Code already included in ?readGAlignments
> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+ mustWork=TRUE)
> which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param <- ScanBamParam(which=which)
> gal4 <- readGAlignments(bamfile, param=param)
> gal4
GAlignments object with 2404 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end
width njunc
<Rle> <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] seq1 + 35M 35 970 1004
35 0
[2] seq1 + 35M 35 971 1005
35 0
[3] seq1 + 35M 35 972 1006
35 0
[4] seq1 + 35M 35 973 1007
35 0
[5] seq1 + 35M 35 974 1008
35 0
... ... ... ... ... ... ...
... ...
[2400] seq2 + 35M 35 1524 1558
35 0
[2401] seq2 + 35M 35 1524 1558
35 0
[2402] seq2 - 35M 35 1528 1562
35 0
[2403] seq2 - 35M 35 1532 1566
35 0
[2404] seq2 - 35M 35 1533 1567
35 0
-------
seqinfo: 2 sequences from an unspecified genome
>
> ## Note that if overlapping ranges are provided in 'which'
> ## reads could be selected more than once. This would artificually
> ## increase the coverage or affect other downstream results.
> ## If you 'reduce' the ranges, reads that originally overlapped
> ## two disjoint segments will be included.
>
> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param_dups <- ScanBamParam(which=which_dups)
> param_reduced <- ScanBamParam(which=reduce(which_dups))
> gal4_dups <- readGAlignments(bamfile, param=param_dups)
> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>
>
> length(gal4)
[1] 2404
>
> ## Duplicates some reads. In this case, all the ones between
> ## bases 1000 and 2000 on seq1.
> length(gal4_dups)
[1] 3014
>
> ## Includes some reads that mapped around base 1000 in seq2
> ## that were excluded in gal4.
> length(gal4_reduced)
[1] 2343
>
>
>
>
>
> options(width = 120)
> devtools::session_info()
Session
info-----------------------------------------------------------------------------------------------------------
setting value
version R Under development (unstable) (2014-11-01 r66923)
system x86_64, darwin10.8.0
ui AQUA
language (EN)
collate en_US.UTF-8
tz America/New_York
Packages---------------------------------------------------------------------------------------------------------------
package * version date source
base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0)
BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0)
BBmisc 1.7 2014-06-21 CRAN (R 3.2.0)
BiocGenerics * 0.13.4 2014-12-31 Bioconductor
BiocParallel 1.1.13 2015-01-27 Bioconductor
Biostrings * 2.35.8 2015-02-14 Bioconductor
bitops 1.0.6 2013-08-17 CRAN (R 3.2.0)
brew 1.0.6 2011-04-13 CRAN (R 3.2.0)
checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0)
codetools 0.2.9 2014-08-21 CRAN (R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
devtools 1.6.1 2014-10-07 CRAN (R 3.2.0)
digest 0.6.4 2013-12-03 CRAN (R 3.2.0)
fail 1.2 2013-09-19 CRAN (R 3.2.0)
foreach 1.4.2 2014-04-11 CRAN (R 3.2.0)
GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor
GenomicAlignments * 1.3.27 2015-01-26 Bioconductor
GenomicRanges * 1.19.37 2015-02-13 Bioconductor
IRanges * 2.1.38 2015-02-08 Bioconductor
iterators 1.0.7 2014-04-11 CRAN (R 3.2.0)
Rsamtools * 1.19.27 2015-02-07 Bioconductor
RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0)
S4Vectors * 0.5.20 2015-02-19 Bioconductor
sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0)
stringr 0.6.2 2012-12-06 CRAN (R 3.2.0)
XVector * 0.7.4 2015-02-08 Bioconductor
zlibbioc 1.13.1 2015-02-11 Bioconductor
>
On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
<goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>> wrote:
> Sounds very sensible not to double count in the context of tallying
> variants. I was more concerned with reducing which as the default
> behavior for scanBam and other functions.
>
> I wanted to bring up the samtools behavior as - for me at least -
> inconsistencies between Rsamtools and samtools have been another
> source of confusion in the past (e.g. different naming
conventions for
> fields like isize vs TLEN etc.)
>
> Leonard
>
>
> On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
>> Maybe Rsamtools would want to follow this precedent. I think
there might be
>> a difference between fishing out alignments from a SAM/BAM, and
deriving a
>> summary (tallyVariants) from a BAM. It seems like an argument
could be made
>> for a tally set to not contain duplicates.
>>
>> On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
>> <goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>>
wrote:
>>>
>>> Hi Michael and Thomas,
>>>
>>> I ran into the same problem in the past (i.e. when I started
working
>>> with functions like scanBam I expected them not to return the same
>>> alignment multiple times)
>>>
>>> One thing to consider might be that returning alignments multiple
>>> times is consistent with the behavior of the samtools view command.
>>> Quoting from the samtools manual:
>>>
>>> ?Important note: when multiple regions are given, some
alignments may
>>> be output multiple times if they overlap more than one of the
>>> specified regions.?
>>>
>>> Maybe there is an argument for keeping things consistent with
>>> samtools? As you said, if documented properly, the user can decide
>>> whether to reduce regions specified in which or not.
>>>
>>> Leonard
>>>
>>>
>>> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
>>> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
wrote:
>>> > We should at leaast try to avoid surprising the user. Seems
like most
>>> > people expect "which" to be a simple restriction, so I think
for now I
>>> > will
>>> > just reduce the which, and if someone has a use case for separate
>>> > queries,
>>> > we can address it in the future.
>>> >
>>> > On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
>>> > <sandmann.thomas at gene.com <mailto:sandmann.thomas at gene.com>>
>>> > wrote:
>>> >
>>> >> Personally, I don't have a use case with "meaningful loci" worth
>>> >> tracking,
>>> >> so keeping it simple would work for me.
>>> >>
>>> >> Incidentally, would it be good to deal with the 'which'
parameter in a
>>> >> consistent way across different methods ? I just saw this
recent post
>>> >> on
>>> >> the mailing list in which a used got confused by duplicate
counts
>>> >> returned
>>> >> after passing 'which' to scanBamParam:
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> ---
>>> >>
>>> >> Thomas Sandmann, PhD
>>> >> Computational biologist
>>> >>
>>> >> Genentech, Inc.
>>> >> 1 DNA Way
>>> >> South San Francisco, CA 94080
>>> >> USA
>>> >>
>>> >> Phone: +1 650 225 6273 <tel:%2B1%20650%20225%206273>
>>> >> Fax: +1 650 225 5389 <tel:%2B1%20650%20225%205389>
>>> >> Email: sandmann.thomas at gene.com
<mailto:sandmann.thomas at gene.com>
>>> >>
>>> >> "If a man will begin with certainties, he shall end in
doubts; but if
>>> >> he
>>> >> will be content to begin with doubts he shall end in
certainties." --
>>> >> Sir
>>> >> Francis Bacon
>>> >>
>>> >>
>>> >> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>>> >> lawrence.michael at gene.com
<mailto:lawrence.michael at gene.com>> wrote:
>>> >>
>>> >>> We just have to decide which is the more useful
interpretation of
>>> >>> which
>>> >>> -- as a simple restriction, or as a vector of meaningful
locii, which
>>> >>> will
>>> >>> be analyzed individually? I would actually favor the first
one (the
>>> >>> same as
>>> >>> yours), just because it's simpler. To keep track of the
query ranges,
>>> >>> we
>>> >>> would need to add a new column to the returned object,
which will more
>>> >>> often than not just be clutter. I guess we could introduce
a new
>>> >>> parameter,
>>> >>> "reduceWhich" which defaults to TRUE and reduces the which.
If FALSE,
>>> >>> it
>>> >>> instead adds the column mapping back to the original which
ranges.
>>> >>>
>>> >>>
>>> >>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>>> >>> sandmann.thomas at gene.com <mailto:sandmann.thomas at gene.com>>
wrote:
>>> >>>
>>> >>>> Hi Michael,
>>> >>>>
>>> >>>> ah, I see. I hadn't realized that returning the pileups
separately
>>> >>>> for
>>> >>>> each region could be a desired feature, but that makes
sense. I
>>> >>>> agree, as
>>> >>>> it is easy for the user to 'reduce' the ranges beforehand
your first
>>> >>>> option
>>> >>>> (e.g. returning the ID of the range) would be more flexible.
>>> >>>>
>>> >>>> Perhaps you would consider adding a sentence to the
documentation of
>>> >>>> 'which' on BamTallyParam's help page explaining that users
might want
>>> >>>> to
>>> >>>> 'reduce' their ranges beforehand if they are only
interested in a
>>> >>>> single
>>> >>>> tally for each base ?
>>> >>>>
>>> >>>> Thanks a lot !
>>> >>>> Thomas
>>> >>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>> > [[alternative HTML version deleted]]
>>> >
>>> > _______________________________________________
>>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
mailing list
>>
>>
>
> _______________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
mailing list
_______________________________________________
Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319
Thanks Leonardo. I'll add this to the man page. This will definitely help clarify the "duplicated selection" issue. H.
On 02/24/2015 07:40 PM, Leonardo Collado Torres wrote:
Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4) ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced) Here's the output:
library('GenomicAlignments')
## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+ mustWork=TRUE)
which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, param=param) gal4
GAlignments object with 2404 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end
width njunc
<Rle> <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] seq1 + 35M 35 970 1004
35 0
[2] seq1 + 35M 35 971 1005
35 0
[3] seq1 + 35M 35 972 1006
35 0
[4] seq1 + 35M 35 973 1007
35 0
[5] seq1 + 35M 35 974 1008
35 0
... ... ... ... ... ... ...
... ...
[2400] seq2 + 35M 35 1524 1558
35 0
[2401] seq2 + 35M 35 1524 1558
35 0
[2402] seq2 - 35M 35 1528 1562
35 0
[2403] seq2 - 35M 35 1532 1566
35 0
[2404] seq2 - 35M 35 1533 1567
35 0
-------
seqinfo: 2 sequences from an unspecified genome
## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups) param_reduced <- ScanBamParam(which=reduce(which_dups)) gal4_dups <- readGAlignments(bamfile, param=param_dups) gal4_reduced <- readGAlignments(bamfile, param=param_reduced) length(gal4)
[1] 2404
## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups)
[1] 3014
## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced)
[1] 2343
options(width = 120) devtools::session_info()
Session info----------------------------------------------------------------------------------------------------------- setting value version R Under development (unstable) (2014-11-01 r66923) system x86_64, darwin10.8.0 ui AQUA language (EN) collate en_US.UTF-8 tz America/New_York Packages--------------------------------------------------------------------------------------------------------------- package * version date source base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0) BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0) BBmisc 1.7 2014-06-21 CRAN (R 3.2.0) BiocGenerics * 0.13.4 2014-12-31 Bioconductor BiocParallel 1.1.13 2015-01-27 Bioconductor Biostrings * 2.35.8 2015-02-14 Bioconductor bitops 1.0.6 2013-08-17 CRAN (R 3.2.0) brew 1.0.6 2011-04-13 CRAN (R 3.2.0) checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0) codetools 0.2.9 2014-08-21 CRAN (R 3.2.0) DBI 0.3.1 2014-09-24 CRAN (R 3.2.0) devtools 1.6.1 2014-10-07 CRAN (R 3.2.0) digest 0.6.4 2013-12-03 CRAN (R 3.2.0) fail 1.2 2013-09-19 CRAN (R 3.2.0) foreach 1.4.2 2014-04-11 CRAN (R 3.2.0) GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor GenomicAlignments * 1.3.27 2015-01-26 Bioconductor GenomicRanges * 1.19.37 2015-02-13 Bioconductor IRanges * 2.1.38 2015-02-08 Bioconductor iterators 1.0.7 2014-04-11 CRAN (R 3.2.0) Rsamtools * 1.19.27 2015-02-07 Bioconductor RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0) rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0) S4Vectors * 0.5.20 2015-02-19 Bioconductor sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0) stringr 0.6.2 2012-12-06 CRAN (R 3.2.0) XVector * 0.7.4 2015-02-08 Bioconductor zlibbioc 1.13.1 2015-02-11 Bioconductor
On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Sounds very sensible not to double count in the context of tallying variants. I was more concerned with reducing which as the default behavior for scanBam and other functions. I wanted to bring up the samtools behavior as - for me at least - inconsistencies between Rsamtools and samtools have been another source of confusion in the past (e.g. different naming conventions for fields like isize vs TLEN etc.) Leonard On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Maybe Rsamtools would want to follow this precedent. I think there might be a difference between fishing out alignments from a SAM/BAM, and deriving a summary (tallyVariants) from a BAM. It seems like an argument could be made for a tally set to not contain duplicates. On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <goldstein.leonard at gene.com> wrote:
Hi Michael and Thomas, I ran into the same problem in the past (i.e. when I started working with functions like scanBam I expected them not to return the same alignment multiple times) One thing to consider might be that returning alignments multiple times is consistent with the behavior of the samtools view command. Quoting from the samtools manual: ?Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.? Maybe there is an argument for keeping things consistent with samtools? As you said, if documented properly, the user can decide whether to reduce regions specified in which or not. Leonard On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
We should at leaast try to avoid surprising the user. Seems like most people expect "which" to be a simple restriction, so I think for now I will just reduce the which, and if someone has a use case for separate queries, we can address it in the future. On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <sandmann.thomas at gene.com> wrote:
Personally, I don't have a use case with "meaningful loci" worth tracking, so keeping it simple would work for me. Incidentally, would it be good to deal with the 'which' parameter in a consistent way across different methods ? I just saw this recent post on the mailing list in which a used got confused by duplicate counts returned after passing 'which' to scanBamParam: https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html --- Thomas Sandmann, PhD Computational biologist Genentech, Inc. 1 DNA Way South San Francisco, CA 94080 USA Phone: +1 650 225 6273 Fax: +1 650 225 5389 Email: sandmann.thomas at gene.com "If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties." -- Sir Francis Bacon On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence < lawrence.michael at gene.com> wrote:
We just have to decide which is the more useful interpretation of which -- as a simple restriction, or as a vector of meaningful locii, which will be analyzed individually? I would actually favor the first one (the same as yours), just because it's simpler. To keep track of the query ranges, we would need to add a new column to the returned object, which will more often than not just be clutter. I guess we could introduce a new parameter, "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it instead adds the column mapping back to the original which ranges. On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann < sandmann.thomas at gene.com> wrote:
Hi Michael, ah, I see. I hadn't realized that returning the pileups separately for each region could be a desired feature, but that makes sense. I agree, as it is easy for the user to 'reduce' the ranges beforehand your first option (e.g. returning the ID of the range) would be more flexible. Perhaps you would consider adding a sentence to the documentation of 'which' on BamTallyParam's help page explaining that users might want to 'reduce' their ranges beforehand if they are only interested in a single tally for each base ? Thanks a lot ! Thomas
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319
FWIW, and after checking how pileup() handles this, I thought I should
report:
library(GenomicAlignments)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
Then:
> stackStringsFromBam(bamfile, param="seq1:1-6")
A DNAStringSet instance of length 4
width seq
[1] 6 CACTAG
[2] 6 ++CTAG
[3] 6 ++++AG
[4] 6 +++++G
> which <- GRanges("seq1", IRanges(c(1, 5, 1), c(2, 6, 3)))
> pileup(bamfile, scanBamParam=ScanBamParam(which=which))
seqnames pos strand nucleotide count which_label
1 seq1 1 + C 1 seq1:1-2
2 seq1 2 + A 1 seq1:1-2
3 seq1 5 + A 3 seq1:5-6
4 seq1 6 + G 4 seq1:5-6
5 seq1 1 + C 1 seq1:1-3
6 seq1 2 + A 1 seq1:1-3
7 seq1 3 + C 2 seq1:1-3
This output looks clean to me: (a) Nucleotide positions that belong
to more than 1 range in 'which' are treated as distinct positions,
and (b) reads that overlap with 2 non-overlapping ranges (e.g.
read #1 and ranges #1 and #2) only contribute at most once at each
position covered by these ranges.
H.
On 02/25/2015 01:21 PM, Herv? Pag?s wrote:
Hi, On 02/25/2015 06:37 AM, Gabe Becker wrote:
I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they "meant to" when passing the GRanges as a which? I would argue for "the samtools way", not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground.
Maybe another good middle ground is to issue a warning when 'which'
contains overlapping ranges. The warning could suggest to the user
to reduce() the ranges first. Maybe the warning should also point
out that reducing doesn't completely eliminate the risk of selecting
the same record more than once (at least that's the case for the
readGAlignment* functions, but I don't know if it holds for
BamTallyParam). The risk is actually higher with paired-end reads
where the same pair is selected once for each range in 'which' that
overlaps with at least one of the 2 mates in the pair.
But as already discussed here (or maybe it was on the old bioconductor
list, don't remember), better solutions to the "duplicated selection"
problem could be achieved via one of the following:
(1) Expose some sort of unique id for the records in a BAM file.
I agree with Michael that "duplicated selection" is incompatible
with summarization tools like BamTallyParam or pileup(). Having
access to a unique id would completely solve the problem.
(2) Introduce an argument similar to 'which' but with a slightly
different semantic e.g. select records that *start* in one of
the specified ranges (for single-end reads), or select pairs of
records for which the *first* mate starts in one of the specified
ranges.
Advantages of this semantic:
(a) If the ranges don't overlap, then no duplicates.
(b) It can be used in the context of tiling the genome and
processing the reads by tile. This plays well with
parallelization (the semantic of 'which' does not).
Inconvenient: the arbitrary nature of the selection criteria
and its lack of symmetry is incompatible with summarization
tools like BamTallyParam or pileup().
Note that (1) and (2) are not exclusive.
Cheers,
H.
~G
On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
<lcollado at jhu.edu <mailto:lcollado at jhu.edu>> wrote:
Related to my post on a separate thread
(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the
functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.
For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:
## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.
which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
length(gal4)
## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)
## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)
Here's the output:
> library('GenomicAlignments')
>
> ## Code already included in ?readGAlignments
> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+ mustWork=TRUE)
> which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param <- ScanBamParam(which=which)
> gal4 <- readGAlignments(bamfile, param=param)
> gal4
GAlignments object with 2404 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end
width njunc
<Rle> <Rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] seq1 + 35M 35 970 1004
35 0
[2] seq1 + 35M 35 971 1005
35 0
[3] seq1 + 35M 35 972 1006
35 0
[4] seq1 + 35M 35 973 1007
35 0
[5] seq1 + 35M 35 974 1008
35 0
... ... ... ... ... ... ...
... ...
[2400] seq2 + 35M 35 1524 1558
35 0
[2401] seq2 + 35M 35 1524 1558
35 0
[2402] seq2 - 35M 35 1528 1562
35 0
[2403] seq2 - 35M 35 1532 1566
35 0
[2404] seq2 - 35M 35 1533 1567
35 0
-------
seqinfo: 2 sequences from an unspecified genome
>
> ## Note that if overlapping ranges are provided in 'which'
> ## reads could be selected more than once. This would artificually
> ## increase the coverage or affect other downstream results.
> ## If you 'reduce' the ranges, reads that originally overlapped
> ## two disjoint segments will be included.
>
> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param_dups <- ScanBamParam(which=which_dups)
> param_reduced <- ScanBamParam(which=reduce(which_dups))
> gal4_dups <- readGAlignments(bamfile, param=param_dups)
> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>
>
> length(gal4)
[1] 2404
>
> ## Duplicates some reads. In this case, all the ones between
> ## bases 1000 and 2000 on seq1.
> length(gal4_dups)
[1] 3014
>
> ## Includes some reads that mapped around base 1000 in seq2
> ## that were excluded in gal4.
> length(gal4_reduced)
[1] 2343
>
>
>
>
>
> options(width = 120)
> devtools::session_info()
Session
info-----------------------------------------------------------------------------------------------------------
setting value
version R Under development (unstable) (2014-11-01 r66923)
system x86_64, darwin10.8.0
ui AQUA
language (EN)
collate en_US.UTF-8
tz America/New_York
Packages---------------------------------------------------------------------------------------------------------------
package * version date source
base64enc 0.1.2 2014-06-26 CRAN (R 3.2.0)
BatchJobs 1.4 2014-09-24 CRAN (R 3.2.0)
BBmisc 1.7 2014-06-21 CRAN (R 3.2.0)
BiocGenerics * 0.13.4 2014-12-31 Bioconductor
BiocParallel 1.1.13 2015-01-27 Bioconductor
Biostrings * 2.35.8 2015-02-14 Bioconductor
bitops 1.0.6 2013-08-17 CRAN (R 3.2.0)
brew 1.0.6 2011-04-13 CRAN (R 3.2.0)
checkmate 1.5.0 2014-10-19 CRAN (R 3.2.0)
codetools 0.2.9 2014-08-21 CRAN (R 3.2.0)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
devtools 1.6.1 2014-10-07 CRAN (R 3.2.0)
digest 0.6.4 2013-12-03 CRAN (R 3.2.0)
fail 1.2 2013-09-19 CRAN (R 3.2.0)
foreach 1.4.2 2014-04-11 CRAN (R 3.2.0)
GenomeInfoDb * 1.3.13 2015-02-13 Bioconductor
GenomicAlignments * 1.3.27 2015-01-26 Bioconductor
GenomicRanges * 1.19.37 2015-02-13 Bioconductor
IRanges * 2.1.38 2015-02-08 Bioconductor
iterators 1.0.7 2014-04-11 CRAN (R 3.2.0)
Rsamtools * 1.19.27 2015-02-07 Bioconductor
RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
rstudioapi 0.1 2014-03-27 CRAN (R 3.2.0)
S4Vectors * 0.5.20 2015-02-19 Bioconductor
sendmailR 1.2.1 2014-09-21 CRAN (R 3.2.0)
stringr 0.6.2 2012-12-06 CRAN (R 3.2.0)
XVector * 0.7.4 2015-02-08 Bioconductor
zlibbioc 1.13.1 2015-02-11 Bioconductor
>
On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
<goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>>
wrote:
> Sounds very sensible not to double count in the context of
tallying
> variants. I was more concerned with reducing which as the default
> behavior for scanBam and other functions.
>
> I wanted to bring up the samtools behavior as - for me at least -
> inconsistencies between Rsamtools and samtools have been another
> source of confusion in the past (e.g. different naming
conventions for
> fields like isize vs TLEN etc.)
>
> Leonard
>
>
> On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
wrote:
>> Maybe Rsamtools would want to follow this precedent. I think
there might be
>> a difference between fishing out alignments from a SAM/BAM, and
deriving a
>> summary (tallyVariants) from a BAM. It seems like an argument
could be made
>> for a tally set to not contain duplicates.
>>
>> On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
>> <goldstein.leonard at gene.com <mailto:goldstein.leonard at gene.com>>
wrote:
>>>
>>> Hi Michael and Thomas,
>>>
>>> I ran into the same problem in the past (i.e. when I started
working
>>> with functions like scanBam I expected them not to return the
same
>>> alignment multiple times)
>>>
>>> One thing to consider might be that returning alignments
multiple
>>> times is consistent with the behavior of the samtools view
command.
>>> Quoting from the samtools manual:
>>>
>>> ?Important note: when multiple regions are given, some
alignments may
>>> be output multiple times if they overlap more than one of the
>>> specified regions.?
>>>
>>> Maybe there is an argument for keeping things consistent with
>>> samtools? As you said, if documented properly, the user can
decide
>>> whether to reduce regions specified in which or not.
>>>
>>> Leonard
>>>
>>>
>>> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
>>> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
wrote:
>>> > We should at leaast try to avoid surprising the user. Seems
like most
>>> > people expect "which" to be a simple restriction, so I think
for now I
>>> > will
>>> > just reduce the which, and if someone has a use case for
separate
>>> > queries,
>>> > we can address it in the future.
>>> >
>>> > On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
>>> > <sandmann.thomas at gene.com <mailto:sandmann.thomas at gene.com>>
>>> > wrote:
>>> >
>>> >> Personally, I don't have a use case with "meaningful loci"
worth
>>> >> tracking,
>>> >> so keeping it simple would work for me.
>>> >>
>>> >> Incidentally, would it be good to deal with the 'which'
parameter in a
>>> >> consistent way across different methods ? I just saw this
recent post
>>> >> on
>>> >> the mailing list in which a used got confused by duplicate
counts
>>> >> returned
>>> >> after passing 'which' to scanBamParam:
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> ---
>>> >>
>>> >> Thomas Sandmann, PhD
>>> >> Computational biologist
>>> >>
>>> >> Genentech, Inc.
>>> >> 1 DNA Way
>>> >> South San Francisco, CA 94080
>>> >> USA
>>> >>
>>> >> Phone: +1 650 225 6273 <tel:%2B1%20650%20225%206273>
>>> >> Fax: +1 650 225 5389 <tel:%2B1%20650%20225%205389>
>>> >> Email: sandmann.thomas at gene.com
<mailto:sandmann.thomas at gene.com>
>>> >>
>>> >> "If a man will begin with certainties, he shall end in
doubts; but if
>>> >> he
>>> >> will be content to begin with doubts he shall end in
certainties." --
>>> >> Sir
>>> >> Francis Bacon
>>> >>
>>> >>
>>> >> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>>> >> lawrence.michael at gene.com
<mailto:lawrence.michael at gene.com>> wrote:
>>> >>
>>> >>> We just have to decide which is the more useful
interpretation of
>>> >>> which
>>> >>> -- as a simple restriction, or as a vector of meaningful
locii, which
>>> >>> will
>>> >>> be analyzed individually? I would actually favor the first
one (the
>>> >>> same as
>>> >>> yours), just because it's simpler. To keep track of the
query ranges,
>>> >>> we
>>> >>> would need to add a new column to the returned object,
which will more
>>> >>> often than not just be clutter. I guess we could introduce
a new
>>> >>> parameter,
>>> >>> "reduceWhich" which defaults to TRUE and reduces the which.
If FALSE,
>>> >>> it
>>> >>> instead adds the column mapping back to the original which
ranges.
>>> >>>
>>> >>>
>>> >>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>>> >>> sandmann.thomas at gene.com <mailto:sandmann.thomas at gene.com>>
wrote:
>>> >>>
>>> >>>> Hi Michael,
>>> >>>>
>>> >>>> ah, I see. I hadn't realized that returning the pileups
separately
>>> >>>> for
>>> >>>> each region could be a desired feature, but that makes
sense. I
>>> >>>> agree, as
>>> >>>> it is easy for the user to 'reduce' the ranges beforehand
your first
>>> >>>> option
>>> >>>> (e.g. returning the ID of the range) would be more
flexible.
>>> >>>>
>>> >>>> Perhaps you would consider adding a sentence to the
documentation of
>>> >>>> 'which' on BamTallyParam's help page explaining that users
might want
>>> >>>> to
>>> >>>> 'reduce' their ranges beforehand if they are only
interested in a
>>> >>>> single
>>> >>>> tally for each base ?
>>> >>>>
>>> >>>> Thanks a lot !
>>> >>>> Thomas
>>> >>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>> > [[alternative HTML version deleted]]
>>> >
>>> > _______________________________________________
>>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
mailing list
>>
>>
>
> _______________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
mailing list
_______________________________________________
Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319