Hi Martin, Marc,
I'm now implementing the use of BamFile objects in easyRNASeq and I like them. I think it would be very useful if the existence of the path and index could be tested when constructing a BamFile; for example, BamFile("test.bam","test.bam.bai") works although these files do not exist. Is there a reason that this validation is not done? If there is, could a validation parameter be added (set to FALSE by default to keep the current behavior) that would check for the files' existence?
The same goes for the yieldSize argument: BamFile("test.bam","test.bam.bai",yieldSize=-1) also works, although I'm not sure what a yieldSize of -1 means. I can of course do these validations within easyRNASeq, but anyone else building packages on top of BamFile would probably want to do the same...
A related point that is currently unclear in the documentation is what the index filename should be: scanBam expects the index to have the same value as the bam filename (which assumes the user has not renamed the bam.bai file, and you never know what users might be doing... :-S) but the BamFile Rd page says:
file: A character vector of BAM file paths
index: A character vector of indices (for BamFile);
so it's unclear to me what the index character vector should contain.
Thanks again for this set of classes, they're really handy!
Here's my sessionInfo:
R Under development (unstable) (2012-10-02 r60861)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Rsamtools_1.11.14 Biostrings_2.27.8 GenomicRanges_1.11.21
[4] IRanges_1.17.24 BiocGenerics_0.5.6 BiocInstaller_1.9.6
loaded via a namespace (and not attached):
[1] bitops_1.0-5 stats4_2.16.0 tools_2.16.0 zlibbioc_1.5.0
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
[Bioc-devel] BamFile validation
7 messages · Nicolas Delhomme, Henrik Bengtsson, Ryan +2 more
On Mon, Jan 7, 2013 at 12:32 PM, Nicolas Delhomme <delhomme at embl.de> wrote:
Hi Martin, Marc,
I'm now implementing the use of BamFile objects in easyRNASeq and I like them. I think it would be very useful if when constructing a BamFile the existence of the path and index could be tested; i.e. this works: BamFile("test.bam","test.bam.bai") although these files do not exist. Is there a reason that this validation is not done? If there is, could a validation parameter be added (set to FALSE by default to keep the current behavior) that would check for the files' existence?
Good idea - I propose argument 'mustExist'. My $0.02 /Henrik
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
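For readers following along, the existence check proposed above could be sketched in a few lines of base R. This is only an illustration under stated assumptions: checkBamPaths() is a hypothetical helper, its mustExist argument simply borrows the name Henrik suggested, and neither is part of Rsamtools. file.exists() only covers local paths, and the index is accepted with or without a '.bai' extension.

```r
# Hypothetical helper (not part of Rsamtools): check local paths before
# passing them on to a constructor such as BamFile().
checkBamPaths <- function(file, index = file, mustExist = FALSE) {
    stopifnot(is.character(file), is.character(index),
              length(file) == length(index))
    if (mustExist) {
        # Accept the index given with or without a '.bai' extension.
        indexPath <- ifelse(grepl("\\.bai$", index), index,
                            paste0(index, ".bai"))
        paths <- c(file, indexPath)
        # file.exists() is a local check only; remote files are not handled.
        missingPaths <- paths[!file.exists(paths)]
        if (length(missingPaths) > 0L)
            stop("file(s) do not exist: ",
                 paste(sQuote(missingPaths), collapse = ", "))
    }
    invisible(list(file = file, index = index))
}
```

With mustExist left at FALSE the helper does no file system access at all, which keeps the current behavior as the default.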
Just to clarify. I don't mean it needs to validate the BAM file (i.e. checking that it's properly formatted), so using file.exists on the provided file paths would be sufficient.
On 8 Jan 2013, at 02:47, Ryan Thompson wrote:
Couldn't one test for existence by trying to open the BamFile object, and possibly read one sequence (or maybe just read the header since I guess a valid bam file can contain zero sequences)?
Hi there,
FWIW system.file() has the 'mustWork' arg for this. Strange name though, since the man page suggests it only tests for existence, not that the file can actually be opened for reading.
H.
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
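As a small illustration of the 'mustWork' argument of system.file() mentioned in this message, the following base R snippet shows the default behavior next to mustWork = TRUE. The file name "no-such-file" is made up for the example, and the exact wording of the error message is not assumed here.

```r
# An existing file in an installed package: the full path is returned.
found <- system.file("DESCRIPTION", package = "base")

# Default (mustWork = FALSE): no match gives an empty string, not an error.
notFound <- system.file("no-such-file", package = "base")

# mustWork = TRUE: no match is signalled as an error instead.
result <- tryCatch(
    system.file("no-such-file", package = "base", mustWork = TRUE),
    error = function(e) conditionMessage(e)
)
```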
On 01/07/2013 12:32 PM, Nicolas Delhomme wrote:
Hi Martin, Marc,
I'm now implementing the use of BamFile objects in easyRNASeq and I like
them. I think it would be very useful if when constructing a BamFile the
existence of the path and index could be tested; i.e. this works:
BamFile("test.bam","test.bam.bai") although these files do not exist. Is
there a reason that this validation is not done? If there is, could a
validation parameter be added (set to FALSE by default to keep the current
behavior) that would check for the files' existence? The same goes for the
yieldSize argument, i.e. this works
BamFile("test.bam","test.bam.bai",yieldSize=-1), although I'm not sure what a
-1 yieldSize means. I can of course do these validations within easyRNASeq,
but anyone else building packages on top of BamFile would probably want to do
the same...
I want to be able to specify a BAM file without opening it, and then open it
later, e.g., in mclapply or after distributing to a cluster. Also, conceptually,
I want to distinguish between processing an entire BAM file -- provide me with
something for which isOpen(BamFile("foo")) == FALSE -- versus reading a chunk of
a BamFile, i.e., already open. So I separated BamFile creation from open().
I focus on open() in the above because opening the BAM file is a cheap way to
validate that the BAM file exists -- it could be local or remote (http or ftp,
so file.exists isn't sufficient) and even if the file 'exists' as Ryan mentions
it needs to actually be a BAM file so should, e.g., have a header. open() allows
for all of these possibilities. Also, trying to open a non-existent file
results in a clear enough error:
> open(BamFile("sdfs"))
Error in value[[3L]](cond) :
failed to open BamFile: file(s) do not exist:
'sdfs'
So against the votes of the other contributors to this thread, I haven't made a
change. Sorry about that.
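A package that wants an explicit yes/no answer could wrap this open()-based validation in a small helper. The sketch below is an assumption-laden illustration, not Rsamtools code: canOpen() is a hypothetical name, and it relies only on the object having open() and close() methods, as BamFile does in this thread. To stay free of dependencies, the demonstration uses base R file connections, which also separate creation from opening.

```r
# Hypothetical helper: TRUE if 'x' can be opened, FALSE otherwise.
# On success the object is closed again before returning.
canOpen <- function(x) {
    tryCatch(
        suppressWarnings({
            open(x)
            close(x)
            TRUE
        }),
        error = function(e) FALSE
    )
}

# Dependency-free demonstration with base connections.
path <- tempfile()
writeLines("example", path)
good <- file(path)          # created, but not yet opened
bad <- file(tempfile())     # refers to a file that does not exist
goodOK <- canOpen(good)
badOK <- canOpen(bad)
close(bad)                  # release the connection that never opened

# Assumed usage with Rsamtools, following the thread:
# canOpen(BamFile("sdfs"))
```

Because nothing is touched until open() is called, the object can still be created up front and shipped to workers, which is the use case described above.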
I added a check that yieldSize is a non-negative scalar integer, or NA.
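The actual check is not shown in this thread. As a rough guess at what "a non-negative scalar integer, or NA" could look like in base R, here is a minimal sketch; isValidYieldSize() is a hypothetical name, and unlike a strict is.integer() test it also accepts whole-number doubles such as 1e6.

```r
# Hypothetical validator, not the Rsamtools implementation.
isValidYieldSize <- function(yieldSize) {
    # Must be a single atomic value.
    if (!is.atomic(yieldSize) || length(yieldSize) != 1L)
        return(FALSE)
    # NA is allowed.
    if (is.na(yieldSize))
        return(TRUE)
    # Otherwise require a finite, non-negative whole number.
    is.numeric(yieldSize) && is.finite(yieldSize) &&
        yieldSize >= 0 && yieldSize == trunc(yieldSize)
}

isValidYieldSize(-1)       # the value from the original report is rejected
isValidYieldSize(100000L)
```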
A related point unclear at the moment in the documentation is what the index filename should be: i.e. scanBam expects as the index the same value as for the bam filename (that assumes the user has not renamed his bam.bai file and you never know what users might be doing... :-S ... ) but the BamFile Rd page says:
file: A character vector of BAM file paths
index: A character vector of indices (for BamFile);
so it's unclear to me what the index character vector should contain.
Tried to clarify that, it's just a character vector containing the path to the index file. Generally, the code tries not to care about whether the index file is specified with a '.bai' extension, or without.
Martin
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
Hi Martin,
On 8 Jan 2013, at 19:53, Martin Morgan wrote:
I want to be able to specify a BAM file without opening it, and then open it later, e.g., in mclapply or after distributing to a cluster. Also, conceptually, I want to distinguish between processing an entire BAM file -- provide me with something for which isOpen(BamFile("foo")) == FALSE -- versus reading a chunk of a BamFile, i.e., already open. So I separated BamFile creation from open().
I focus on open() in the above because opening the BAM file is a cheap way to validate that the BAM file exists -- it could be local or remote (http or ftp, so file.exists isn't sufficient) and even if the file 'exists' as Ryan mentions it needs to actually be a BAM file so should, e.g., have a header. open() allows for all of these possibilities. Also, the consequences of trying to open a non-existent file results in a clear enough error
> open(BamFile("sdfs"))
Error in value[[3L]](cond) :
failed to open BamFile: file(s) do not exist:
'sdfs'
So against the votes of the other contributors to this thread, I haven't made a change. Sorry about that.
No need to apologize. I hadn't thought of use cases like those you presented above, where not checking makes perfect sense. I'll use open() for validating.
I added a check that yieldSize is a non-negative scalar integer, or NA.
Great thanks.
A related point unclear at the moment in the documentation is what the index filename should be: i.e. scanBam expects as the index the same value as for the bam filename (that assumes the user has not renamed his bam.bai file and you never know what users might be doing... :-S ... ) but the BamFile Rd page says:
file: A character vector of BAM file paths
index: A character vector of indices (for BamFile);
so it's unclear to me what the index character vector should contain.
Tried to clarify that, it's just a character vector containing the path to the index file. Generally, the code tries not to care about whether the index file is specified with a '.bai' extension, or without.
That was my perception :-) just wanted to be sure. A related question: could you detail which functions require the bai index to be present and which ones "just" benefit from it?
Cheers,
Nico