Hi all, I'm soliciting feedback on the idea of a general file 'registry' that would identify file types by their extensions. This is similar in spirit to FileForformat() in rtracklayer but a more general abstraction that could be used across packages. The goal is to allow a user to supply only file name(s) to a method instead of first creating a 'File' class such as BamFile, FaFile, BigWigFile etc. A first attempt at this is in the GenomicFileViews package (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) is created as an environment at load time: .fileTypeRegistry <- new.env(parent=emptyenv() Files are registered with an information triplet consisting of class, package and regular expression to identify the extension. In GenomicFileViews we register FaFileList, BamFileList and BigWigFileList but any 'File' class can be registered that has a constructor of the same name. .onLoad <- function(libname, pkgname) { registerFileType("FaFileList", "Rsamtools", "\\.fa$") registerFileType("FaFileList", "Rsamtools", "\\.fasta$") registerFileType("BamFileList", "Rsamtools", "\\.bam$") registerFileType("BigWigFileList", "rtracklayer", "\\.bw$") } The makeFileType() helper creates the appropriate class. This function is used behind the scenes to do the lookup and coerce to the correct 'File' class. > makeFileType(c("foo.bam", "bar.bam")) BamFileList of length 2 names(2): foo.bam bar.bam New types can be added at any time with registerFileType(): registerFileType(NewClass, NewPackage, "\\.NewExtension$") Thoughts: (1) If this sounds generally useful where should it live? rtracklayer, GenomicFileViews or other? Alternatively it could be its own lightweight package (FileRegister) that creates the registry and provides the helpers. It would be up to the package authors that depend on FileRegister to register their own files types at load time. (2) To avoid potential ambiguities maybe searching should be by regex and package name. Still a work in progress. Valerie
[Bioc-devel] file registry - feedback
18 messages · Gabriel Becker, Vincent Carey, Valerie Obenchain +3 more
Hi Val, I think it would help understand the motivations behind this proposal if you could give an example of a method where the user cannot supply a file name but has to create a 'File' (or 'FileList') object first. And how the file registry proposal below would help. It looks like you have such an example in the GenomicFileViews package. Do you think you could give more details? Thanks, H.
On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
Hi all, I'm soliciting feedback on the idea of a general file 'registry' that would identify file types by their extensions. This is similar in spirit to FileForformat() in rtracklayer but a more general abstraction that could be used across packages. The goal is to allow a user to supply only file name(s) to a method instead of first creating a 'File' class such as BamFile, FaFile, BigWigFile etc. A first attempt at this is in the GenomicFileViews package (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) is created as an environment at load time: .fileTypeRegistry <- new.env(parent=emptyenv() Files are registered with an information triplet consisting of class, package and regular expression to identify the extension. In GenomicFileViews we register FaFileList, BamFileList and BigWigFileList but any 'File' class can be registered that has a constructor of the same name. .onLoad <- function(libname, pkgname) { registerFileType("FaFileList", "Rsamtools", "\\.fa$") registerFileType("FaFileList", "Rsamtools", "\\.fasta$") registerFileType("BamFileList", "Rsamtools", "\\.bam$") registerFileType("BigWigFileList", "rtracklayer", "\\.bw$") } The makeFileType() helper creates the appropriate class. This function is used behind the scenes to do the lookup and coerce to the correct 'File' class.
> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2 names(2): foo.bam bar.bam New types can be added at any time with registerFileType(): registerFileType(NewClass, NewPackage, "\\.NewExtension$") Thoughts: (1) If this sounds generally useful where should it live? rtracklayer, GenomicFileViews or other? Alternatively it could be its own lightweight package (FileRegister) that creates the registry and provides the helpers. It would be up to the package authors that depend on FileRegister to register their own files types at load time. (2) To avoid potential ambiguities maybe searching should be by regex and package name. Still a work in progress. Valerie
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140311/f691cec2/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140311/811a0a61/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140311/250282e2/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140311/f9f12d03/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140311/d1d72d2d/attachment.pl>
Hi Herve,
On 03/10/2014 10:31 PM, Herv? Pag?s wrote:
Hi Val, I think it would help understand the motivations behind this proposal if you could give an example of a method where the user cannot supply a file name but has to create a 'File' (or 'FileList') object first. And how the file registry proposal below would help. It looks like you have such an example in the GenomicFileViews package. Do you think you could give more details?
The most recent motivating use case was in creating subclasses of GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted to have a general constructor, something like GenomicFileViews(), that would create the appropriate subclass. However to create the correct subclass we needed to know if the files were bam, bw, fasta etc. Recognition of the file type by extension would allow us to do this with no further input from the user. Val
Thanks, H. On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
Hi all, I'm soliciting feedback on the idea of a general file 'registry' that would identify file types by their extensions. This is similar in spirit to FileForformat() in rtracklayer but a more general abstraction that could be used across packages. The goal is to allow a user to supply only file name(s) to a method instead of first creating a 'File' class such as BamFile, FaFile, BigWigFile etc. A first attempt at this is in the GenomicFileViews package (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) is created as an environment at load time: .fileTypeRegistry <- new.env(parent=emptyenv() Files are registered with an information triplet consisting of class, package and regular expression to identify the extension. In GenomicFileViews we register FaFileList, BamFileList and BigWigFileList but any 'File' class can be registered that has a constructor of the same name. .onLoad <- function(libname, pkgname) { registerFileType("FaFileList", "Rsamtools", "\\.fa$") registerFileType("FaFileList", "Rsamtools", "\\.fasta$") registerFileType("BamFileList", "Rsamtools", "\\.bam$") registerFileType("BigWigFileList", "rtracklayer", "\\.bw$") } The makeFileType() helper creates the appropriate class. This function is used behind the scenes to do the lookup and coerce to the correct 'File' class.
> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2 names(2): foo.bam bar.bam New types can be added at any time with registerFileType(): registerFileType(NewClass, NewPackage, "\\.NewExtension$") Thoughts: (1) If this sounds generally useful where should it live? rtracklayer, GenomicFileViews or other? Alternatively it could be its own lightweight package (FileRegister) that creates the registry and provides the helpers. It would be up to the package authors that depend on FileRegister to register their own files types at load time. (2) To avoid potential ambiguities maybe searching should be by regex and package name. Still a work in progress. Valerie
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140311/cbf4563d/attachment.pl>
Hi,
On 03/11/2014 09:47 AM, Michael Lawrence wrote:
Except for the checksum, the existing File classes should support this, where the package provides a dataset via data() that is just the serialized File object (path). One could create a FileWithChecksum class that decorates a File object with a checksum. Any attempts to read the file are intercepted by the decorator, which verifies the checksum, and then delegates.
Neat. Sounds like this is worth pursuing.
Michael On Tue, Mar 11, 2014 at 8:53 AM, Vincent Carey <stvjc at channing.harvard.edu>wrote:
I'm going to suggest a use case that may motivate this type of development. Up to 2010 or so, data packages generally made sense. You have about 100-500MB of serialized or pre-serialized stuff. Installing it in an R package is unpleasant from a resource consumption perspective but it works, you can use data/extdata and work with data with programmatic access, documentation and checkability. More recently, it is easy to come across data resources that we'd like to have package-like control over/access to, but installing such packages makes no sense. The volume is too big, and you want to work with the resource with non-R tools as well from time to time. You don't want to move the data. We should have a protocol for "packaging" data without installing it. A digest of the raw data resource should be computed and kept in the registry. A registered file can be part of a package that can be checked and installed, but the data themselves do not move. Genomic data in S3 buckets should provide a basic use case. The digest is recomputed whenever we want to start working with the registry/package to verify that we are working with the intended artifact. On Tue, Mar 11, 2014 at 11:11 AM, Gabriel Becker <gmbecker at ucdavis.edu>wrote:
Would it be better to let the user (registerer) specify a function, which could be a simple class constructor or something more complex in cases where that would be useful?
Yes, good suggestion.
This could allow the concept to generalize to other things, such as databases that might need some startup machinery called before they are actually useful to the user.
The intent of the registry was to provide a way to lookup files by their extension. I'm not sure how this applies to the database example. Do you envision creating multiple databases throughout an R session (vs a single set up at load time)? For example if the file has type 'X' extension it becomes a type 'X' database etc.?
This would also deal with Michael's point about package/where since functions have their own "where" information. Unless I'm missing some other intent for specifying a specific package? ~G On Tue, Mar 11, 2014 at 5:59 AM, Michael Lawrence < lawrence.michael at gene.com
wrote:
rtracklayer essentially has this, although registration is implicit
through
extension of RTLFile or RsamtoolsFile, and the extension is taken from
the
class name. There is a BigWigFile, corresponding to ".bigwig", and that
is
extended by BWFile to support the ".bw" extension. The expectation is
that
other packages would extend RTLFile to implictly register handlers. I'm not sure there is a use case for generalization, but this proposal makes registration more explicit, which is probably a good thing. rtracklayer
was
just piggy backing on S4 registration. I'm a little bit confused by the use of Lists rather than individual
File
objects. Are you also proposing that all RTLFiles would need a corresponding List, and that there would need to be an RTLFileList
method
for the various generics?
No, I don't want to force the 'List' route. I was using them in GenomicFileViews so that's what I registered. The 'class' should be any class that has a constructor of the same name. Thinking about this more the 'class' probably should be the individual File object instead of the List object. Coercion to List can be done inside the helper.
It may not be necessary to specify the package name. There should be an environment (where) argument that defaults to topenv(parent.frame()),
and
that should suffice.
I'll look into this. Any comments on whether this should be it's own package or in an existing one? Thanks for the input. Valerie
Michael On Mon, Mar 10, 2014 at 8:46 PM, Valerie Obenchain <vobencha at fhcrc.org
wrote:
Hi all, I'm soliciting feedback on the idea of a general file 'registry' that would identify file types by their extensions. This is similar in
spirit
to
FileForformat() in rtracklayer but a more general abstraction that
could
be
used across packages. The goal is to allow a user to supply only file name(s) to a method instead of first creating a 'File' class such as BamFile, FaFile, BigWigFile etc. A first attempt at this is in the GenomicFileViews package ( https://github.com/Bioconductor/GenomicFileViews). A registry
(lookup)
is
created as an environment at load time: .fileTypeRegistry <- new.env(parent=emptyenv() Files are registered with an information triplet consisting of class, package and regular expression to identify the extension. In GenomicFileViews we register FaFileList, BamFileList and
BigWigFileList
but
any 'File' class can be registered that has a constructor of the same
name.
.onLoad <- function(libname, pkgname)
{
registerFileType("FaFileList", "Rsamtools", "\\.fa$")
registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
registerFileType("BamFileList", "Rsamtools", "\\.bam$")
registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
}
The makeFileType() helper creates the appropriate class. This
function is
used behind the scenes to do the lookup and coerce to the correct
'File'
class.
makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2 names(2): foo.bam bar.bam New types can be added at any time with registerFileType(): registerFileType(NewClass, NewPackage, "\\.NewExtension$") Thoughts: (1) If this sounds generally useful where should it live? rtracklayer, GenomicFileViews or other? Alternatively it could be its own
lightweight
package (FileRegister) that creates the registry and provides the
helpers.
It would be up to the package authors that depend on FileRegister to register their own files types at load time. (2) To avoid potential ambiguities maybe searching should be by regex
and
package name. Still a work in progress. Valerie
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319
On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
Hi Herve, On 03/10/2014 10:31 PM, Herv? Pag?s wrote:
Hi Val, I think it would help understand the motivations behind this proposal if you could give an example of a method where the user cannot supply a file name but has to create a 'File' (or 'FileList') object first. And how the file registry proposal below would help. It looks like you have such an example in the GenomicFileViews package. Do you think you could give more details?
The most recent motivating use case was in creating subclasses of GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted to have a general constructor, something like GenomicFileViews(), that would create the appropriate subclass. However to create the correct subclass we needed to know if the files were bam, bw, fasta etc. Recognition of the file type by extension would allow us to do this with no further input from the user.
That helps, thanks! Having this kind of general constructor sounds like it could indeed be useful. Would be an opportunity to put all these *File classes (the 22 RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile subclasses defined in Rsamtools) under the same umbrella (i.e. a parent virtual class) and use the name of this virtual class (e.g. File) for the general constructor. Allowing a registration mechanism to extend the knowledge of this File() constructor is an implementation detail. I don't see a lot of benefit to it. Only a package that implements a concrete File subclass would actually need to register the new subclass. Sounds easy enough to ask to whoever has commit access to the File() code to modify it. This kind of update might also require adding the name of the package where the new File subclass is implemented to the Depends/Imports/Suggests of the package where File() lives, which is something that cannot be done via a registration mechanism. H.
Val
Thanks, H. On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
Hi all, I'm soliciting feedback on the idea of a general file 'registry' that would identify file types by their extensions. This is similar in spirit to FileForformat() in rtracklayer but a more general abstraction that could be used across packages. The goal is to allow a user to supply only file name(s) to a method instead of first creating a 'File' class such as BamFile, FaFile, BigWigFile etc. A first attempt at this is in the GenomicFileViews package (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) is created as an environment at load time: .fileTypeRegistry <- new.env(parent=emptyenv() Files are registered with an information triplet consisting of class, package and regular expression to identify the extension. In GenomicFileViews we register FaFileList, BamFileList and BigWigFileList but any 'File' class can be registered that has a constructor of the same name. .onLoad <- function(libname, pkgname) { registerFileType("FaFileList", "Rsamtools", "\\.fa$") registerFileType("FaFileList", "Rsamtools", "\\.fasta$") registerFileType("BamFileList", "Rsamtools", "\\.bam$") registerFileType("BigWigFileList", "rtracklayer", "\\.bw$") } The makeFileType() helper creates the appropriate class. This function is used behind the scenes to do the lookup and coerce to the correct 'File' class.
> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2 names(2): foo.bam bar.bam New types can be added at any time with registerFileType(): registerFileType(NewClass, NewPackage, "\\.NewExtension$") Thoughts: (1) If this sounds generally useful where should it live? rtracklayer, GenomicFileViews or other? Alternatively it could be its own lightweight package (FileRegister) that creates the registry and provides the helpers. It would be up to the package authors that depend on FileRegister to register their own files types at load time. (2) To avoid potential ambiguities maybe searching should be by regex and package name. Still a work in progress. Valerie
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
On 03/11/2014 02:52 PM, Herv? Pag?s wrote:
On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
Hi Herve, On 03/10/2014 10:31 PM, Herv? Pag?s wrote:
Hi Val, I think it would help understand the motivations behind this proposal if you could give an example of a method where the user cannot supply a file name but has to create a 'File' (or 'FileList') object first. And how the file registry proposal below would help. It looks like you have such an example in the GenomicFileViews package. Do you think you could give more details?
The most recent motivating use case was in creating subclasses of GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted to have a general constructor, something like GenomicFileViews(), that would create the appropriate subclass. However to create the correct subclass we needed to know if the files were bam, bw, fasta etc. Recognition of the file type by extension would allow us to do this with no further input from the user.
That helps, thanks! Having this kind of general constructor sounds like it could indeed be useful. Would be an opportunity to put all these *File classes (the 22 RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile subclasses defined in Rsamtools) under the same umbrella (i.e. a parent virtual class) and use the name of this virtual class (e.g. File) for the general constructor. Allowing a registration mechanism to extend the knowledge of this File() constructor is an implementation detail. I don't see a lot of benefit to it. Only a package that implements a concrete File subclass would actually need to register the new subclass. Sounds easy enough to ask to whoever has commit access to the File() code to modify it. This kind of update might also require adding the name of the package where the new File subclass is implemented to the Depends/Imports/Suggests of the package where File() lives, which is something that cannot be done via a registration mechanism.
This clean-up of the *File jungle would also be a good opportunity to:
- Choose what we want to do with reference classes: use them for all
the *File classes or for none of them. (Right now, those defined
in Rsamtools are reference classes, and those defined in
rtracklayer are not.)
- Move the I/O functionality currently in rtracklayer to a
separate package. Based on the number of contributed packages I
reviewed so far that were trying to reinvent the wheel because
they had no idea that the I/O function they needed was actually
in rtracklayer, I'd like to advocate for using a package name
that makes it very clear that it's all about I/O.
H.
H.
Val
Thanks, H. On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
Hi all, I'm soliciting feedback on the idea of a general file 'registry' that would identify file types by their extensions. This is similar in spirit to FileForformat() in rtracklayer but a more general abstraction that could be used across packages. The goal is to allow a user to supply only file name(s) to a method instead of first creating a 'File' class such as BamFile, FaFile, BigWigFile etc. A first attempt at this is in the GenomicFileViews package (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) is created as an environment at load time: .fileTypeRegistry <- new.env(parent=emptyenv() Files are registered with an information triplet consisting of class, package and regular expression to identify the extension. In GenomicFileViews we register FaFileList, BamFileList and BigWigFileList but any 'File' class can be registered that has a constructor of the same name. .onLoad <- function(libname, pkgname) { registerFileType("FaFileList", "Rsamtools", "\\.fa$") registerFileType("FaFileList", "Rsamtools", "\\.fasta$") registerFileType("BamFileList", "Rsamtools", "\\.bam$") registerFileType("BigWigFileList", "rtracklayer", "\\.bw$") } The makeFileType() helper creates the appropriate class. This function is used behind the scenes to do the lookup and coerce to the correct 'File' class.
> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2 names(2): foo.bam bar.bam New types can be added at any time with registerFileType(): registerFileType(NewClass, NewPackage, "\\.NewExtension$") Thoughts: (1) If this sounds generally useful where should it live? rtracklayer, GenomicFileViews or other? Alternatively it could be its own lightweight package (FileRegister) that creates the registry and provides the helpers. It would be up to the package authors that depend on FileRegister to register their own files types at load time. (2) To avoid potential ambiguities maybe searching should be by regex and package name. Still a work in progress. Valerie
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
Hi,
On 03/11/14 15:33, Herv? Pag?s wrote:
On 03/11/2014 02:52 PM, Herv? Pag?s wrote:
On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
Hi Herve, On 03/10/2014 10:31 PM, Herv? Pag?s wrote:
Hi Val, I think it would help understand the motivations behind this proposal if you could give an example of a method where the user cannot supply a file name but has to create a 'File' (or 'FileList') object first. And how the file registry proposal below would help. It looks like you have such an example in the GenomicFileViews package. Do you think you could give more details?
The most recent motivating use case was in creating subclasses of GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted to have a general constructor, something like GenomicFileViews(), that would create the appropriate subclass. However to create the correct subclass we needed to know if the files were bam, bw, fasta etc. Recognition of the file type by extension would allow us to do this with no further input from the user.
That helps, thanks! Having this kind of general constructor sounds like it could indeed be useful. Would be an opportunity to put all these *File classes (the 22 RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile subclasses defined in Rsamtools) under the same umbrella (i.e. a parent virtual class) and use the name of this virtual class (e.g. File) for the general constructor. Allowing a registration mechanism to extend the knowledge of this File() constructor is an implementation detail. I don't see a lot of benefit to it. Only a package that implements a concrete File subclass would actually need to register the new subclass. Sounds easy enough to ask to whoever has commit access to the File() code to modify it. This kind of update might also require adding the name of the package where the new File subclass is implemented to the Depends/Imports/Suggests of the package where File() lives, which is something that cannot be done via a registration mechanism.
This clean-up of the *File jungle would also be a good opportunity to:
- Choose what we want to do with reference classes: use them for all
the *File classes or for none of them. (Right now, those defined
in Rsamtools are reference classes, and those defined in
rtracklayer are not.)
- Move the I/O functionality currently in rtracklayer to a
separate package. Based on the number of contributed packages I
reviewed so far that were trying to reinvent the wheel because
they had no idea that the I/O function they needed was actually
in rtracklayer, I'd like to advocate for using a package name
that makes it very clear that it's all about I/O.
Thanks for the suggestions. This re-org sounds good to me. As you say, unifying the *File classes in a single package would make them more visible to other developers and enforce consistent behavior. If you aren't in favor of a registration mechanism for 'discovery' how should a function with methods for many *File classes (e.g., import()) handle a character file name? import() uses FileForFormat() to discover the file type, makes the *File class and dispatches to the appropriate *File method. The registry was an attempt at generalizing this concept. What do you think about the use of a registry for Vince's idea of holding a digest/path reference to large data but not installing it until it's used? Other ideas of how / where this could be stored? Val
H.
H.
Val
Thanks, H. On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
Hi all, I'm soliciting feedback on the idea of a general file 'registry' that would identify file types by their extensions. This is similar in spirit to FileForformat() in rtracklayer but a more general abstraction that could be used across packages. The goal is to allow a user to supply only file name(s) to a method instead of first creating a 'File' class such as BamFile, FaFile, BigWigFile etc. A first attempt at this is in the GenomicFileViews package (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) is created as an environment at load time: .fileTypeRegistry <- new.env(parent=emptyenv() Files are registered with an information triplet consisting of class, package and regular expression to identify the extension. In GenomicFileViews we register FaFileList, BamFileList and BigWigFileList but any 'File' class can be registered that has a constructor of the same name. .onLoad <- function(libname, pkgname) { registerFileType("FaFileList", "Rsamtools", "\\.fa$") registerFileType("FaFileList", "Rsamtools", "\\.fasta$") registerFileType("BamFileList", "Rsamtools", "\\.bam$") registerFileType("BigWigFileList", "rtracklayer", "\\.bw$") } The makeFileType() helper creates the appropriate class. This function is used behind the scenes to do the lookup and coerce to the correct 'File' class.
> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2 names(2): foo.bam bar.bam New types can be added at any time with registerFileType(): registerFileType(NewClass, NewPackage, "\\.NewExtension$") Thoughts: (1) If this sounds generally useful where should it live? rtracklayer, GenomicFileViews or other? Alternatively it could be its own lightweight package (FileRegister) that creates the registry and provides the helpers. It would be up to the package authors that depend on FileRegister to register their own files types at load time. (2) To avoid potential ambiguities maybe searching should be by regex and package name. Still a work in progress. Valerie
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140311/deaf8ac0/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140311/b970cdc9/attachment.pl>
Hi Val,
On 03/11/2014 08:57 PM, Valerie Obenchain wrote:
Hi, On 03/11/14 15:33, Herv? Pag?s wrote:
On 03/11/2014 02:52 PM, Herv? Pag?s wrote:
On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
Hi Herve, On 03/10/2014 10:31 PM, Herv? Pag?s wrote:
Hi Val, I think it would help understand the motivations behind this proposal if you could give an example of a method where the user cannot supply a file name but has to create a 'File' (or 'FileList') object first. And how the file registry proposal below would help. It looks like you have such an example in the GenomicFileViews package. Do you think you could give more details?
The most recent motivating use case was in creating subclasses of GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted to have a general constructor, something like GenomicFileViews(), that would create the appropriate subclass. However to create the correct subclass we needed to know if the files were bam, bw, fasta etc. Recognition of the file type by extension would allow us to do this with no further input from the user.
That helps, thanks! Having this kind of general constructor sounds like it could indeed be useful. Would be an opportunity to put all these *File classes (the 22 RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile subclasses defined in Rsamtools) under the same umbrella (i.e. a parent virtual class) and use the name of this virtual class (e.g. File) for the general constructor. Allowing a registration mechanism to extend the knowledge of this File() constructor is an implementation detail. I don't see a lot of benefit to it. Only a package that implements a concrete File subclass would actually need to register the new subclass. Sounds easy enough to ask to whoever has commit access to the File() code to modify it. This kind of update might also require adding the name of the package where the new File subclass is implemented to the Depends/Imports/Suggests of the package where File() lives, which is something that cannot be done via a registration mechanism.
This clean-up of the *File jungle would also be a good opportunity to:
- Choose what we want to do with reference classes: use them for all
the *File classes or for none of them. (Right now, those defined
in Rsamtools are reference classes, and those defined in
rtracklayer are not.)
- Move the I/O functionality currently in rtracklayer to a
separate package. Based on the number of contributed packages I
reviewed so far that were trying to reinvent the wheel because
they had no idea that the I/O function they needed was actually
in rtracklayer, I'd like to advocate for using a package name
that makes it very clear that it's all about I/O.
Thanks for the suggestions. This re-org sounds good to me. As you say, unifying the *File classes in a single package would make them more visible to other developers and enforce consistent behavior. If you aren't in favor of a registration mechanism for 'discovery' how should a function with methods for many *File classes (e.g., import()) handle a character file name? import() uses FileForFormat() to discover the file type, makes the *File class and dispatches to the appropriate *File method. The registry was an attempt at generalizing this concept.
It would use File() (or whatever the name of this general constructor
would be) instead of FileForFormat(). Right now FileForFormat() seems
to know how to map almost any supported file extension to the
corresponding *File class:
> class(FileForFormat("aaa.bed"))
[1] "BEDFile"
attr(,"package")
[1] "rtracklayer"
> class(FileForFormat("aaa.gff"))
[1] "GFFFile"
attr(,"package")
[1] "rtracklayer"
> class(FileForFormat("aaa.wig"))
[1] "WIGFile"
attr(,"package")
[1] "rtracklayer"
> class(FileForFormat("aaa.xz"))
[1] "XZFile"
attr(,"package")
[1] "rtracklayer"
> class(FileForFormat("aaa.2bit"))
[1] "TwoBitFile"
attr(,"package")
[1] "rtracklayer"
> class(FileForFormat("aaa.bam"))
[1] "BamFile"
attr(,"package")
[1] "Rsamtools"
> class(FileForFormat("aaa.fasta"))
[1] "FastaFile"
attr(,"package")
[1] "rtracklayer"
etc...
To my knowledge it doesn't use a registration mechanism for that.
It only fails for a couple of common extensions e.g.:
> class(FileForFormat("aaa.fa"))
Error in FileForFormat("aaa.fa") : Format 'fa' unsupported
but in such situation it seems easy enough to just add the
".fa" -> FastaFile mapping as permanent knowledge to FileForFormat(),
rather than having a registration mechanism that allows Rsamtools
to dynamically add the mapping at load time.
Another situation that can happen is that a new *File class is
implemented in a package (say foo) other than rtracklayer or
Rsamtools to support a new file format/extension. Again it sounds
easy enough to add the new mapping as permanent knowledge to
FileForFormat(). Bioconductor is a collaborative environment where
developers are encouraged to talk to each other :-)
Now I'm assuming that most of the file formats that are meaningful
for Bioconductor are already covered with an "ext" -> *File mapping
and that the need for new mappings is going to be a rare event.
But maybe I'm wrong. IMO a registration mechanism would be justified
if: (1) we anticipate many new *File classes and new mappings to come,
and (2) many of the new *File classes are going to be implemented
in many different packages outside rtracklayer and Rsamtools.
What do you think about the use of a registry for Vince's idea of holding a digest/path reference to large data but not installing it until it's used? Other ideas of how / where this could be stored?
Maybe it's related to the file registry thing. But maybe it's a slightly different topic. A real detailed example would help. H. PS: Note that the ".fasta" extension is currently mapped to FastaFile (defined in rtracklayer) and FaFile (defined in Rsamtools). If we had a registration mechanism in place to let rtracklayer and Rsamtools register mappings at load time, we would end up with one mapping overriding the other. A registration mechanism that allows any package to register mappings would make his kind of conflicts possible and probably more frequent than we would like.
Val
H.
H.
Val
Thanks, H. On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
Hi all, I'm soliciting feedback on the idea of a general file 'registry' that would identify file types by their extensions. This is similar in spirit to FileForformat() in rtracklayer but a more general abstraction that could be used across packages. The goal is to allow a user to supply only file name(s) to a method instead of first creating a 'File' class such as BamFile, FaFile, BigWigFile etc. A first attempt at this is in the GenomicFileViews package (https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) is created as an environment at load time: .fileTypeRegistry <- new.env(parent=emptyenv() Files are registered with an information triplet consisting of class, package and regular expression to identify the extension. In GenomicFileViews we register FaFileList, BamFileList and BigWigFileList but any 'File' class can be registered that has a constructor of the same name. .onLoad <- function(libname, pkgname) { registerFileType("FaFileList", "Rsamtools", "\\.fa$") registerFileType("FaFileList", "Rsamtools", "\\.fasta$") registerFileType("BamFileList", "Rsamtools", "\\.bam$") registerFileType("BigWigFileList", "rtracklayer", "\\.bw$") } The makeFileType() helper creates the appropriate class. This function is used behind the scenes to do the lookup and coerce to the correct 'File' class.
> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2 names(2): foo.bam bar.bam New types can be added at any time with registerFileType(): registerFileType(NewClass, NewPackage, "\\.NewExtension$") Thoughts: (1) If this sounds generally useful where should it live? rtracklayer, GenomicFileViews or other? Alternatively it could be its own lightweight package (FileRegister) that creates the registry and provides the helpers. It would be up to the package authors that depend on FileRegister to register their own files types at load time. (2) To avoid potential ambiguities maybe searching should be by regex and package name. Still a work in progress. Valerie
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140312/3e8a3a74/attachment.pl>
13 days later
Hi, This discussion went off-line and I wanted to give a summary of what we decided to go with. We'll create a new package, BiocFile, that has a minimal API. API: - 'File' class (virtual, reference class) and constructor - close / open / isOpen - import / export - file registry We won't require existing *File classes to implement yield but would 'recommend' that new *File classes do. By getting this structure in place we can guide future *File developments in a consistent direction even if we can't harmonize all current classes. I'll start work on this after the release. Thanks again for the input. Valerie
On 03/11/2014 10:23 PM, Michael Lawrence wrote:
On Tue, Mar 11, 2014 at 3:33 PM, Herv? Pag?s <hpages at fhcrc.org
<mailto:hpages at fhcrc.org>> wrote:
On 03/11/2014 02:52 PM, Herv? Pag?s wrote:
On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
Hi Herve,
On 03/10/2014 10:31 PM, Herv? Pag?s wrote:
Hi Val,
I think it would help understand the motivations behind
this proposal
if you could give an example of a method where the user
cannot supply
a file name but has to create a 'File' (or 'FileList')
object first.
And how the file registry proposal below would help.
It looks like you have such an example in the
GenomicFileViews package.
Do you think you could give more details?
The most recent motivating use case was in creating
subclasses of
GenomicFileViews objects (BamFileViews, BigWigFileViews,
etc.) We wanted
to have a general constructor, something like
GenomicFileViews(), that
would create the appropriate subclass. However to create the
correct
subclass we needed to know if the files were bam, bw, fasta etc.
Recognition of the file type by extension would allow us to
do this with
no further input from the user.
That helps, thanks!
Having this kind of general constructor sounds like it could
indeed be
useful. Would be an opportunity to put all these *File classes
(the 22
RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
subclasses defined in Rsamtools) under the same umbrella (i.e. a
parent
virtual class) and use the name of this virtual class (e.g.
File) for
the general constructor.
Allowing a registration mechanism to extend the knowledge of
this File()
constructor is an implementation detail. I don't see a lot of
benefit to
it. Only a package that implements a concrete File subclass would
actually need to register the new subclass. Sounds easy enough
to ask
to whoever has commit access to the File() code to modify it. This
kind of update might also require adding the name of the package
where
the new File subclass is implemented to the Depends/Imports/Suggests
of the package where File() lives, which is something that cannot be
done via a registration mechanism.
This clean-up of the *File jungle would also be a good opportunity to:
- Choose what we want to do with reference classes: use them for all
the *File classes or for none of them. (Right now, those defined
in Rsamtools are reference classes, and those defined in
rtracklayer are not.)
- Move the I/O functionality currently in rtracklayer to a
separate package. Based on the number of contributed packages I
reviewed so far that were trying to reinvent the wheel because
they had no idea that the I/O function they needed was actually
in rtracklayer, I'd like to advocate for using a package name
that makes it very clear that it's all about I/O.
I can see some benefit in renaming/reorganizing, but if they weren't
able to perform a simple google search for functionality, I don't think
the name of the package was the problem. "read gff bioconductor" returns
rtracklayer as the top hit.
H.
H.
Val
Thanks,
H.
On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
Hi all,
I'm soliciting feedback on the idea of a general
file 'registry' that
would identify file types by their extensions. This
is similar in
spirit
to FileForformat() in rtracklayer but a more general
abstraction that
could be used across packages. The goal is to allow
a user to supply
only file name(s) to a method instead of first
creating a 'File' class
such as BamFile, FaFile, BigWigFile etc.
A first attempt at this is in the GenomicFileViews
package
(https://github.com/__Bioconductor/GenomicFileViews
<https://github.com/Bioconductor/GenomicFileViews>)__.
A registry (lookup)
is created as an environment at load time:
.fileTypeRegistry <- new.env(parent=emptyenv()
Files are registered with an information triplet
consisting of class,
package and regular expression to identify the
extension. In
GenomicFileViews we register FaFileList, BamFileList
and BigWigFileList
but any 'File' class can be registered that has a
constructor of the
same name.
.onLoad <- function(libname, pkgname)
{
registerFileType("FaFileList", "Rsamtools",
"\\.fa$")
registerFileType("FaFileList", "Rsamtools",
"\\.fasta$")
registerFileType("BamFileList"__, "Rsamtools",
"\\.bam$")
registerFileType("__BigWigFileList",
"rtracklayer", "\\.bw$")
}
The makeFileType() helper creates the appropriate
class. This function
is used behind the scenes to do the lookup and
coerce to the correct
'File' class.
> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2
names(2): foo.bam bar.bam
New types can be added at any time with
registerFileType():
registerFileType(NewClass, NewPackage,
"\\.NewExtension$")
Thoughts:
(1) If this sounds generally useful where should it
live? rtracklayer,
GenomicFileViews or other? Alternatively it could be
its own
lightweight
package (FileRegister) that creates the registry and
provides the
helpers. It would be up to the package authors that
depend on
FileRegister to register their own files types at
load time.
(2) To avoid potential ambiguities maybe searching
should be by regex
and package name. Still a work in progress.
Valerie
_________________________________________________
Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/__listinfo/bioc-devel
<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
--
Herv? Pag?s
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319