Skip to content

[Bioc-devel] file registry - feedback

18 messages · Gabriel Becker, Vincent Carey, Valerie Obenchain +3 more

#
Hi all,

I'm soliciting feedback on the idea of a general file 'registry' that 
would identify file types by their extensions. This is similar in spirit 
to FileForformat() in rtracklayer but a more general abstraction that 
could be used across packages. The goal is to allow a user to supply 
only file name(s) to a method instead of first creating a 'File' class 
such as BamFile, FaFile, BigWigFile etc.

A first attempt at this is in the GenomicFileViews package 
(https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) 
is created as an environment at load time:

.fileTypeRegistry <- new.env(parent=emptyenv()

Files are registered with an information triplet consisting of class, 
package and regular expression to identify the extension. In 
GenomicFileViews we register FaFileList, BamFileList and BigWigFileList 
but any 'File' class can be registered that has a constructor of the 
same name.

.onLoad <- function(libname, pkgname)
{
     registerFileType("FaFileList", "Rsamtools", "\\.fa$")
     registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
     registerFileType("BamFileList", "Rsamtools", "\\.bam$")
     registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
}

The makeFileType() helper creates the appropriate class. This function 
is used behind the scenes to do the lookup and coerce to the correct 
'File' class.

 > makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2
names(2): foo.bam bar.bam

New types can be added at any time with registerFileType():

registerFileType(NewClass, NewPackage, "\\.NewExtension$")


Thoughts:

(1) If this sounds generally useful where should it live? rtracklayer, 
GenomicFileViews or other? Alternatively it could be its own lightweight 
package (FileRegister) that creates the registry and provides the 
helpers. It would be up to the package authors that depend on 
FileRegister to register their own files types at load time.

(2) To avoid potential ambiguities maybe searching should be by regex 
and package name. Still a work in progress.


Valerie
#
Hi Val,

I think it would help understand the motivations behind this proposal
if you could give an example of a method where the user cannot supply
a file name but has to create a 'File' (or 'FileList') object first.
And how the file registry proposal below would help.
It looks like you have such an example in the GenomicFileViews package.
Do you think you could give more details?

Thanks,
H.
On 03/10/2014 08:46 PM, Valerie Obenchain wrote:

  
    
#
Hi Herve,
On 03/10/2014 10:31 PM, Herv? Pag?s wrote:
The most recent motivating use case was in creating subclasses of 
GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted 
to have a general constructor, something like GenomicFileViews(), that 
would create the appropriate subclass. However to create the correct 
subclass we needed to know if the files were bam, bw, fasta etc. 
Recognition of the file type by extension would allow us to do this with 
no further input from the user.

Val

  
    
#
Hi,
On 03/11/2014 09:47 AM, Michael Lawrence wrote:
Neat. Sounds like this is worth pursuing.
Yes, good suggestion.
The intent of the registry was to provide a way to lookup files by their 
extension. I'm not sure how this applies to the database example. Do you 
envision creating multiple databases throughout an R session (vs a 
single set up at load time)? For example if the file has type 'X' 
extension it becomes a type 'X' database etc.?
No, I don't want to force the 'List' route. I was using them in 
GenomicFileViews so that's what I registered. The 'class' should be any 
class that has a constructor of the same name. Thinking about this more 
the 'class' probably should be the individual File object instead of the 
List object. Coercion to List can be done inside the helper.
I'll look into this.


Any comments on whether this should be it's own package or in an 
existing one?


Thanks for the input.
Valerie

  
    
#
On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
That helps, thanks!

Having this kind of general constructor sounds like it could indeed be
useful. Would be an opportunity to put all these *File classes (the 22
RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
subclasses defined in Rsamtools) under the same umbrella (i.e. a parent
virtual class) and use the name of this virtual class (e.g. File) for
the general constructor.

Allowing a registration mechanism to extend the knowledge of this File()
constructor is an implementation detail. I don't see a lot of benefit to
it. Only a package that implements a concrete File subclass would 
actually need to register the new subclass. Sounds easy enough to ask
to whoever has commit access to the File() code to modify it. This
kind of update might also require adding the name of the package where
the new File subclass is implemented to the Depends/Imports/Suggests
of the package where File() lives, which is something that cannot be
done via a registration mechanism.

H.

  
    
#
On 03/11/2014 02:52 PM, Herv? Pag?s wrote:
This clean-up of the *File jungle would also be a good opportunity to:

   - Choose what we want to do with reference classes: use them for all
     the *File classes or for none of them. (Right now, those defined
     in Rsamtools are reference classes, and those defined in
     rtracklayer are not.)

   - Move the I/O functionality currently in rtracklayer to a
     separate package. Based on the number of contributed packages I
     reviewed so far that were trying to reinvent the wheel because
     they had no idea that the I/O function they needed was actually
     in rtracklayer, I'd like to advocate for using a package name
     that makes it very clear that it's all about I/O.

H.

  
    
#
Hi,
On 03/11/14 15:33, Herv? Pag?s wrote:
Thanks for the suggestions. This re-org sounds good to me. As you say, 
unifying the *File classes in a single package would make them more 
visible to other developers and enforce consistent behavior.

If you aren't in favor of a registration mechanism for 'discovery' how 
should a function with methods for many *File classes (e.g., import()) 
handle a character file name? import() uses FileForFormat() to discover 
the file type, makes the *File class and dispatches to the appropriate 
*File method. The registry was an attempt at generalizing this concept.

What do you think about the use of a registry for Vince's idea of 
holding a digest/path reference to large data but not installing it 
until it's used? Other ideas of how / where this could be stored?

Val
#
Hi Val,
On 03/11/2014 08:57 PM, Valerie Obenchain wrote:
It would use File() (or whatever the name of this general constructor
would be) instead of FileForFormat(). Right now FileForFormat() seems
to know how to map almost any supported file extension to the
corresponding *File class:

   > class(FileForFormat("aaa.bed"))
   [1] "BEDFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.gff"))
   [1] "GFFFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.wig"))
   [1] "WIGFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.xz"))
   [1] "XZFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.2bit"))
   [1] "TwoBitFile"
   attr(,"package")
   [1] "rtracklayer"

   > class(FileForFormat("aaa.bam"))
   [1] "BamFile"
   attr(,"package")
   [1] "Rsamtools"

   > class(FileForFormat("aaa.fasta"))
   [1] "FastaFile"
   attr(,"package")
   [1] "rtracklayer"

   etc...

To my knowledge it doesn't use a registration mechanism for that.
It only fails for a couple of common extensions e.g.:

   > class(FileForFormat("aaa.fa"))
   Error in FileForFormat("aaa.fa") : Format 'fa' unsupported

but in such situation it seems easy enough to just add the
".fa" -> FastaFile mapping as permanent knowledge to FileForFormat(),
rather than having a registration mechanism that allows Rsamtools
to dynamically add the mapping at load time.

Another situation that can happen is that a new *File class is
implemented in a package (say foo) other than rtracklayer or
Rsamtools to support a new file format/extension. Again it sounds
easy enough to add the new mapping as permanent knowledge to
FileForFormat(). Bioconductor is a collaborative environment where
developers are encouraged to talk to each other :-)

Now I'm assuming that most of the file formats that are meaningful
for Bioconductor are already covered with an "ext" -> *File mapping
and that the need for new mappings is going to be a rare event.
But maybe I'm wrong. IMO a registration mechanism would be justified
if: (1) we anticipate many new *File classes and new mappings to come,
and (2) many of the new *File classes are going to be implemented
in many different packages outside rtracklayer and Rsamtools.
Maybe it's related to the file registry thing. But maybe it's a slightly
different topic. A real detailed example would help.

H.

PS: Note that the ".fasta" extension is currently mapped to FastaFile
(defined in rtracklayer) and FaFile (defined in Rsamtools).
If we had a registration mechanism in place to let rtracklayer
and Rsamtools register mappings at load time, we would end up with
one mapping overriding the other. A registration mechanism that
allows any package to register mappings would make his kind of
conflicts possible and probably more frequent than we would like.

  
    
13 days later
#
Hi,

This discussion went off-line and I wanted to give a summary of what we 
decided to go with.

We'll create a new package, BiocFile, that has a minimal API.

API:
- 'File' class (virtual, reference class) and constructor
- close / open / isOpen
- import / export
- file registry

We won't require existing *File classes to implement yield but would 
'recommend' that new *File classes do. By getting this structure in 
place we can guide future *File developments in a consistent direction 
even if we can't harmonize all current classes. I'll start work on this 
after the release.

Thanks again for the input.

Valerie
On 03/11/2014 10:23 PM, Michael Lawrence wrote: