[Bioc-devel] SQLite databases
Hi,

Francois Pepin <fpepin at cs.mcgill.ca> writes:
I would personally very much appreciate something of the sort and I know several other of my collaborators would also. My personal favorite was the idea of a by-species package that would behave just like the chip annotation. To use EntrezID instead of the probe ids and to have all the xxxGO, xxxENZYME, xxxSYMBOL, etc.
This, in particular, is on the way. We plan to have <what>EG.db packages for <what> = human, mouse, and rat. These will replace the <what>LLMappings packages, be SQLite-based, and look as much as possible like the standard chip packages in terms of the maps provided and interface.
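To make the idea of a by-species map keyed by Entrez Gene ID concrete, here is a minimal sketch of a read-only, dictionary-like view over a SQLite table. It is written in Python purely for illustration and is not how AnnotationDbi or the planned packages are implemented. The `SqliteMap` class, the `gene_symbol` table, and its column names are all hypothetical.

```python
import sqlite3
from collections.abc import Mapping


class SqliteMap(Mapping):
    """Read-only, dict-like view of a two-column SQLite table (hypothetical helper)."""

    def __init__(self, conn, table, key_col, value_col):
        # Table and column names cannot be bound as SQL parameters, so they
        # are supplied by the developer here, never taken from user input.
        self._conn = conn
        self._table = table
        self._key = key_col
        self._value = value_col

    def __getitem__(self, key):
        # Return every value stored for the key, so one-to-many maps also work.
        rows = self._conn.execute(
            f"SELECT {self._value} FROM {self._table} "
            f"WHERE {self._key} = ? ORDER BY {self._value}",
            (key,),
        ).fetchall()
        if not rows:
            raise KeyError(key)
        return [row[0] for row in rows]

    def __iter__(self):
        cursor = self._conn.execute(
            f"SELECT DISTINCT {self._key} FROM {self._table} ORDER BY {self._key}"
        )
        return (row[0] for row in cursor)

    def __len__(self):
        return self._conn.execute(
            f"SELECT COUNT(DISTINCT {self._key}) FROM {self._table}"
        ).fetchone()[0]


# Tiny in-memory database standing in for a by-species package (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_symbol (gene_id TEXT NOT NULL, symbol TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO gene_symbol VALUES (?, ?)",
    [("7157", "TP53"), ("672", "BRCA1"), ("1956", "EGFR")],
)
conn.commit()

# The map is keyed by Entrez Gene ID rather than by probe ID.
symbol_map = SqliteMap(conn, "gene_symbol", "gene_id", "symbol")
tp53 = symbol_map["7157"]
```

With a wrapper like this, a user can look up `symbol_map["7157"]` or test `"7157" in symbol_map` without writing any SQL, while the data itself stays in an ordinary SQLite file.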
On Mon, 2007-06-11 at 08:52 -0400, Sean Davis wrote:
Now that RSQLite and DBI are really beginning to merge with Bioconductor tools, does it make sense to think about building data sources (SQLite databases) as a base for further development? As an example, might it make sense to include all of the data available at the Entrez Gene ftp site as a database file? Does a repository of such database files (and possibly supporting files) make sense? Making such files is pretty straightforward, but what makes the most sense for distribution? A full package with accessors, etc? A simple sqlite file? Something in between? I may be asking questions for which the answers are already known/decided, but it would be good to know anyway.
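As a rough illustration of how straightforward it can be to turn a tab-delimited download into a SQLite file, the sketch below loads a few rows into a single table and runs one query. It uses Python's standard library only. The three-column layout, the `gene_info` table name, and the `build_gene_db` function are simplifying assumptions for this example; the real Entrez Gene files carry many more columns.

```python
import csv
import io
import sqlite3

# Inline sample standing in for a downloaded tab-delimited file.
# Assumed, simplified layout: tax_id, GeneID, Symbol.
SAMPLE = (
    "tax_id\tGeneID\tSymbol\n"
    "9606\t672\tBRCA1\n"
    "9606\t1956\tEGFR\n"
    "9606\t7157\tTP53\n"
    "10090\t22059\tTrp53\n"
)


def build_gene_db(lines, db_path=":memory:"):
    """Load tab-delimited gene rows into a SQLite database (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE gene_info ("
        "tax_id INTEGER NOT NULL, "
        "gene_id INTEGER PRIMARY KEY, "
        "symbol TEXT NOT NULL)"
    )
    reader = csv.DictReader(lines, delimiter="\t")
    rows = [(int(r["tax_id"]), int(r["GeneID"]), r["Symbol"]) for r in reader]
    with conn:  # commit all inserts as a single transaction
        conn.executemany("INSERT INTO gene_info VALUES (?, ?, ?)", rows)
    # An index on tax_id keeps per-species queries fast on a full-size table.
    conn.execute("CREATE INDEX idx_gene_info_tax ON gene_info (tax_id)")
    return conn


conn = build_gene_db(io.StringIO(SAMPLE))

# A general SQL query: all human (taxonomy ID 9606) gene symbols.
human_symbols = [
    row[0]
    for row in conn.execute(
        "SELECT symbol FROM gene_info WHERE tax_id = ? ORDER BY symbol", (9606,)
    )
]
```

Passing a file path instead of `":memory:"` would produce a standalone database file that could be distributed on its own or bundled inside a package.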
Our plan is to have all BioC annotation data packages be SQLite-based. There is a package in devel called AnnotationDbi that implements an interface for SQLite-based annotation packages, allowing them to be used just like their environment-based cousins. We are actively working on this interface and on making the set of SQLite-based packages complete.

In the process of creating these packages, we are building a new package-building pipeline in which we generate larger intermediate DBs from which the individual annotation packages are generated. At least in principle, these are along the lines of a SQLite DB containing data from the Entrez Gene ftp site. Whether these intermediate DBs will be of use to others isn't clear to me, but when our process gels a bit more, we will be happy to share what we have.

Generally, I think it will be useful to distribute SQLite DB versions of public annotation data since this:
- supports general SQL queries
- works across platforms
- can be accessed from just about any programming language

But in terms of making things easily accessible to Bioconductor users, simply making a SQLite DB file available is not, in general, going to be enough. If we want users to be able to access the data without writing SQL, then we will need careful study of the DB schema, plus interface classes that provide alternate query mechanisms.

Best,

+ seth
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center http://bioconductor.org