
[Bioc-devel] Update on SQLite-based annotation data package (prototype available)

6 messages · Wolfgang Huber, Vincent Carey, Seth Falcon +1 more

#
Hello all,

We are making progress on converting the annotation data packages to
use SQLite as the backend storage mechanism.

The devel annotation package repository has a prototype of a
SQLite-based annotation data package (hgu95av2db).  If you are running
R-devel, then you should be able to install it via biocLite (sorry,
only source package at this point).

The SQLite-based annotation packages depend on the AnnotationDbi
package which provides an environment-like interface that should be
backwards compatible.  Advanced users can get a connection to the DB
and issue raw SQL queries.  We are also planning to provide more
convenience/accessor functions along the lines of the annotate
package.

Our plan for the upcoming 2.0 release of Bioconductor is to include
both environment-based and SQLite-based annotation packages.

If you maintain a package that makes use of annotation data packages,
it would be good to see if the hgu95av2db prototype will work with
your code (if not, please let us know).

+ seth
#
Hi Seth,

I installed the package, but I get:
No documentation for 'hgu95av2db' in specified packages and libraries:
you could try 'help.search("hgu95av2db")'
No documentation for 'getDb' in specified packages and libraries:
you could try 'help.search("getDb")'
No documentation for 'hgu95av2CHRLOC' in specified packages and libraries:
you could try 'help.search("hgu95av2CHRLOC")'

There is also no vignette. My session info:
R version 2.5.0 Under development (unstable) (2007-01-22 r40543)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=it_IT.UTF-8;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] "tools"     "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "methods"   "base"

other attached packages:
   hgu95av2db AnnotationDbi       RSQLite           DBI       Biobase
    "1.13.91"      "0.0.41"      "0.4-19"      "0.1-12"     "1.13.34"
     fortunes
      "1.3-2"
Cheers
 Wolfgang

  
    
#
Wolfgang Huber <huber at ebi.ac.uk> writes:
Yep.  It really is a prototype.  To get started, try pretending you
have called library(hgu95av2).  IOW, you should have all the same
"environments" (in quotes because now they are S4 instances) and can
treat them as such.

We will put some documentation together for the experimental APIs we
are working on, but things are in flux.  Herve has a vignette-like
document that we will post ASAP.

A few notes on performance.  The database approach is going to be
slower than having everything in memory for many operations.  When
retrieving annotation for reasonably small gene lists, the difference
is not huge.  However, for operations that pull everything from a
given mapping, such as as.list(), the SQLite-based package will be
much slower.

So why are the SQLite-based packages a good thing?  Here are some
thoughts:

  1. They will allow us to deal with much larger data collections.
     The environment-based packages require being able to have all of
     the data in memory at once and provide no easy way to unload the
     data once it has been loaded.  The SQLite-based packages can
     easily handle much larger data sizes and pull only the requested
     data into memory at any one time.

  2. More flexible queries.  With the SQLite-based packages, many
     queries that currently require looping over possibly many entire
     environments can be accomplished in one statement.  Using some
     simple SQL statements, I've been able to improve the performance
     of the hyperGTest function by 10x.  Focused queries will
     generally be much faster with the SQLite-based packages.
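
To make the "one statement instead of a loop" point concrete, here is a
minimal sketch.  It is not the hgu95av2db schema and not what hyperGTest
does internally: the table probe_go, its columns, and the rows are all
made up for illustration, and Python's built-in sqlite3 module is used
only to keep the example self-contained.

```python
import sqlite3

# Hypothetical schema and made-up rows; NOT the real hgu95av2db layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE probe_go (probe_id TEXT, go_id TEXT)")
conn.executemany(
    "INSERT INTO probe_go VALUES (?, ?)",
    [
        ("p1", "GO:A"),
        ("p1", "GO:B"),
        ("p2", "GO:B"),
        ("p3", "GO:C"),
    ],
)


def go_counts(conn, probe_ids):
    """Count, per GO term, how many of the given probes map to it.

    A single grouped query replaces a loop over every probe's entry.
    """
    placeholders = ",".join("?" for _ in probe_ids)
    sql = (
        "SELECT go_id, COUNT(DISTINCT probe_id) FROM probe_go "
        f"WHERE probe_id IN ({placeholders}) "
        "GROUP BY go_id ORDER BY go_id"
    )
    return dict(conn.execute(sql, list(probe_ids)).fetchall())


# Only rows for the two requested probes are pulled into memory.
counts = go_counts(conn, ["p1", "p2"])
```

Only the rows matching the requested probes leave the database, which
also illustrates point 1: the full mapping never has to be in memory.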

+ seth
#
Do we need a SQL tutorial doc?  (I know there are plenty on the
web, but perhaps one focused on the types of queries to
be used here?)  Helper code that 'translates' R-like actions to
SQL may be feasible for some of the more common tasks.
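
As a rough idea of what such 'translation' helpers could look like,
here is a sketch that mirrors three environment-style actions (ls,
mget, as.list) with SQL.  Everything in it is an assumption for
illustration: the table symbol, its columns, the rows, and the helper
names env_ls, env_mget, and env_as_list are hypothetical, and Python's
built-in sqlite3 module stands in for whatever the real packages use.

```python
import sqlite3

# Hypothetical one-table mapping with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbol (probe_id TEXT, symbol TEXT)")
conn.executemany(
    "INSERT INTO symbol VALUES (?, ?)",
    [("p1", "GENE_A"), ("p2", "GENE_B"), ("p3", "GENE_C")],
)


def env_ls(conn):
    """Like ls(env): list the keys of the mapping."""
    rows = conn.execute("SELECT DISTINCT probe_id FROM symbol ORDER BY probe_id")
    return [r[0] for r in rows]


def env_mget(conn, keys):
    """Like mget(keys, env): fetch values for a few keys only.

    Keys with no rows map to an empty list in this sketch.
    """
    result = {k: [] for k in keys}
    placeholders = ",".join("?" for _ in keys)
    sql = f"SELECT probe_id, symbol FROM symbol WHERE probe_id IN ({placeholders})"
    for probe_id, sym in conn.execute(sql, list(keys)):
        result[probe_id].append(sym)
    return result


def env_as_list(conn):
    """Like as.list(env): pull the whole mapping into memory."""
    result = {}
    for probe_id, sym in conn.execute("SELECT probe_id, symbol FROM symbol"):
        result.setdefault(probe_id, []).append(sym)
    return result
```

For example, env_mget(conn, ["p2"]) touches one row, while
env_as_list(conn) reads the entire table.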
#
Vincent Carey 525-2265 <stvjc at channing.harvard.edu> writes:
I'm hoping that an alternative API will solidify Real Soon Now.  I
would much prefer promoting a well-documented API over raw SQL.  Using
raw SQL is effective, but it ties user code to the schema definition.

But perhaps my comments are orthogonal to your suggestion.  A SQL
tutorial with "translations" of R concepts is a great idea.

+ seth
#
I would only promote what Seth would prefer promoting.
Having an API could be a better solution, as it would provide a
unified front-end to the annotation packages while letting the annotation
be stored in a number of different backends (loaded environments, as has
been the case so far, the coming SQLite ones, a remote SQL database, a
web service, etc.).
Having an API would also make it possible to change the SQL schema without
causing a lot of trouble for users.
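
A minimal sketch of that idea, assuming nothing about the eventual
AnnotationDbi design: two interchangeable backends expose the same
lookup method, so calling code never sees the storage or the schema.
The class names, the symbols_for function, the table symbol, and the
rows are all hypothetical, and Python's built-in sqlite3 module is used
only to keep the example self-contained.

```python
import sqlite3


class DictBackend:
    """In-memory backend, standing in for a loaded environment."""

    def __init__(self, data):
        self._data = data

    def lookup(self, key):
        return sorted(self._data.get(key, []))


class SQLiteBackend:
    """SQLite backend; the schema is private to this class."""

    def __init__(self, conn):
        self._conn = conn

    def lookup(self, key):
        rows = self._conn.execute(
            "SELECT symbol FROM symbol WHERE probe_id = ? ORDER BY symbol",
            (key,),
        )
        return [r[0] for r in rows]


def symbols_for(backend, keys):
    """Front-end call: works with any object that has lookup(key)."""
    return {k: backend.lookup(k) for k in keys}


# Made-up data loaded into both backends.
pairs = [("p1", "GENE_A"), ("p2", "GENE_B")]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbol (probe_id TEXT, symbol TEXT)")
conn.executemany("INSERT INTO symbol VALUES (?, ?)", pairs)

mem = DictBackend({"p1": ["GENE_A"], "p2": ["GENE_B"]})
db = SQLiteBackend(conn)
from_mem = symbols_for(mem, ["p1", "p2"])
from_db = symbols_for(db, ["p1", "p2"])
```

If the table or column names change, only SQLiteBackend needs to be
updated; code that calls symbols_for is unaffected.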


Laurent