Storing R objects (was [R] advice requested re: building "good" system (R, SQL db) for handling large datasets)

4 messages · Sean Davis, Jeffrey Horner, Richard Pearson

#
(moved to R-sig-db from R-help)

Jeff,

I have a project where I want to create large numbers of large, complex 
objects (e.g. Bioconductor ExpressionSet objects). I want to store these 
along with metadata (such as what raw data and parameters were used to 
create the object). I will later want to access subsets of these 
objects, with the subset specified by a query. It seems to me the 
natural way to do this would be to store the metadata and the objects 
themselves in database tables, and I have assumed that the objects would 
need to be serialised and stored as BLOBs. It sounds like at present 
there are no plans for infrastructure that would allow me to do this, 
but I would be interested to know if anyone plans to make such a 
scenario possible in the future.

I am assuming in the above that it is not possible to store arbitrarily 
complex R objects in a DB, without a lot of work coercing all the 
various slots in the object to data.frames, and saving the data.frames 
to different tables. I've had a quick scan through the documentation for 
DBI, RODBC, RMySQL and ROracle, but couldn't see any such functionality.
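
For reference, the serialise-to-BLOB idea above can be sketched with base R alone. This is only an illustration: `obj` is a hypothetical stand-in for a real ExpressionSet, and how the resulting raw vector gets bound to a BLOB column depends on the DB driver, which is not shown here.

```r
# Hypothetical stand-in for a large, complex object (e.g. an ExpressionSet).
obj <- list(
  exprs  = matrix(rnorm(20), nrow = 5,
                  dimnames = list(paste0("gene", 1:5), paste0("s", 1:4))),
  params = list(method = "rma", normalise = TRUE)
)

# With connection = NULL, serialize() returns a raw vector instead of
# writing to a connection. That raw vector is what a BLOB column would hold.
blob <- serialize(obj, connection = NULL)

# unserialize() rebuilds the object from the raw vector.
restored <- unserialize(blob)

cat("blob length in bytes:", length(blob), "\n")
cat("round trip identical:", identical(obj, restored), "\n")
```

The object comes back whole without any slot being coerced to a data.frame, but the DB sees only opaque bytes, so any field that needs to be queried still has to be stored separately as metadata.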

An alternative for my situation would be to store the R objects as files 
(using save) and store the metadata and filenames in a DB, but this 
seems to me to add an extra layer of complexity/maintenance. Finally, I 
could of course save everything as files, but one of the reasons for 
storing things in a DB is that I would like to create dynamic web 
pages linked to metadata and results data in the DB.

Best wishes

Richard.
Jeffrey Horner wrote:
#
On Feb 7, 2008 7:16 AM, Richard Pearson
<richard.pearson at postgrad.manchester.ac.uk> wrote:
This type of application comes up often in web design.  The general
thinking is that storing objects (such as images, etc.) on the disk is
just fine.  I would think that you would want to create functions
like:

queryMetadata()           # returns a list of ExpressionSet keys
fetchExprSets()           # takes a list of ExpressionSet keys and returns
                          # a list of ExpressionSets
storeExprSetAndMetadata() # takes an ExpressionSet, stores it, and returns
                          # the associated unique key
....

These would allow you the flexibility of changing underlying storage
mechanisms as you go along to whatever you like without changing the
business code.  The concept of keeping the data model separate from
the rest of the code (that which controls the web application itself)
is one of the key concepts underlying the Model-View-Controller (MVC)
model of application design.
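
A minimal sketch of the three functions named above, assuming a file-backed store. Everything beyond the function names is hypothetical: the store directory, the `dataset` and `method` metadata fields, the key format, and the use of `saveRDS()`/`readRDS()` (a later alternative to `save()`). The index is kept in a data frame here, but it could just as well be a DB table.

```r
# Hypothetical file-backed store: one .rds file per object, plus an index
# data frame holding the metadata.
store_dir <- file.path(tempdir(), "exprset_store")
unlink(store_dir, recursive = TRUE)          # start from an empty store
dir.create(store_dir, recursive = TRUE)
index_file <- file.path(store_dir, "index.rds")

read_index <- function() {
  if (file.exists(index_file)) {
    readRDS(index_file)
  } else {
    data.frame(key = character(), dataset = character(),
               method = character(), stringsAsFactors = FALSE)
  }
}

# Stores an object with its metadata and returns the new unique key.
storeExprSetAndMetadata <- function(obj, dataset, method) {
  index <- read_index()
  key <- sprintf("obj%06d", nrow(index) + 1L)
  saveRDS(obj, file.path(store_dir, paste0(key, ".rds")))
  index <- rbind(index,
                 data.frame(key = key, dataset = dataset, method = method,
                            stringsAsFactors = FALSE))
  saveRDS(index, index_file)
  key
}

# Returns the keys whose metadata match the given filters.
queryMetadata <- function(dataset = NULL, method = NULL) {
  index <- read_index()
  keep <- rep(TRUE, nrow(index))
  if (!is.null(dataset)) keep <- keep & index$dataset == dataset
  if (!is.null(method))  keep <- keep & index$method == method
  index$key[keep]
}

# Takes a vector of keys and returns a named list of the stored objects.
fetchExprSets <- function(keys) {
  objs <- lapply(keys, function(k) {
    readRDS(file.path(store_dir, paste0(k, ".rds")))
  })
  names(objs) <- keys
  objs
}

# Usage, with small matrices standing in for ExpressionSets.
k1 <- storeExprSetAndMetadata(matrix(1:6, 2),  dataset = "A", method = "rma")
k2 <- storeExprSetAndMetadata(matrix(7:12, 2), dataset = "B", method = "rma")
hits <- queryMetadata(dataset = "B")
sets <- fetchExprSets(hits)
```

Callers only ever see keys and objects, so moving the objects into BLOBs, or the index into a DB table, would change the bodies of these functions and nothing else.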

In practical terms, it seems that since R automatically serializes
objects efficiently and in a compressed format, it would be appropriate
to use that mechanism as a first pass; it could be later modified if
necessary.
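
To see what the built-in compression buys, one can write the same object with and without it and compare file sizes. This is a rough sketch: the object is hypothetical and deliberately repetitive, and real expression data will usually compress less well.

```r
# Hypothetical, highly repetitive object so the effect is easy to see.
obj <- matrix(rep(1:10, times = 1e4), ncol = 10)

plain_path <- tempfile(fileext = ".rds")
gz_path    <- tempfile(fileext = ".rds")

saveRDS(obj, plain_path, compress = FALSE)
saveRDS(obj, gz_path, compress = TRUE)    # gzip; this is the default

plain_size <- file.info(plain_path)$size
gz_size    <- file.info(gz_path)$size
cat("uncompressed:", plain_size, "bytes; compressed:", gz_size, "bytes\n")
```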

Just my $0.02 worth.

Sean
#
Richard Pearson wrote on 02/07/2008 06:16 AM:
Richard, I humbly suggest you actually benchmark how long it takes to 
retrieve a 2GB object from the filesystem into R. Then, add the time it 
takes to subset the object and print it on the console. Now, add the 
overhead of constructing the web page of that subset. Will the users of 
your web application wait that long for their results? Now swap out the 
filesystem and place the objects in the DB; that will obviously be slower, 
right?
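
A benchmark along these lines can be put together with `system.time()`. The sketch below uses a hypothetical numeric matrix far smaller than 2GB, so the dimensions would need scaling up to resemble the real objects before the timings mean anything.

```r
# Hypothetical stand-in object; increase the dimensions to match real data.
big  <- matrix(rnorm(1e6), nrow = 1000)
path <- tempfile(fileext = ".rds")
saveRDS(big, path)

# Time reading the object back from the filesystem ...
load_time <- system.time(obj <- readRDS(path))

# ... and time taking a small subset once it is in memory.
subset_time <- system.time(sub <- obj[1:10, 1:5])

cat("load:  ", load_time[["elapsed"]], "s\n")
cat("subset:", subset_time[["elapsed"]], "s\n")
```

Repeated reads of the same file are likely to be served from the operating system's file cache, so a first, cold read is the more honest number.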

Consider splitting your objects into a coherent db schema and only pull 
into R, or a web page, the parts that you want to analyze and display.

Jeff
#
I have perhaps confused the issue by mentioning the web application. The 
web application will only be based on small tables of results and 
metadata - I will not need any access to the large objects from the web 
application. I will however need access to the large objects from R, so 
I am thinking about how I should organise the storage of these objects. 
I think I will use Sean's fine suggestion (worth far more than $0.02!), 
but will store my large objects as files, rather than in the DB.

Many thanks to Jeff, Sean and Dirk for the great replies - much appreciated!

Richard.
Jeffrey Horner wrote: