
RFC: large database interface

3 messages · Thomas Lumley, Egon Schmid, Ross Ihaka

#
I have been playing with a large database interface for R, and have
written one complete but useless demonstration and one incomplete but
potentially useful example (with memory mapping of a fixed-format ASCII
file). The idea is to make the file appear like a matrix or data frame but
not have to read it into the R heap.
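For readers who want a concrete picture of the memory-mapping idea, here is a minimal sketch in Python rather than the C and R of the code linked below. It is not the Rdb implementation. The class name `FixedWidthMatrix`, the `demo` helper, and the file layout (every field the same width, one newline per record) are assumptions made for illustration only.

```python
import mmap
import os
import tempfile


class FixedWidthMatrix:
    """Read-only matrix view over a memory-mapped fixed-format ASCII file.

    Assumed layout (hypothetical): each record holds `ncol` fields of
    `width` bytes followed by a single "\n", so the byte offset of any
    cell can be computed without scanning the file.
    """

    def __init__(self, path, ncol, width):
        self._file = open(path, "rb")
        # Length 0 maps the whole file; pages are loaded only when touched.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self.ncol = ncol
        self.width = width
        self.reclen = ncol * width + 1  # +1 for the trailing newline
        self.nrow = len(self._map) // self.reclen

    def __getitem__(self, index):
        row, col = index
        if not (0 <= row < self.nrow and 0 <= col < self.ncol):
            raise IndexError("cell out of range")
        start = row * self.reclen + col * self.width
        field = self._map[start:start + self.width]
        return float(field.decode("ascii"))

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    """Write a tiny 2 x 3 file, map it, and read one cell back."""
    rows = [(1.0, 2.5, 3.0), (4.0, 5.5, 6.0)]
    with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as f:
        for r in rows:
            line = "".join(f"{x:8.2f}" for x in r) + "\n"
            f.write(line.encode("ascii"))
        path = f.name
    m = FixedWidthMatrix(path, ncol=3, width=8)
    try:
        return m.nrow, m[1, 1]
    finally:
        m.close()
        os.remove(path)
```

Only the requested field is converted to a number on each access, so nothing resembling the whole file is ever copied into the interpreter's heap. The price is that the record length must be known and constant.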

A description and code can be found at
http://www.biostat.washington.edu/~thomas/Rdb.html
                                          Rdb.nw  (noweb literate program)
                                          Rdb.c
                                          Rdb.R

Comments?

Thomas Lumley
------------------------------------------------------+------
Biostatistics		: "Never attribute to malice what  :
Uni of Washington	:  can be adequately explained by  :
Box 357232		:  incompetence" - Hanlon's Razor  :
Seattle WA 98195-7232	:				   :
------------------------------------------------------------


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
Thomas Lumley wrote:
Well, there is a web interface through the Apache module PHP Hypertext
Preprocessor. At http://www.php.net/ there are plenty more database
interfaces.

Personally I think it would be a great idea to interface large datasets with
netCDF

	http://www.unidata.ucar.edu/packages/netcdf
"Why not use an existing database management system for storing
array-oriented data? Relational database software is not suitable for
the kinds of data access supported by the netCDF interface.

First, existing database systems that support the relational model do
not support multidimensional objects (arrays) as a basic unit of data
access. Representing arrays as relations makes some useful kinds of data
access awkward and provides little support for the abstractions of
multidimensional data and coordinate systems. A quite different data
model is needed for array-oriented data to facilitate its retrieval,
modification, mathematical manipulation and visualization.

Related to this is a second problem with general-purpose database
systems: their poor performance on large arrays. Collections of
satellite images, scientific model outputs and long-term global weather
observations are beyond the capabilities of most database systems to
organize and index for efficient retrieval.

Finally, general-purpose database systems provide, at significant cost
in terms of both resources and access performance, many facilities that
are not needed in the analysis, management, and display of
array-oriented data. For example, elaborate update facilities, audit
trails, report formatting, and mechanisms designed for
transaction-processing are unnecessary for most scientific
applications."

On Feb 3 there was a small thread on this mailing list. 

-Egon
#
Egon Schmid writes:
> Thomas Lumley wrote:
> > 
 > > I have been playing with a large database interface for R, and have
 > > written one complete but useless demonstration and one incomplete but
 > > potentially useful example (with memory mapping of a fixed-format ASCII
 > > file). The idea is to make the file appear like a matrix or data frame but
 > > not have to read it into the R heap.
 > > 
 > > A description and code can be found at
 > > http://www.biostat.washington.edu/~thomas/Rdb.html
 > >                                           Rdb.nw  (noweb literate program)
 > >                                           Rdb.c
 > >                                           Rdb.R
 > > 
 > > Comments?

This looks very interesting.  It would be nice if such an
interface were written in a way that could be customized to a variety
of applications.  Being able to read spreadsheets is one thing which
comes to mind.  It might be nice (for example) to have a rather
complex initialization procedure which inspects the dataset thoroughly
and determines things like variable types (if the database does not
contain this information).
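As a rough illustration of what such an inspection pass might look like, here is a sketch in Python rather than R or C. The function names `infer_type` and `infer_column_types`, the type labels, and the choice to treat empty strings and "NA" as missing are all assumptions made for the example, not part of any proposed interface.

```python
def infer_type(values):
    """Guess a column type from its raw string values.

    Returns "integer", "numeric", "character", or "unknown" when every
    value is missing. Empty strings and "NA" count as missing; that
    convention is an assumption of this sketch.
    """
    kinds = set()
    for raw in values:
        v = raw.strip()
        if v == "" or v == "NA":
            continue
        try:
            int(v)
            kinds.add("integer")
            continue
        except ValueError:
            pass
        try:
            float(v)
            kinds.add("numeric")
        except ValueError:
            # One non-numeric value makes the whole column character data.
            return "character"
    if "numeric" in kinds:
        return "numeric"
    if "integer" in kinds:
        return "integer"
    return "unknown"


def infer_column_types(rows):
    """Scan every row once and return one inferred type per column."""
    return [infer_type(column) for column in zip(*rows)]
```

A call such as `infer_column_types([["1", "2.5", "a"], ["2", "3", "NA"]])` classifies the three columns as integer, numeric and character. A real initialization step would presumably also need to handle dates, factors and database-specific missing-value codes.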

Egon Schmid writes:
 > Well, there is a web interface through the Apache module PHP Hypertext
 > Preprocessor. At http://www.php.net/ there are plenty more database
 > interfaces.
 > 
 > Personally I think it would be a great idea to interface large datasets with
 > netCDF
 >
 > "Why not use an existing database management system for storing
 > array-oriented data? Relational database software is not suitable for
 > the kinds of data access supported by the netCDF interface.
 > 

Hmm. Over the past week I have been looking at NetCDF because GMT
(The Generic Mapping Tools)
	http://www.soest.hawaii.edu/wessel/gmt.html
uses NetCDF to store its maps.

[ The maps are rather better than the Becker and Wilks ones because
  they are based on the World Vector Shoreline as well as the CIA WDB
  that B&W use.  They also have the maps prepared pretty well for
  plotting.  The only place where B&W are better is in the naming
  of places ... ]

I wasn't thinking about pulling the data from these maps into R, but
rather just rendering them on a graphics device so that they could
then be added to.

I suspect that when I've done that I'll probably know enough to create
an R/NetCDF link, perhaps using a framework of the type Thomas
proposes.
	Ross