Hi,
Well, earlier this week I committed to taking a look at the Rdbi
package and to producing a first draft for the common database
interface (which I'm attaching as two text files: DBI.tex and
figure1.eps).
The Rdbi package looks quite reasonable to me. It defines most of
the core functionality we need, and I think that it wouldn't be too
hard to make existing packages be compatible with it. I also have
some suggestions and comments (in no particular order).
1. R/Splus compatibility.
From an R/Splus compatibility point of view, I'd suggest
renaming Rdbi to something neutral. Initially I had thought
as RSdbi, but why not simply call it DBI -- and thus friendly
to both.
2. I would add 2 more generic functions
dbRemoveTable(con, tbl.name, ...)
dbExistsTable(con, tbl.name,...)
in addition to the dbReadTable, dbWriteTable, etc.
Yes, these are needed. I must have forgotten to move them over from RPgSQL.
The reason is that with these 2 we would have all the methods
require to be able to attach() any DBMS to the search path (thus
would allow us to simply attach(conn) and treat the remote tables
as simple data.frames). R does not yet provide facilities for
attaching arbitrary databases, but I believe that Duncan (and John?)
will be doing this soon(?).
This has been my dream all along---to be able to attach a database and
use the native R interface to access the data. I'd be very excited to
see some general hooks into the 'attach' command. Even still, this is
actually not at all simple to implement without write access to the
database. Also, as I've learned SQL, I find I want to run queries
directly rather than use R on proxies. Anyway, it would be a nice feature.
3. I would add another class to group the various driver classes
(RODBC, RPgSQL, etc.)
I used class 'PgSQL' in Rdbi.PgSQL, but the names are arbitrary.
What I meant to say was that we should add/consider a virtual driver
class to group all actual drivers. Conceptually all the drivers
(RPgSQL, RODBC, etc) extend the driver virtual class.
4. Result class
Rdbi defines only one class Rdbi.result to represent the result
of a query, yet there are 2 types of queries -- those that produce
output and those that don't. I can see that for simplicity having
only one is advantageous, but perhaps we should distinguish them
(the result of SELECT-like queries that produce output could
extend the results that don't).
Interesting. I hadn't considered this. I don't see it as immediately
usefull. Perhaps some mock code examples would help.
5. I would replace the generic functions
dbConnectionInfo(conn)
dbResultInfo(res)
with a single dbGetInfo(obj), say, that dispatches on
connections, results (and drivers, see 3. above) to retrieve
meta-data for those objects.
(Probably I would leave dbColumnInfo(res) as defined in the
Rdbi package to explicitly extract meta-data from the individual
columns of the result set.)
6. dbReconnect()
From the comments in the Rdbi code, it appears that it's meant to
take a DBMS connection object (even from a previous session) and
re-instantiate the (broken) connection. It seems to me that this
would require storing passwords in the regular .RData or .Data.
From a security point of view, I'm uncomfortable doing this.
Is there a way of re-connecting without saving passwords in
obvious places? (I guess that this is more of an implementation
issue?)
Perhaps we should just leave this out for now. I once thought it would
be nice if table proxies could reinitialize their connections
automatically, but it not essential at this time.
7. Exceptions.
We need a dbGetException(conn) generic to report if and when
there are errors in a DBMS connection "conn"; if there has been
an error, it should report the code (number) and/or message.
In Rdbi, these are available using dbConnectionInfo(conn). Do we need a
seperate function? (I'm not saying we don't.)
I guess not. The important thing is the ability to obtain the status
of the operation, so having it as part of the dbConnectionInfo is
indeed fine. Perhaps the DBI should then specify the member names,
say "error.number" and "error.message" (say)?
8. Version information.
I think we want to ensure that packages that implement the DBI
report what version of the DBI they implements -- this could
easily be reported by the dbGetInfo() method of the driver. (We
may even require that each driver implementation reports its
version number.)
9. Common utilities.
Most packages that will implement the DBI will need some
common utilities, e.g., functions to map R/S names to SQL
valid identifiers, to "guess" the SQL type of R/S objects, etc.
We should provide these common methods in the DBI (of course,
individual packages may choose to extend them or completely
override them).
I put these in util.R in Rdbi.
Oops, obviously I missed them. Now I see the make.db.names() in
PgSQL that I also missed.
10. Data mappings.
As Tim has already pointed out, this needs more thought. The
packages RJava, RPerl, RPython (and the netscape R plugin)
have dealt with this issue (just like the R-Excel,R-DCOM, etc.)
We may want to implement the same converters ideas in some
of those?
Unless we are prepared to store metadata in seperate tables (requires
write access to the database), we're pretty much limited to dealing with
unique types defined in the database. My solution (types.R in
Rdbi.PgSQL) was a function that returns the SQL type given an R object,
a function that takes an R object and creates a corresponding SQL
formatted string and a function that takes an SQL formatted string and
converts it back to the R type. These could be made into generic methods.
11. Multiple instance of connections, result sets, etc.
The Rdbi package is silent re: allowing multiple DBMS object
simultaneously, i.e., multiple open drivers, DBMS connections,
result sets (RODBC, for instance, already implement multiple
connections).
This was true in RPgSQL, but not Rdbi. In Rdbi, you can have as many
'conn' and 'result' objects open as resources permit. I don't have the
concept of a driver object, because I'm not convinced its needed.
Strictly speaking, probably we could get away without it, but I
think having a driver class extended by the various implementations
could be useful. For instance, one thing that driver objects can
provide is help with conversions. Data type mappings probably
are best specified for drivers -- not connections nor result sets.
For instance, suppose dbDataType is a generic that returns the
"closest" data type for the R object x on DBMS drv
dbDataType(drv, x)
then as new drivers get written, they extend dbDataType in the
obvious way. It would be possible to specify the driver as a
character string instead of the driver object and use a switch
statement inside the function, but this would be error-prone and
would require keeping the dbDataType up-to-date.
Similarly (but perhaps not as interesting) for mapping names
(identifiers) from R and the DBMS, as in make.db.names in RPgSQL.
Another use is for listing all open connections (not unlike what
showConnections/getConnections currently do in R). See below.