Rdbi package [forwarded msg] - R-SIG-DB

Mon, Oct 1, 2001 12:19 AM #

Ok.  Will wait till tomorrow morning to allow for further reactions.

Best,
-k

David James

Mon, Oct 1, 2001 1:40 PM #

Kurt Hornik wrote:

Not that I anticipate any major issues, but could we wait until the end
of this week?  I'd like to check a couple of things (one that Torsten
brought to my attention last week and has to do with data conversions,
and portability to Splus being the other).

Thanks,

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

Tim Keitt

Mon, Oct 1, 2001 2:05 PM #

I haven't had time to fix up the data conversion stuff. I have some old 
routines from RPgSQL in Rdbi.PgSQL. The choice is to leave the 
conversion behavior undefined and let each specific implementation 
handle conversion however they see fit, or to define a set of generic 
functions to be subclassed. The latter is preferable, but probably 
difficult to get right given the heterogeneity among potential data 
sources. (Or more accurately, I mean that its difficult to do cleanly; 
if you cover all the marginal cases, it will get unwieldy.)

Tim

David James wrote:

I'm not running Splus, so I haven't tried to make it compatible. It 
would seem to be a good thing to do, although I admit, I've been pretty 
unhappy with Mathsoft in the past because of their callous attitude 
towards UNIX, which (as far as I know) is the environment where S 
originated.

Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/

David James

Mon, Oct 1, 2001 2:45 PM #

Timothy H. Keitt wrote:

No need to do this, for the time being.  If I'm not mistaken, we're 
taking the Rdbi package to be a starting point for a common interface, 
so by necessity it will have to evolve as we gain experience with it
and its various driver implementations.  

I still think that we need a document (in English, that is) to describe 
the interface, its classes and generic functions, perhaps the various data 
conversion mechanisms, ways to query the metadata, and so forth
e.g., what to do with SQL keywords in R/S objects that we export to DBMS.
Also, it is in this kind of document were we could identify core and optional
features.  (Both the python and perl interfaces have such a 
specification document.)

I understand, and I wouldn't expect people to be doing both R and 
S implementations.  All I'm suggesting is to see if we could follow 
some simple guidelines to make it easier for other people to port 
from one system to the other.  My guess is that it'll be users and/or
some of us --not Insightful-- doing the porting.

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

Tim Keitt

Mon, Oct 1, 2001 3:44 PM #

David James wrote:

Agreed.

Agreed, also, but its not likely I will be able to do this anytime soon.

T.

Tim

David James wrote:

Kurt Hornik wrote:

M Edward Borasky writes:

Sounds good to me. At some point in the near future I may get a chance
to test it on a Windows box with MS Access; I've been using RODBC for
that.

Ok.  Will wait till tomorrow morning to allow for further reactions.

Not that I anticipate any major issues, but could we wait until the end
of this week?  I'd like to check a couple of things (one that Torsten
brought to my attention last week and has to do with data conversions,
and portability to Splus being the other).

I'm not running Splus, so I haven't tried to make it compatible. It 
would seem to be a good thing to do, although I admit, I've been pretty 
unhappy with Mathsoft in the past because of their callous attitude 
towards UNIX, which (as far as I know) is the environment where S 
originated.

I understand, and I wouldn't expect people to be doing both R and 
S implementations.  All I'm suggesting is to see if we could follow 
some simple guidelines to make it easier for other people to port 
from one system to the other.  My guess is that it'll be users and/or
some of us --not Insightful-- doing the porting.

Thanks,

Best,
-k

-----Original Message-----
From: r-sig-db-admin at stat.math.ethz.ch
[mailto:r-sig-db-admin at stat.math.ethz.ch]On Behalf Of Kurt Hornik
Sent: Sunday, September 30, 2001 2:58 AM
To: Kurt.Hornik at ci.tuwien.ac.at
Cc: David James; R-SIG-DB at stat.math.ethz.ch; Timothy H. Keitt; Saikat
DebRoy; Torsten Hothorn
Subject: [R-sig-DB] Re: Rdbi package [forwarded msg]
I have not seen any followup on this for almost a month.

Hence, I assume that everyone is happy with the new design, and suggest
to move Tim's package from contrib/Devel to contrib.

-k

-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/

Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/

M. Edward (Ed) Borasky

Mon, Oct 1, 2001 4:36 PM #

On Mon, 1 Oct 2001, David James wrote:

Fine with me ... I need to do some kind of hack with RODBC and Rdbi to get to
my Access databases; the PostGres interface is the only one available so far.

--
znmeb at aracnet.com (M. Edward Borasky) http://www.aracnet.com/~znmeb

You can't make a sow's ear out of a silk purse either.

Kurt Hornik

Tue, Oct 2, 2001 2:28 AM #

End of this week is fine.

-k

David James

Tue, Oct 2, 2001 5:11 AM #

Timothy H. Keitt wrote:

No problem, I'll come up with a first draft by the end of this week.

T.

Tim

David James wrote:

Kurt Hornik wrote:

M Edward Borasky writes:

Sounds good to me. At some point in the near future I may get a chance
to test it on a Windows box with MS Access; I've been using RODBC for
that.

Ok.  Will wait till tomorrow morning to allow for further reactions.

Not that I anticipate any major issues, but could we wait until the end
of this week?  I'd like to check a couple of things (one that Torsten
brought to my attention last week and has to do with data conversions,
and portability to Splus being the other).

I'm not running Splus, so I haven't tried to make it compatible. It 
would seem to be a good thing to do, although I admit, I've been pretty 
unhappy with Mathsoft in the past because of their callous attitude 
towards UNIX, which (as far as I know) is the environment where S 
originated.

I understand, and I wouldn't expect people to be doing both R and 
S implementations.  All I'm suggesting is to see if we could follow 
some simple guidelines to make it easier for other people to port 
from one system to the other.  My guess is that it'll be users and/or
some of us --not Insightful-- doing the porting.

Thanks,

Best,
-k

-----Original Message-----
From: r-sig-db-admin at stat.math.ethz.ch
[mailto:r-sig-db-admin at stat.math.ethz.ch]On Behalf Of Kurt Hornik
Sent: Sunday, September 30, 2001 2:58 AM
To: Kurt.Hornik at ci.tuwien.ac.at
Cc: David James; R-SIG-DB at stat.math.ethz.ch; Timothy H. Keitt; Saikat
DebRoy; Torsten Hothorn
Subject: [R-sig-DB] Re: Rdbi package [forwarded msg]
I have not seen any followup on this for almost a month.

Hence, I assume that everyone is happy with the new design, and suggest
to move Tim's package from contrib/Devel to contrib.

-k

-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/


-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

David James

Fri, Oct 5, 2001 9:04 PM #

Kurt Hornik wrote:

Hi,

Well, earlier this week I committed to taking a look at the Rdbi
package and to producing a first draft for the common database
interface (which I'm attaching as two text files: DBI.tex and
figure1.eps).

The Rdbi package looks quite reasonable to me.  It defines most of
the core functionality we need, and I think that it wouldn't be too
hard to make existing packages be compatible with it.  I also have
some suggestions and comments (in no particular order).

1. R/Splus compatibility.

   From an R/Splus compatibility point of view, I'd suggest
   renaming Rdbi to something neutral.  Initially I had thought 
   as RSdbi, but why not simply call it DBI -- and thus friendly
   to both.

2. I would add 2 more generic functions 

       dbRemoveTable(con, tbl.name, ...)
       dbExistsTable(con, tbl.name,...)  

   in addition to the dbReadTable, dbWriteTable, etc. 

   The reason is that with these 2 we would have all the methods
   require to be able to attach() any DBMS to the search path (thus
   would allow us to simply attach(conn) and treat the remote tables
   as simple data.frames).  R does not yet provide facilities for
   attaching arbitrary databases, but I believe that Duncan (and John?) 
   will be doing this soon(?).

3. I would add another class to group the various driver classes
   (RODBC, RPgSQL, etc.)

4. Result class.

   Rdbi defines only one class Rdbi.result to represent the result
   of a query, yet there are 2 types of queries -- those that produce
   output and those that don't.  I can see that for simplicity having
   only one is advantageous, but perhaps we should distinguish them
   (the result of SELECT-like queries that produce output could
   extend the results that don't).

5. I would replace the generic functions

         dbConnectionInfo(conn)
         dbResultInfo(res)
   
    with a single dbGetInfo(obj), say, that dispatches on
    connections, results (and drivers, see 3. above) to retrieve
    meta-data for those objects.

    (Probably I would leave dbColumnInfo(res) as defined in the
    Rdbi package to explicitly extract meta-data from the individual
    columns of the result set.)

6. dbReconnect() 

   From the comments in the Rdbi code, it appears that it's meant to
   take a DBMS connection object (even from a previous session) and
   re-instantiate the (broken) connection.  It seems to me that this
   would require storing passwords in the regular .RData or .Data.
   From a security point of view, I'm uncomfortable doing this.
   Is there a way of re-connecting without saving passwords in
   obvious places?  (I guess that this is more of an implementation
   issue?)

7. Exceptions.

   We need a dbGetException(conn) generic to report if and when
   there are errors in a DBMS connection "conn";  if there has been
   an error, it should report the code (number) and/or message.

8. Version information.

   I think we want to ensure that packages that implement the DBI
   report what version of the DBI they implements -- this could
   easily be reported by the dbGetInfo() method of the driver. (We
   may even require that each driver implementation reports its
   version number.)

9. Common utilities.

   Most packages that will implement the DBI will need some
   common utilities, e.g., functions to map R/S names to SQL
   valid identifiers, to "guess" the SQL type of R/S objects, etc.
   We should provide these common methods in the DBI (of course,
   individual packages may choose to extend them or completely
   override them).

10. Data mappings.

    As Tim has already pointed out, this needs more thought. The
    packages RJava, RPerl, RPython (and the netscape R plugin)
    have dealt with this issue (just like the R-Excel,R-DCOM, etc.)
    We may want to implement the same converters ideas in some
    of those?

11. Multiple instance of connections, result sets, etc.

    The Rdbi package is silent re: allowing multiple DBMS object
    simultaneously, i.e., multiple open drivers, DBMS connections,
    result sets (RODBC, for instance, already implement multiple
    connections).
 
    I would suggest that the DBI should be explicit about
    the possibility for having multiple instances of any of
    these object.  Of course, each DBMS driver should be free to
    implement either single or multiple instances as it sees fit.
    We then should define a couple of generic functions in the
    spirit of getConnections (available both in R and Splus),
    but for DBI objects, etc.

      dbGetConnections(drv)     # return connections on driver drv
      dbGetResults(conn)        # return result sets on connection conn
      
    these could return a container (i.e., a list) of connections and
    result objects, respectively (possibly with only one object).

12. At one point we had thought about making the DBMS interface 
    general enough so that non-relational DBMS could be covered 
    under the same API (HDF5 comes to mind, but also Berkeley DB).
    Is this reasonable?  Or are they so different to relational DBMS 
    that it's  just not feasible?  Or is it us that we have been too
    narrowly focus?  Somebody that's using non-relational should take
    a closer look and let us know....


Now, let me move on to the draft paper on a common interface.

Based on the Rdbi package, the above points, and previous
discussions in various lists/forums, I've put together a rough
draft that attempts to flush out in enough detail a framework for
implementing the interface.  I hope that after a few iterations,
it can be used as a guide by DBMS package developers.

I found that describing the interface in terms of classes is a lot
more natural (and easier) than in terms of (only) generic functions and
methods.  Moreover, as I looked at other DB API's (Python, Perl, Java)
I felt more strongly that perhaps we should be using version 4
class-based instead of traditional version 3 object-based programming.
With the current library(methods) in R-devel in quite a reasonable shape, 
I think the DBI is an ideal candidate project for using more advanced
programming tools.  

So, let me throw in the idea of building the DBI using version 4
style classes.  I believe that most of the (current) Rdbi could
be re-used, and I'd volunteer to migrate it to the version 4 style.

Regards,

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

-------------- next part --------------
A non-text attachment was scrubbed...
Name: DBI.tex
Type: application/x-tex
Size: 23865 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-db/attachments/20011006/87f9f84b/attachment.tex>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: figure1.eps
Type: application/postscript
Size: 8110 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-db/attachments/20011006/87f9f84b/attachment.eps>

Kurt Hornik

Sun, Oct 7, 2001 11:17 AM #

I had raised this with Tim.  One could argue for `SDBI' a la Omegahat
but I do not think the `S' is necessary.

I will wait with moving up the packages until the naming has been
deciced.

-k

David James

Sun, Oct 7, 2001 7:24 PM #

Sorry, I forgot to include the file "biblio.bib" needed by
DBI.tex.  I'm attaching it below.

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

David James

Sun, Oct 7, 2001 7:25 PM #

David James wrote:

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: biblio.bib
URL: <https://stat.ethz.ch/pipermail/r-sig-db/attachments/20011007/9269caf5/attachment.ksh>

Tim Keitt

Mon, Oct 8, 2001 10:51 AM #

I think we should hold off on the packages as David's document specifies 
a much bigger interface than is in Rdbi. We need some time to discuss 
David's proposal. I'm fine with 'DBI' or even 'database' for the package 
name. I hope to write some comments in the next week or so.

Tim

Kurt Hornik wrote:

Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/

Tim Keitt

Mon, Oct 8, 2001 12:19 PM #

David James wrote:

Agreed.

Yes, these are needed. I must have forgotten to move them over from RPgSQL.

This has been my dream all along---to be able to attach a database and 
use the native R interface to access the data. I'd be very excited to 
see some general hooks into the 'attach' command. Even still, this is 
actually not at all simple to implement without write access to the 
database. Also, as I've learned SQL, I find I want to run queries 
directly rather than use R on proxies. Anyway, it would be a nice feature.

I used class 'PgSQL' in Rdbi.PgSQL, but the names are arbitrary.

Interesting. I hadn't considered this. I don't see it as immediately 
usefull. Perhaps some mock code examples would help.

Agreed.

Perhaps we should just leave this out for now. I once thought it would 
be nice if table proxies could reinitialize their connections 
automatically, but it not essential at this time.

In Rdbi, these are available using dbConnectionInfo(conn). Do we need a 
seperate function? (I'm not saying we don't.)

Agreed.

I put these in util.R in Rdbi.

Unless we are prepared to store metadata in seperate tables (requires 
write access to the database), we're pretty much limited to dealing with 
unique types defined in the database. My solution (types.R in 
Rdbi.PgSQL) was a function that returns the SQL type given an R object, 
a function that takes an R object and creates a corresponding SQL 
formatted string and a function that takes an SQL formatted string and 
converts it back to the R type. These could be made into generic methods.

This was true in RPgSQL, but not Rdbi. In Rdbi, you can have as many 
'conn' and 'result' objects open as resources permit. I don't have the 
concept of a driver object, because I'm not convinced its needed.

Its reasonable, but is it essential? Would anybody use these functions?

This is a good point.

Agreed. Does R yet support v.4 style classes? Is there a good 
reference/tutorial?

Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/

David James

Mon, Oct 8, 2001 7:15 PM #

Timothy H. Keitt wrote:

David James wrote:

Hi,

Well, earlier this week I committed to taking a look at the Rdbi
package and to producing a first draft for the common database
interface (which I'm attaching as two text files: DBI.tex and
figure1.eps).

The Rdbi package looks quite reasonable to me.  It defines most of
the core functionality we need, and I think that it wouldn't be too
hard to make existing packages be compatible with it.  I also have
some suggestions and comments (in no particular order).

1. R/Splus compatibility.

  From an R/Splus compatibility point of view, I'd suggest
  renaming Rdbi to something neutral.  Initially I had thought 
  as RSdbi, but why not simply call it DBI -- and thus friendly
  to both.

Agreed.


2. I would add 2 more generic functions 

      dbRemoveTable(con, tbl.name, ...)
      dbExistsTable(con, tbl.name,...)  

  in addition to the dbReadTable, dbWriteTable, etc.

Yes, these are needed. I must have forgotten to move them over from RPgSQL.


  The reason is that with these 2 we would have all the methods
  require to be able to attach() any DBMS to the search path (thus
  would allow us to simply attach(conn) and treat the remote tables
  as simple data.frames).  R does not yet provide facilities for
  attaching arbitrary databases, but I believe that Duncan (and John?) 
  will be doing this soon(?).

This has been my dream all along---to be able to attach a database and 
use the native R interface to access the data. I'd be very excited to 
see some general hooks into the 'attach' command. Even still, this is 
actually not at all simple to implement without write access to the 
database. Also, as I've learned SQL, I find I want to run queries 
directly rather than use R on proxies. Anyway, it would be a nice feature.


3. I would add another class to group the various driver classes
  (RODBC, RPgSQL, etc.)

I used class 'PgSQL' in Rdbi.PgSQL, but the names are arbitrary.

What I meant to say was that we should add/consider a virtual driver
class to group all actual drivers.  Conceptually all the drivers
(RPgSQL, RODBC, etc) extend the driver virtual class.


4. Result class

  Rdbi defines only one class Rdbi.result to represent the result
  of a query, yet there are 2 types of queries -- those that produce
  output and those that don't.  I can see that for simplicity having
  only one is advantageous, but perhaps we should distinguish them
  (the result of SELECT-like queries that produce output could
  extend the results that don't).

Interesting. I hadn't considered this. I don't see it as immediately 
usefull. Perhaps some mock code examples would help.


5. I would replace the generic functions

        dbConnectionInfo(conn)
        dbResultInfo(res)
  
   with a single dbGetInfo(obj), say, that dispatches on
   connections, results (and drivers, see 3. above) to retrieve
   meta-data for those objects.

   (Probably I would leave dbColumnInfo(res) as defined in the
   Rdbi package to explicitly extract meta-data from the individual
   columns of the result set.)

Agreed.


6. dbReconnect() 

  From the comments in the Rdbi code, it appears that it's meant to
  take a DBMS connection object (even from a previous session) and
  re-instantiate the (broken) connection.  It seems to me that this
  would require storing passwords in the regular .RData or .Data.
  From a security point of view, I'm uncomfortable doing this.
  Is there a way of re-connecting without saving passwords in
  obvious places?  (I guess that this is more of an implementation
  issue?)

Perhaps we should just leave this out for now. I once thought it would 
be nice if table proxies could reinitialize their connections 
automatically, but it not essential at this time.


7. Exceptions.

  We need a dbGetException(conn) generic to report if and when
  there are errors in a DBMS connection "conn";  if there has been
  an error, it should report the code (number) and/or message.

In Rdbi, these are available using dbConnectionInfo(conn). Do we need a 
seperate function? (I'm not saying we don't.)

I guess not.  The important thing is the ability to obtain the status
of the operation, so having it as part of the dbConnectionInfo is
indeed fine.  Perhaps the DBI should then specify the member names,
say "error.number" and "error.message" (say)?

Oops, obviously I missed them.  Now I see the make.db.names() in
PgSQL that I also missed.

Strictly speaking, probably we could get away without it, but I
think having a driver class extended by the various implementations
could be useful.  For instance, one thing that driver objects can
provide is help with conversions.  Data type mappings probably
are best specified for drivers -- not connections nor result sets.
For instance, suppose dbDataType is a generic that returns the
"closest" data type for the R object x on DBMS drv

     dbDataType(drv, x)

then as new drivers get written, they extend dbDataType in the
obvious way.  It would be possible to specify the driver as a
character string instead of the driver object and use a switch
statement inside the function, but this would be error-prone and
would require keeping the dbDataType up-to-date.

Similarly (but perhaps not as interesting) for mapping names
(identifiers) from R and the DBMS, as in make.db.names in RPgSQL.

Another use is for listing all open connections (not unlike what
showConnections/getConnections currently do in R). See below.

They would be similar to the current  showConnections/getConnection.

R-devel does, and I understand the 1.4 will have them as a library "methods".

Re: references, I know of two, John Chambers' "Programming with Data"
(green book) and V&R's "S Programming".

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

Tim Keitt

Wed, Oct 10, 2001 10:44 AM #

David James wrote:

Timothy H. Keitt wrote:

David James wrote:

Hi,

Well, earlier this week I committed to taking a look at the Rdbi
package and to producing a first draft for the common database
interface (which I'm attaching as two text files: DBI.tex and
figure1.eps).

The Rdbi package looks quite reasonable to me.  It defines most of
the core functionality we need, and I think that it wouldn't be too
hard to make existing packages be compatible with it.  I also have
some suggestions and comments (in no particular order).

1. R/Splus compatibility.

 From an R/Splus compatibility point of view, I'd suggest
 renaming Rdbi to something neutral.  Initially I had thought 
 as RSdbi, but why not simply call it DBI -- and thus friendly
 to both.

Agreed.

2. I would add 2 more generic functions 

     dbRemoveTable(con, tbl.name, ...)
     dbExistsTable(con, tbl.name,...)  

 in addition to the dbReadTable, dbWriteTable, etc.

Yes, these are needed. I must have forgotten to move them over from RPgSQL.

 The reason is that with these 2 we would have all the methods
 require to be able to attach() any DBMS to the search path (thus
 would allow us to simply attach(conn) and treat the remote tables
 as simple data.frames).  R does not yet provide facilities for
 attaching arbitrary databases, but I believe that Duncan (and John?) 
 will be doing this soon(?).

This has been my dream all along---to be able to attach a database and 
use the native R interface to access the data. I'd be very excited to 
see some general hooks into the 'attach' command. Even still, this is 
actually not at all simple to implement without write access to the 
database. Also, as I've learned SQL, I find I want to run queries 
directly rather than use R on proxies. Anyway, it would be a nice feature.

3. I would add another class to group the various driver classes
 (RODBC, RPgSQL, etc.)

I used class 'PgSQL' in Rdbi.PgSQL, but the names are arbitrary.

What I meant to say was that we should add/consider a virtual driver
class to group all actual drivers.  Conceptually all the drivers
(RPgSQL, RODBC, etc) extend the driver virtual class.

4. Result class

 Rdbi defines only one class Rdbi.result to represent the result
 of a query, yet there are 2 types of queries -- those that produce
 output and those that don't.  I can see that for simplicity having
 only one is advantageous, but perhaps we should distinguish them
 (the result of SELECT-like queries that produce output could
 extend the results that don't).

Interesting. I hadn't considered this. I don't see it as immediately 
usefull. Perhaps some mock code examples would help.

5. I would replace the generic functions

       dbConnectionInfo(conn)
       dbResultInfo(res)
 
  with a single dbGetInfo(obj), say, that dispatches on
  connections, results (and drivers, see 3. above) to retrieve
  meta-data for those objects.

  (Probably I would leave dbColumnInfo(res) as defined in the
  Rdbi package to explicitly extract meta-data from the individual
  columns of the result set.)

Agreed.

6. dbReconnect() 

 From the comments in the Rdbi code, it appears that it's meant to
 take a DBMS connection object (even from a previous session) and
 re-instantiate the (broken) connection.  It seems to me that this
 would require storing passwords in the regular .RData or .Data.
 From a security point of view, I'm uncomfortable doing this.
 Is there a way of re-connecting without saving passwords in
 obvious places?  (I guess that this is more of an implementation
 issue?)

Perhaps we should just leave this out for now. I once thought it would 
be nice if table proxies could reinitialize their connections 
automatically, but it not essential at this time.

7. Exceptions.

 We need a dbGetException(conn) generic to report if and when
 there are errors in a DBMS connection "conn";  if there has been
 an error, it should report the code (number) and/or message.

In Rdbi, these are available using dbConnectionInfo(conn). Do we need a 
seperate function? (I'm not saying we don't.)

I guess not.  The important thing is the ability to obtain the status
of the operation, so having it as part of the dbConnectionInfo is
indeed fine.  Perhaps the DBI should then specify the member names,
say "error.number" and "error.message" (say)?

8. Version information.

 I think we want to ensure that packages that implement the DBI
 report what version of the DBI they implements -- this could
 easily be reported by the dbGetInfo() method of the driver. (We
 may even require that each driver implementation reports its
 version number.)

Agreed.

9. Common utilities.

 Most packages that will implement the DBI will need some
 common utilities, e.g., functions to map R/S names to SQL
 valid identifiers, to "guess" the SQL type of R/S objects, etc.
 We should provide these common methods in the DBI (of course,
 individual packages may choose to extend them or completely
 override them).

I put these in util.R in Rdbi.

Oops, obviously I missed them.  Now I see the make.db.names() in
PgSQL that I also missed.

10. Data mappings.

  As Tim has already pointed out, this needs more thought. The
  packages RJava, RPerl, RPython (and the netscape R plugin)
  have dealt with this issue (just like the R-Excel,R-DCOM, etc.)
  We may want to implement the same converters ideas in some
  of those?

Unless we are prepared to store metadata in seperate tables (requires 
write access to the database), we're pretty much limited to dealing with 
unique types defined in the database. My solution (types.R in 
Rdbi.PgSQL) was a function that returns the SQL type given an R object, 
a function that takes an R object and creates a corresponding SQL 
formatted string and a function that takes an SQL formatted string and 
converts it back to the R type. These could be made into generic methods.

11. Multiple instance of connections, result sets, etc.

  The Rdbi package is silent re: allowing multiple DBMS object
  simultaneously, i.e., multiple open drivers, DBMS connections,
  result sets (RODBC, for instance, already implement multiple
  connections).

This was true in RPgSQL, but not Rdbi. In Rdbi, you can have as many 
'conn' and 'result' objects open as resources permit. I don't have the 
concept of a driver object, because I'm not convinced its needed.

Strictly speaking, probably we could get away without it, but I
think having a driver class extended by the various implementations
could be useful.  For instance, one thing that driver objects can
provide is help with conversions.  Data type mappings probably
are best specified for drivers -- not connections nor result sets.
For instance, suppose dbDataType is a generic that returns the
"closest" data type for the R object x on DBMS drv

    dbDataType(drv, x)

then as new drivers get written, they extend dbDataType in the
obvious way.  It would be possible to specify the driver as a
character string instead of the driver object and use a switch
statement inside the function, but this would be error-prone and
would require keeping the dbDataType up-to-date.

Similarly (but perhaps not as interesting) for mapping names
(identifiers) from R and the DBMS, as in make.db.names in RPgSQL.

Another use is for listing all open connections (not unlike what
showConnections/getConnections currently do in R). See below.

I guess I'm becoming convinced. Actually, I put a driver class into Rdbi 
(based on your original proposal), but its only used to autoload the 
driver package and cause dispatch to the correct connection-generator. I 
think associating data conversion with driver objects, rather than the 
package itself, will be necessary when accessing, for example, different 
postgresql databases that have different sets of user-defined types. 
That would require a per-driver set of conversion routines.

My hesitancy is based on my assumption that 95% of users will access the 
same database every R session and will not enjoy having to reconstruct 
their environment from scratch each time. That's why I'm interested in 
persistent connections and so on. In fact, I'd like to find a clean way 
to have a default connection be used when no connection object is 
supplied to the query routines. Perhaps the best way to do this is to 
create another package that sits on top of the SDBI (or whatever we 
decide to call it) package and provides a seperate interface that 
internally hides the driver, connection and result objects.

Actually, these might be useful for implementing the simplified 
interface described above. But does this mean we will have to explicitly 
destroy all connection and result objects? I think any objects generated 
should be garbage collected like any other R object; otherwise you are 
introducing a new object model. (I suppose connections could unregister 
themselves from the driver when garbage collected.)

12. At one point we had thought about making the DBMS interface 
  general enough so that non-relational DBMS could be covered 
  under the same API (HDF5 comes to mind, but also Berkeley DB).
  Is this reasonable?  Or are they so different to relational DBMS 
  that it's  just not feasible?  Or is it us that we have been too
  narrowly focus?  Somebody that's using non-relational should take
  a closer look and let us know....

This is a good point.


Now, let me move on to the draft paper on a common interface.

Based on the Rdbi package, the above points, and previous
discussions in various lists/forums, I've put together a rough
draft that attempts to flush out in enough detail a framework for
implementing the interface.  I hope that after a few iterations,
it can be used as a guide by DBMS package developers.

I found that describing the interface in terms of classes is a lot
more natural (and easier) than in terms of (only) generic functions and
methods.  Moreover, as I looked at other DB API's (Python, Perl, Java)
I felt more strongly that perhaps we should be using version 4
class-based instead of traditional version 3 object-based programming.
With the current library(methods) in R-devel in quite a reasonable shape, 
I think the DBI is an ideal candidate project for using more advanced
programming tools.  

So, let me throw in the idea of building the DBI using version 4
style classes.  I believe that most of the (current) Rdbi could
be re-used, and I'd volunteer to migrate it to the version 4 style.

Agreed. Does R yet support v.4 style classes? Is there a good 
reference/tutorial?

R-devel does, and I understand the 1.4 will have them as a library "methods".

Re: references, I know of two, John Chambers' "Programming with Data"
(green book) and V&R's "S Programming".

Regards,

DBI.tex

Content-Type:

application/x-tex


------------------------------------------------------------------------
figure1.eps

Content-Type:

application/postscript

-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/

Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/

Luke Tierney

Wed, Oct 10, 2001 11:02 AM #

On Wed, Oct 10, 2001 at 01:44:50PM -0400, Timothy H. Keitt wrote:

Allowing DBconnections to be handled by the GC as a backup is I think
the right thing to do.  Since they involve scarce resource, explicit
closing should be encouraged.  The fact that the connection system
does not do this is I think unfortunate and something I think we will
have to revisit before too long (aside from the potential of leaving
connections open it is too easy for one function's use of connections
to trample anothers, especially if we go to any form of threading).

It is nevertheless possible to allow a list of open connections to be
obtained by using a weak reference mechanism. There is experimental
support for this in the devel version, with some notes on it at
http://www.stat.umn.edu/~luke/R/references/weakfinex.html (also off
the developer page).

luke

Luke Tierney
University of Minnesota                      Phone:           612-625-7843
School of Statistics                         Fax:             612-624-8868
313 Ford Hall, 224 Church St. S.E.           email:      luke at stat.umn.edu
Minneapolis, MN 55455 USA                    WWW:  http://www.stat.umn.edu

Tim Keitt

Wed, Oct 10, 2001 11:21 AM #

Luke Tierney wrote:

The finalizer hooks described in those notes are the ones I used in Rdbi 
to let the garbage collector close connections associated with a 
connection object that gets collected. I see that we could also use this 
to list open connections as well.

Tim

Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/