Skip to content

Parameterised queries

13 messages · Hannes Mühleisen, Hadley Wickham, Tim Keitt

#
Hi all,

As part of my work modernising R's database connnectors, I've been
working on improving the support for parameterised queries. I'd love
to get your feedback on the API I'm proposing.

The goal is to encourage people to use parameterise queries instead of
pasting together SQL strings because it is much safer (no worries
about SQL injection attacks) and somewhat more performant (becuase the
database only needs to parse the query once).

There are basically two ways to use it:

* A parameterised query + multiple calls to `dbBind()`
* A parameterised query + a list of params.

Here's an example using the dev version of RSQLite available from
https://github.com/rstats-db/RSQLite:

```
library(DBI)
con <- RSQLite::datasetsDb()

rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = $x")
# Not bound, so throws error
dbFetch(rs)

# Bind by position
dbBind(rs, list(8))
dbFetch(rs)

# Bind by name
dbBind(rs, list(x = 8))
dbFetch(rs)

# Or do when you create the query
rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = $x", list(x = 8))
dbFetch(rs)

# Or all at once with dbGetQuery
dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = $x", list(x = 8))
```

What do you think? I've deliberately designed the syntax to be
backward compatible, although it obviously requires some changes to
DBI (such as the introduction of the dbBind() generic).

I also have an implementation available for Postgres (in my new
RPostgres package, https://github.com/rstats-db/RPostgres) and I'll be
working on RMySQL later this week.

Hadley
#
Hi Hadley and list,
Makes a lot of sense, yes. MonetDB.R has had support for this from day one. Our syntax uses the list of parameters approach, e.g.

dbSendUpdate(conn, "INSERT INTO sometable (a) VALUES (?)", ?foobar?)

of course, the parameter can be a vector, in which case the query is executed multiple times. 

Generally, I would be in favour of the ?list of params? approach. Also, please note that the ?bind by name? is not supported by all databases. Sticking to position-only parameter binding using ? would be most compatible (also used in JDBC for example).

Best,

Hannes


-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4154 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-db/attachments/20150211/86267552/attachment.p7s>
#
On Wed, Feb 11, 2015 at 8:39 AM, Hannes M?hleisen
<Hannes.Muehleisen at cwi.nl> wrote:
I deliberately chose not to allow vector parameters because I think
it's reasonable to say: "Each R function only ever generates a single
query to the database", leaving the onus of looping on the user, and
forcing them to think about how to ensure the vectors don't contain
bad values.  This the same principle behind disallowing multiple
queries (separated by ";") in a single string.

(Side note: if you create a new generic, can you please start it with
(e.g.) monet, not db, because it's doesn't live in the DBI package.
Alternatively, submit a pull request to DBI adding the generic, and
I'll review it. I don't think dbSendQuery is needed with the current
api, but I might be wrong.)
Yup, postgresql doesn't support names either. In that case, providing
a named list would be an error. But where named binding is supported,
I think it's better to use it because it eliminates a class of
potential errors.

Hadley
#
Hi Hadley,
However, there are optimisation opportunities that the db driver could exploit if multiple values are to be inserted at once. For example, a database connection in auto commit mode could switch of auto commit, and try to insert all values in a transaction to get all-or-nothing semantics. Another opportunity would be a bulk load if the vectors are large.
I think a script using DBI should not need to know which DBI implementation is running behind it. But if someone uses named parameters on a MySQL backend (possibly out of order), that script will not run with Postgres or others. 

Best,

Hannes




-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4154 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-db/attachments/20150211/9c4f187d/attachment.p7s>
#
On Wed, Feb 11, 2015 at 8:01 AM, Hadley Wickham <h.wickham at gmail.com> wrote:

            
This seems like a reasonable extension to DBI.

Just for comparison, in rpg (https://github.com/thk686/rpg) I have:

query(sql, pars)  # pars are substituted for $ vars if provided
fetch(sql, pars) # also fetches result

The latest version (not yet on CRAN) has:

f = prepare(sql)
f(pars)

for prepared queries. f is a closure around the auto-generated
prepared-query id-string. The function f does loop at the C++ level over a
vector or matrix of parameters.

THK

  
    
#
Agreed, but it's also fundamentally dangerous. I think this should be
a separate function with that clearly describes the performance-safety
trade off, maybe dbBindAll()?  Alternatively, you could have an
additional `vectorise` argument that defaulted to FALSE.

That said, can you think of a use case apart from doing bulk inserts?
In RSQLite, I use an internal function to get optimal performance for
dbReadTable without generally exposing a more dangerous api.
That's a noble goal, but extremely difficult in principle because of
the variations in SQL support across backends. I'd prefer not to take
a lowest common denominator approach.

Hadley
#
On Wed, Feb 11, 2015 at 1:49 PM, Hadley Wickham <h.wickham at gmail.com> wrote:

            
I don't understand what is dangerous about repeatedly executing a prepared
query. Can you give a scenario?
Other than the convenience, and it being sort of R-like, yes, bulk inserts
is the main application in my case.
I thought that was the intent of DBI.

THK

  
    
#
It gives a new attack vector - to introduce additional data into the
database, you just need to figure out how to turn a length 1 vector in
to a length 2 vector.

It's dangerous in the same way that allowing dbGetQuery() to execute
multiple queries is dangerous.
Great.

Hadley
#
On Wed, Feb 11, 2015 at 2:24 PM, Hadley Wickham <h.wickham at gmail.com> wrote:

            
I'd rather hope that if it were a case that mattered, the user would not
rely on the api as a substitute for appropriate checks.

THK

  
    
#
I think the API should be as safe as possible by default, and
sacrificing safety for speed should only be done explicitly when the
user asks for it.

Hadley
#
On Wed, Feb 11, 2015 at 2:41 PM, Hadley Wickham <h.wickham at gmail.com> wrote:

            
My use cases are not so sensitive, but I agree with the general idea. Also,
you really do not gain much over regular looping as inserts are really
slow, at least in postgresql.

THK

  
    
#
Yes, I'm just working on that for RPostgres. Parameterised inserts are
actually slower than sending one giant SQL string. RPostgreSQL uses
COPY ... FROM STDIN (and manually converts the data frame into a
single c string) for better performance

Hadley
#
On Wed, Feb 11, 2015 at 3:07 PM, Hadley Wickham <h.wickham at gmail.com> wrote:

            
I put copy_from and copy_to in rpg. I punted on the C interface and simply
shell out to psql in that case. Works fine with read.csv. and write.csv.

THK