Parameterised queries
13 messages · Hannes Mühleisen, Hadley Wickham, Tim Keitt

Hi all,

As part of my work modernising R's database connectors, I've been working on improving the support for parameterised queries. I'd love to get your feedback on the API I'm proposing. The goal is to encourage people to use parameterised queries instead of pasting together SQL strings, because it is much safer (no worries about SQL injection attacks) and somewhat more performant (because the database only needs to parse the query once).

There are basically two ways to use it:

* A parameterised query + multiple calls to `dbBind()`
* A parameterised query + a list of params.

Here's an example using the dev version of RSQLite available from https://github.com/rstats-db/RSQLite:

```
library(DBI)
con <- RSQLite::datasetsDb()

rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = $x")

# Not bound, so throws an error
dbFetch(rs)

# Bind by position
dbBind(rs, list(8))
dbFetch(rs)

# Bind by name
dbBind(rs, list(x = 8))
dbFetch(rs)

# Or bind when you create the query
rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = $x", list(x = 8))
dbFetch(rs)

# Or do it all at once with dbGetQuery
dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = $x", list(x = 8))
```

What do you think? I've deliberately designed the syntax to be backward compatible, although it obviously requires some changes to DBI (such as the introduction of the `dbBind()` generic). I also have an implementation available for Postgres (in my new RPostgres package, https://github.com/rstats-db/RPostgres) and I'll be working on RMySQL later this week.

Hadley
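[Editor's note: for readers less familiar with the injection risk mentioned above, here is a minimal base-R sketch — no database involved, input value hypothetical — contrasting string-pasting with a bound parameter:]

```r
# Untrusted input: the attacker appends a second statement
input <- "8; DROP TABLE mtcars"

# Pasting builds whatever string the input dictates
pasted <- paste0("SELECT * FROM mtcars WHERE cyl = ", input)

# A parameterised query keeps the SQL fixed; the value travels separately
sql    <- "SELECT * FROM mtcars WHERE cyl = $x"
params <- list(x = input)  # bound as data, never parsed as SQL

pasted  # contains the injected DROP TABLE
sql     # unchanged
```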
Hi Hadley and list,
On 11 Feb 2015, at 15:01, Hadley Wickham <h.wickham at gmail.com> wrote: As part of my work modernising R's database connectors, I've been working on improving the support for parameterised queries. [...]
Makes a lot of sense, yes. MonetDB.R has had support for this from day one. Our syntax uses the list-of-parameters approach, e.g. dbSendUpdate(conn, "INSERT INTO sometable (a) VALUES (?)", "foobar"); of course, the parameter can be a vector, in which case the query is executed multiple times. Generally, I would be in favour of the "list of params" approach. Also, please note that "bind by name" is not supported by all databases. Sticking to position-only parameter binding using ? would be most compatible (also used in JDBC, for example).

Best,
Hannes
On Wed, Feb 11, 2015 at 8:39 AM, Hannes Mühleisen
<Hannes.Muehleisen at cwi.nl> wrote:
Hi Hadley and list,

On 11 Feb 2015, at 15:01, Hadley Wickham <h.wickham at gmail.com> wrote: As part of my work modernising R's database connectors, I've been working on improving the support for parameterised queries. [...]
Makes a lot of sense, yes. MonetDB.R has had support for this from day one. Our syntax uses the list-of-parameters approach, e.g. dbSendUpdate(conn, "INSERT INTO sometable (a) VALUES (?)", "foobar"); of course, the parameter can be a vector, in which case the query is executed multiple times.
I deliberately chose not to allow vector parameters because I think it's reasonable to say: "Each R function only ever generates a single query to the database", leaving the onus of looping on the user, and forcing them to think about how to ensure the vectors don't contain bad values. This is the same principle behind disallowing multiple queries (separated by ";") in a single string. (Side note: if you create a new generic, can you please start it with (e.g.) monet, not db, because it doesn't live in the DBI package. Alternatively, submit a pull request to DBI adding the generic, and I'll review it. I don't think dbSendUpdate is needed with the current API, but I might be wrong.)
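[Editor's note: a sketch of what "the onus of looping on the user" looks like with the proposed API — assumes the dev RSQLite and the `dbBind()` generic described above:]

```r
library(DBI)
con <- RSQLite::datasetsDb()

rs <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = $x")
# The loop -- and any validation of the values -- lives visibly in user
# code; each iteration is exactly one query against the database
for (cyl in c(4, 6, 8)) {
  dbBind(rs, list(x = cyl))
  print(nrow(dbFetch(rs)))
}
dbClearResult(rs)
dbDisconnect(con)
```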
Generally, I would be in favour of the "list of params" approach. Also, please note that "bind by name" is not supported by all databases. Sticking to position-only parameter binding using ? would be most compatible (also used in JDBC, for example).
Yup, postgresql doesn't support names either. In that case, providing a named list would be an error. But where named binding is supported, I think it's better to use it because it eliminates a class of potential errors. Hadley
Hi Hadley,
On 11 Feb 2015, at 16:08, Hadley Wickham <h.wickham at gmail.com> wrote:
I deliberately chose not to allow vector parameters because I think it's reasonable to say: "Each R function only ever generates a single query to the database", leaving the onus of looping on the user, and forcing them to think about how to ensure the vectors don't contain bad values. This is the same principle behind disallowing multiple queries (separated by ";") in a single string.
However, there are optimisation opportunities that the db driver could exploit if multiple values are to be inserted at once. For example, a database connection in auto-commit mode could switch off auto-commit and try to insert all values in a transaction to get all-or-nothing semantics. Another opportunity would be a bulk load if the vectors are large.
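[Editor's note: the all-or-nothing idea can be expressed with DBI's existing transaction generics; a sketch assuming the dev RSQLite and the proposed `dbBind()`, with `sometable` being the hypothetical table from earlier in the thread:]

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sometable", data.frame(a = integer()))

values <- c(1L, 2L, 3L)
dbBegin(con)                 # switch off auto-commit for the batch
ok <- tryCatch({
  rs <- dbSendQuery(con, "INSERT INTO sometable (a) VALUES ($a)")
  for (v in values) dbBind(rs, list(a = v))  # one execution per value
  dbClearResult(rs)
  TRUE
}, error = function(e) FALSE)
if (ok) dbCommit(con) else dbRollback(con)   # all rows or none
dbDisconnect(con)
```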
Generally, I would be in favour of the "list of params" approach. Also, please note that "bind by name" is not supported by all databases. Sticking to position-only parameter binding using ? would be most compatible (also used in JDBC, for example).
Yup, postgresql doesn't support names either. In that case, providing a named list would be an error. But where named binding is supported, I think it's better to use it because it eliminates a class of potential errors.
I think a script using DBI should not need to know which DBI implementation is running behind it. But if someone uses named parameters on a MySQL backend (possibly out of order), that script will not run with Postgres or others.

Best,
Hannes
On Wed, Feb 11, 2015 at 8:01 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
Hi all, As part of my work modernising R's database connectors, I've been working on improving the support for parameterised queries. [...] What do you think? I've deliberately designed the syntax to be backward compatible, although it obviously requires some changes to DBI (such as the introduction of the dbBind() generic). I also have an implementation available for Postgres (in my new RPostgres package, https://github.com/rstats-db/RPostgres) and I'll be working on RMySQL later this week.
This seems like a reasonable extension to DBI. Just for comparison, in rpg (https://github.com/thk686/rpg) I have:

```
query(sql, pars)  # pars are substituted for $ vars if provided
fetch(sql, pars)  # also fetches result
```

The latest version (not yet on CRAN) has:

```
f = prepare(sql)
f(pars)
```

for prepared queries. f is a closure around the auto-generated prepared-query id-string. The function f does loop at the C++ level over a vector or matrix of parameters.

THK
Hadley -- http://had.co.nz/
_______________________________________________ R-sig-DB mailing list -- R Special Interest Group R-sig-DB at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-db
http://www.keittlab.org/
I deliberately chose not to allow vector parameters [...]

However, there are optimisation opportunities that the db driver could exploit if multiple values are to be inserted at once. For example, a database connection in auto-commit mode could switch off auto-commit and try to insert all values in a transaction to get all-or-nothing semantics. Another opportunity would be a bulk load if the vectors are large.

Agreed, but it's also fundamentally dangerous. I think this should be a separate function that clearly describes the performance-safety trade-off, maybe dbBindAll()? Alternatively, you could have an additional `vectorise` argument that defaulted to FALSE. That said, can you think of a use case apart from doing bulk inserts? In RSQLite, I use an internal function to get optimal performance for dbReadTable without generally exposing a more dangerous API.
Generally, I would be in favour of the "list of params" approach. Also, please note that "bind by name" is not supported by all databases. Sticking to position-only parameter binding using ? would be most compatible (also used in JDBC, for example).
Yup, postgresql doesn't support names either. In that case, providing a named list would be an error. But where named binding is supported, I think it's better to use it because it eliminates a class of potential errors.
I think a script using DBI should not need to know which DBI implementation is running behind it. But if someone uses named parameters on a MySQL backend (possibly out of order), that script will not run with Postgres or others.
That's a noble goal, but extremely difficult in principle because of the variations in SQL support across backends. I'd prefer not to take a lowest common denominator approach. Hadley
On Wed, Feb 11, 2015 at 1:49 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
I deliberately chose not to allow vector parameters [...]
However, there are optimisation opportunities that the db driver could exploit if multiple values are to be inserted at once. For example, a database connection in auto-commit mode could switch off auto-commit and try to insert all values in a transaction to get all-or-nothing semantics. Another opportunity would be a bulk load if the vectors are large.

Agreed, but it's also fundamentally dangerous. I think this should be a separate function that clearly describes the performance-safety trade-off, maybe dbBindAll()? Alternatively, you could have an additional `vectorise` argument that defaulted to FALSE.
I don't understand what is dangerous about repeatedly executing a prepared query. Can you give a scenario?
That said, can you think of a use case apart from doing bulk inserts? In RSQLite, I use an internal function to get optimal performance for dbReadTable without generally exposing a more dangerous API.
Other than the convenience, and it being sort of R-like, yes, bulk inserts is the main application in my case.
Generally, I would be in favour of the "list of params" approach.
Also, please note that the "bind by name" is not supported by all databases. Sticking to position-only parameter binding using ? would be most compatible (also used in JDBC for example).
Yup, postgresql doesn't support names either. In that case, providing a named list would be an error. But where named binding is supported, I think it's better to use it because it eliminates a class of potential errors.
I think a script using DBI should not need to know which DBI implementation is running behind it. But if someone uses named parameters on a MySQL backend (possibly out of order), that script will not run with Postgres or others.

That's a noble goal, but extremely difficult in principle because of the variations in SQL support across backends. I'd prefer not to take a lowest common denominator approach.
I thought that was the intent of DBI. THK
I deliberately chose not to allow vector parameters [...]

However, there are optimisation opportunities that the db driver could exploit if multiple values are to be inserted at once. [...]

Agreed, but it's also fundamentally dangerous. I think this should be a separate function that clearly describes the performance-safety trade-off, maybe dbBindAll()? Alternatively, you could have an additional `vectorise` argument that defaulted to FALSE.
I don't understand what is dangerous about repeatedly executing a prepared query. Can you give a scenario?
It gives a new attack vector - to introduce additional data into the database, you just need to figure out how to turn a length-1 vector into a length-2 vector. It's dangerous in the same way that allowing dbGetQuery() to execute multiple queries is dangerous.
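[Editor's note: a toy base-R simulation of the attack — no DBI involved; the execute-per-element rule is modelled by a counter — showing how data length silently controls how many statements run:]

```r
# Toy model of "bind a vector => execute the statement once per element"
run_count <- 0
bind_and_execute <- function(params) {
  for (p in params) run_count <<- run_count + 1  # one execution per element
}

bind_and_execute(8)        # the author expected exactly one execution
bind_and_execute(c(8, 6))  # attacker-lengthened input silently runs twice more
run_count
```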
That said, can you think of a use case apart from doing bulk inserts? In RSQLite, I use an internal function to get optimal performance for dbReadTable without generally exposing a more dangerous api.
Other than the convenience, and it being sort of R-like, yes, bulk inserts is the main application in my case.
Great. Hadley
On Wed, Feb 11, 2015 at 2:24 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
It gives a new attack vector - to introduce additional data into the database, you just need to figure out how to turn a length-1 vector into a length-2 vector. It's dangerous in the same way that allowing dbGetQuery() to execute multiple queries is dangerous.
I'd rather hope that if it were a case that mattered, the user would not rely on the API as a substitute for appropriate checks.

THK
It gives a new attack vector - to introduce additional data into the database, you just need to figure out how to turn a length-1 vector into a length-2 vector. It's dangerous in the same way that allowing dbGetQuery() to execute multiple queries is dangerous.
I'd rather hope that if it were a case that mattered, the user would not rely on the api as a substitute for appropriate checks.
I think the API should be as safe as possible by default, and sacrificing safety for speed should only be done explicitly when the user asks for it. Hadley
On Wed, Feb 11, 2015 at 2:41 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
I think the API should be as safe as possible by default, and sacrificing safety for speed should only be done explicitly when the user asks for it.
My use cases are not so sensitive, but I agree with the general idea. Also, you really do not gain much over regular looping as inserts are really slow, at least in postgresql. THK
I think the API should be as safe as possible by default, and sacrificing safety for speed should only be done explicitly when the user asks for it.
My use cases are not so sensitive, but I agree with the general idea. Also, you really do not gain much over regular looping as inserts are really slow, at least in postgresql.
Yes, I'm just working on that for RPostgres. Parameterised inserts are actually slower than sending one giant SQL string. RPostgreSQL uses COPY ... FROM STDIN (and manually converts the data frame into a single C string) for better performance.

Hadley
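[Editor's note: for anyone curious what "one giant SQL string" means in practice, a base-R sketch — table and column names hypothetical; the values here are numeric, so no quoting/escaping is needed, but with strings you must escape carefully or you reintroduce the injection problem:]

```r
# Build a single multi-row INSERT from a list of rows of numeric values
rows <- list(c(21, 6), c(22.8, 4))
values <- vapply(rows, function(r) {
  paste0("(", paste(r, collapse = ", "), ")")  # one "(v1, v2)" group per row
}, character(1))
sql <- paste0("INSERT INTO mtcars2 (mpg, cyl) VALUES ",
              paste(values, collapse = ", "))
sql
```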
On Wed, Feb 11, 2015 at 3:07 PM, Hadley Wickham <h.wickham at gmail.com> wrote:
I think the API should be as safe as possible by default, and sacrificing safety for speed should only be done explicitly when the user asks for it.

My use cases are not so sensitive, but I agree with the general idea. Also, you really do not gain much over regular looping as inserts are really slow, at least in postgresql.

Yes, I'm just working on that for RPostgres. Parameterised inserts are actually slower than sending one giant SQL string. RPostgreSQL uses COPY ... FROM STDIN (and manually converts the data frame into a single C string) for better performance.
I put copy_from and copy_to in rpg. I punted on the C interface and simply shell out to psql in that case. Works fine with read.csv and write.csv.

THK