Skip to content

SQL escaping/quoting proposal

7 messages · Hadley Wickham, Lee Hachadoorian, NISHIYAMA Tomoaki +1 more

#
Hi all,

The approach that DBI takes to escaping is sub-optimal: it tries to
figure out if an R variable name matches an SQL reserved word, and if
so munge it so that there's no longer a conflict. This creates a
situation where there are some identifiers that are valid in R, and
some that are valid in SQL and we have a complicated and bug prone
approach to converting between them.

Instead, I recommend taking an approach where identifiers (i.e. table
and field names) are always quoted using the appropriate database
syntax. This not only avoids any problems with SQL reserved words, but
it also ensures that every field name in R (even those containing
spaces and other special characters) can be used in SQL.

To achieve this change, I think we should to:

* deprecate `make.db.names()`, `isSQLKeyword()`, and `SQLKeywords()`
* add new generics `sqlQuoteString()` and `sqlQuoteIdentifier()`.

The new generics would be defined on the driver object, and would come
with default methods as follows:

```
setGeneric("sqlQuoteString", function(drv, x, ...) {
  standardGeneric("sqlQuoteString")
})
setMethod("sqlQuoteString", "DBIDriver", function(drv, x, ...) {
  x <- gsub('"', '""', x, fixed = TRUE)
  paste('"', x, '"', sep = "")
})

setGeneric("sqlQuoteIdentifer", function(drv, x, ...) {
  standardGeneric("sqlQuoteIdentifer")
})
setMethod("sqlQuoteString", "DBIDriver", function(drv, x, ...) {
  x <- gsub("'", "''", x, fixed = TRUE)
  paste("'", x, "'", sep = "")
})
```

Individual implementations would be encouraged to provide methods that
use the quoting functions provided by the client library, where
available.

Does anyone see any problems with this approach?

Hadley
#
On 10/18/2013 12:44 PM, Hadley Wickham wrote:
Hadley,

Admittedly, I rarely use R to *create* SQL table definitions. But I 
would like to preserve the possibility of (a) creating db-safe 
identifiers, (b) not using SQL quoting when I don't want to.

Regarding (a), I would suggest *not* deprecating `make.db.names()`, etc.

Regarding (b), my reasoning is that when working with Postgres, I would 
want to allow Postgres to do its normal lower casing of unquoted 
identifiers. That is, `thisField` is a valid identifier, but passed in 
quoted case will be preserved, which means always having to quote the 
identifier in the future. Passed in quoted, Postgres will force it to 
`thisfield`, and if a client requests `thisField` unquoted, the correct 
field will be returned.

--Lee
#
It might be reasonable to opt in to this behaviour, but it's dangerous
by default because R is case-sensitive. What happens if you're using
dbWriteTable with this data.frame?

df <- data.frame(a = 1, A = 2)

Hadley
#
Hi Hadley,

Thanks to bring up this issue and 
I agree in the overall direction and have some minor concerns.

Could you provide a bit more clarification on what the sqlQuoteIdentifier should do?
How shall we deal when a vector of strings is passed?
  Is it right to assume that sqlQuoteIdentifier(drv, c('a', 'b')) should 
  return a vector consisting of quoted results of individual element of the vector?

How do we construct a reference to table with schema, or column with table?
eg schema.table or table.column?
  More specifically, is it right to assume that sqlQuoteIdentifier is used for constructing
  individual part of the composite identifier?

You had a minor mistake in showing the default method (The name is both "sqlQuoteString",
and I am not sure which is intended for sqlQuoteIdentifier).

Another consideration is for the name of the function. Whether we should 
use sql prefix or use db prefix.  I would like to know what others think for this point.
Here, I think we can avoid problem in most cases, but there are still a bit
cases where the encoding does not allow proper conversion.
That's the problem of the database capability, there is not much things that
the driver can do, though.
#
On 13-10-18 12:44 PM, Hadley Wickham wrote:
I think this addresses my longtime wish that DBI would present a 
consistent interface on the R side wrt capitalization even though the db 
engines do different things in this regard. This would make changing 
among engines easier. If it does not achieve this, is it possible?

Paul

This not only avoids any problems with SQL reserved words, but
#
I've cleaned up the examples and made them work and put the results in
https://gist.github.com/hadley/7057387 - that should make discussion a
bit more concrete.
Yes, I think that makes the most sense.
I think that would be up to the individual function author: you could
assume that if a vector was passed then you should quote then
concatenate together with ".".  Or you could assume that for more
complicated references the user had already flagged that the input
should not be escaped with sql().
Fixed.
I'm pretty sure it should be db to be consistent with the rest of the
package. (And I've also added dbFetch as an alias since fetch is the
_only_ function in DBI without the db prefix)
Right, we can only support what the db can.

The quoting function could also throw an error if it was not possible
to quote the input in a safe way for the database.

Hadley
2 days later
#
I made a few more changes to https://gist.github.com/hadley/7057387

* the method is now based on the connection, not the driver, since for
some drivers the escaping may vary between connections (e.g. ODBC)

* I added a sample tableName function (which I don't think would be
included in DBI), showing how you might accept multiple forms of table
names, and do the right thing as much as possible.

Hadley