dbSendPreparedQuery in RSQLite

10 messages · James Bullard, Seth Falcon, Sean Davis

#
How do we send prepared queries in RSQLite? We are interested in something along the lines of:

    ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr = ?",
                             data.frame("integer"))
    for (i in 1:2)
        rs = dbGetPreparedQuery(ps, i)

In C it would look something like this:
http://www.mail-archive.com/sqlite-users@sqlite.org/msg31854.html

Thanks
Jim
#
On Tue, Apr 15, 2008 at 8:31 PM, James Bullard <bullard at berkeley.edu> wrote:
It looks like this is not supported for select statements, though I
could easily be wrong.  In this particular case, you could rewrite
your select as:

"SELECT * from table1 where chr in (1,2)"

Sean
#
Yes, in this example that would certainly work. However, I have a large number of queries in a tight loop, and it would be nice to avoid the overhead of parsing and compiling the query on each pass through the loop. I believe prepared queries are also not supported for SELECT statements, and I wonder whether that would be a hard addition. I also wonder whether the syntax below is correct; from other settings it seems a little off (see the link in the original email).

thanks again, jim
On Apr 15, 2008, at 7:21 PM, Sean Davis wrote:
#
* On 2008-04-15 at 20:45 -0700 James Bullard wrote:
You are correct: currently prepared queries are not supported for SELECT (I have no memory of why this is the case).  I think that supporting prepared queries for SELECT should be possible, though not trivial.

Are you sure you can't avoid your loop by sending a large query and
batching through it using fetch on the result set returned?
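For instance (a sketch, not from the original message; fetch() takes an n argument giving the number of rows per batch):

    res <- dbSendQuery(db, "SELECT * FROM table1 WHERE chr IN (1, 2)")
    while (!dbHasCompleted(res)) {
        chunk <- fetch(res, n = 500)   # pull 500 rows per round trip
        ## ... process chunk ...
    }
    dbClearResult(res)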
By "below" I think you mean the example in your first message.
Almost: as above, this is only supported for non-SELECT statements, so you might do (untested):

    df = data.frame(x=1:10, y=letters[1:10])
    sql = "INSERT INTO foo (a, b) VALUES (?, ?)"
    rs = dbSendPreparedQuery(db, sql, df)

So you pass a data.frame and the loop is internal.
You can also name the parameters so you could have:

    sql = "INSERT INTO foo (a, b) VALUES (:y, :x)"
    rs = dbSendPreparedQuery(db, sql, df)

And y, x are matched based on names(df).
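Putting it together, an end-to-end sketch (untested; the table foo and its columns are made up for illustration):

    library(RSQLite)
    db <- dbConnect(SQLite(), dbname = ":memory:")
    dbGetQuery(db, "CREATE TABLE foo (a TEXT, b INTEGER)")

    df <- data.frame(x = 1:10, y = letters[1:10], stringsAsFactors = FALSE)
    ## :y and :x are bound by matching names(df), not by position
    rs <- dbSendPreparedQuery(db, "INSERT INTO foo (a, b) VALUES (:y, :x)", df)
    dbClearResult(rs)
    dbGetQuery(db, "SELECT count(*) FROM foo")   # should report 10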

+ seth
#
On Tue, Apr 15, 2008 at 11:45 PM, James Bullard <bullard at berkeley.edu> wrote:
A couple of typical tricks for avoiding loops are to either:

1) use an "in" query, or
2) use a temporary table (CREATE TEMPORARY TABLE ... or
dbWriteTable()) and then do a join.

If you have hundreds or thousands of queries, it is quite possible
that #2 could be much faster.  Benchmarking would be useful here, but
it is definitely worth a try if the looping version is too slow.
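In code, trick #2 looks something like this (an untested sketch, assuming an open connection con; table and column names are made up):

    ## write the lookup keys as their own table, then join against it
    dbWriteTable(con, "keys", data.frame(chr = c(1, 2)),
                 row.names = FALSE, overwrite = TRUE)
    dbGetQuery(con, "SELECT table1.* FROM table1
                     JOIN keys ON keys.chr = table1.chr")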

Sean
#
Okay, I thought that what I am doing could not easily be done with one query, but since I am wrong more often than not, I'll ask and see if anyone can think of a way to combine it into one.

I have a set of regions:

    regions <- cbind(chr = c(1, 2, 3),
                     start = c(1, 230, 4950),
                     end = c(293, 320, 19200))

In the db I have the following table:

    CREATE TABLE count_table (chr, location, count);

which currently holds:

    chr location count
      1       10     1
      1      290    10
      2      245     3
      3       10     2

The query that I am using, run once per region, is:

    apply(regions, 1, function(region) {
        dbGetQuery(con, paste("SELECT * FROM count_table WHERE chr =",
                              region[1], "AND location >=", region[2],
                              "AND location <=", region[3], ";"))
    })

Which gives me back the following (which is exactly what I want):
[[1]]
  chr location count
1   1       10     1
2   1      290    10

[[2]]
  chr location count
1   2      245     3

[[3]]
data frame with 0 rows


So I don't see any obvious way to pass in multiple ranges, but maybe I am missing something about using the IN operator. Thanks in advance, and sorry for the long email.


Jim
On Apr 16, 2008, at 8:47 AM, Sean Davis wrote:
#
On Wed, Apr 16, 2008 at 1:35 PM, James Bullard <bullard at berkeley.edu> wrote:
The count_table contents:

      chr location count
    1   1       10     1
    2   1      290    10
    3   2      245     3
    4   3       10     2

The regions, written to a table:

      chr start   end
    1   1     1   293
    2   2   230   320
    3   3  4950 19200

And the result of the join:

      chr location count
    1   1       10     1
    2   1      290    10
    3   2      245     3
So, the principle of a temp table works.  Let's try some benchmarking: create a random count table of 100k rows, write it to a database, and create some indexes to speed querying.
The first few rows of the random count_table, followed by the return values of dbWriteTable() and the two index-creation statements:

      chr location count
    1  14 91737974     5
    2   5 61059218     2
    3   2 23944824     4
    4   3 82907389     9
    5  22 94658940     1

    [1] TRUE
    NULL
    NULL
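The code that produced this output was dropped by the archive; it was presumably something along these lines (a reconstruction, untested; the index names are made up):

    n <- 100000
    count_table <- data.frame(chr = sample(1:22, n, replace = TRUE),
                              location = sample(1:100000000, n, replace = TRUE),
                              count = sample(1:10, n, replace = TRUE))
    head(count_table)
    dbWriteTable(con, "count_table", count_table,
                 row.names = FALSE, overwrite = TRUE)                 # [1] TRUE
    dbGetQuery(con, "CREATE INDEX chr_idx ON count_table (chr)")      # NULL
    dbGetQuery(con, "CREATE INDEX loc_idx ON count_table (location)") # NULL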

So, to try a benchmark, I wrote a little function that creates n random regions (see function below).  In the temp table approach, the data.frame is written to a table in the database and a join is used; in the second, the apply() approach is used.  The function returns the system.time for each approach.  In my quick benchmarking, the temp table approach is about 5 times faster (including writing the temp table) with a count_table of 100k rows.  I would expect the difference to grow much larger as the count_table gets bigger, since the indexes on count_table must be scanned once for each query in the apply approach but only once in the temp table/join approach.

benchRegions <- function(n) {
    # n is the number of regions.
    # Returns a list with two sets of system times: one for the
    # temp table approach, and one for the apply approach.
    res <- list()
    starts <- sample(1:100000000, n, replace = TRUE)
    ends <- starts + sample(1:1000, n, replace = TRUE)
    regions <- data.frame(chr = sample(1:22, n, replace = TRUE),
                          start = starts,
                          end = ends)
    res[["tmptab"]] <- system.time({
        dbWriteTable(con, "regions", regions,
                     row.names = FALSE, overwrite = TRUE)
        dbGetQuery(con, "SELECT count_table.* FROM count_table
                         JOIN regions ON regions.chr = count_table.chr
                         AND count_table.location
                             BETWEEN regions.start AND regions.end")
    })
    res[["apply"]] <- system.time(apply(regions, 1, function(region) {
        dbGetQuery(con, paste("SELECT * FROM count_table WHERE chr =",
                              region[1], "AND location >=", region[2],
                              "AND location <=", region[3], ";"))
    }))
    return(res)
}
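For example (exact timings will vary):

    times <- benchRegions(1000)
    times$tmptab   # elapsed time, temp table/join approach
    times$apply    # elapsed time, per-region apply() approach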

Hope that helps.

Sean
#
Thank you Sean, this worked great. Now the cost of going to the database pales in comparison to the split that I have to do at the end to divide the returned data.frame into a list with one element per region. This is unfortunate because the data.frame returned to R is already sorted in region order.

    res <-

So then, in order to return the object in the appropriate list fashion, I do:

    split(res, res$region)

This splitting dominates the cost of the entire function. Unfortunately, split() does not take advantage of the fact that the data is already sorted according to the factor. I think I will handle this by asking the database for the number of rows in each region and then using that vector to partition the data (unless of course anyone has any smarter ideas).
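Something along these lines, perhaps (an untested sketch; it assumes res carries a region column and is sorted by it):

    ## rows per region; rle() works because res is sorted by region
    counts <- rle(as.character(res$region))$lengths
    ends   <- cumsum(counts)
    starts <- ends - counts + 1L
    ## slice the sorted data.frame directly, skipping split()'s overhead
    byRegion <- lapply(seq_along(counts),
                       function(i) res[starts[i]:ends[i], ])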

Thanks again, I really appreciate the detailed example it helped  
tremendously in sorting out the indexing.

jim
On Apr 16, 2008, at 12:21 PM, Sean Davis wrote:
#
On Thu, Apr 17, 2008 at 2:04 PM, James Bullard <bullard at berkeley.edu> wrote:
What is the final information that you want to have?  Do you actually
need the detail of the counts per region, or are you doing some
aggregate of the data in the end?  In other words, what do you do with
the data after the split?

Great.  That is what email lists are all about--we all learn from them!
#
On Apr 17, 2008, at 4:19 PM, Sean Davis wrote:
Certainly the downstream analysis usually corresponds to some summary measure of a region. I am not too familiar with SQLite, but it does appear that I have the normal sum, max, min, and avg functions, which will be nice to use when applicable; however, there is clearly a need for access to the data within each region when these functions won't suffice (fitting models, calculating variances and quantiles). That being said, your point is well taken: often the data.frame representation is more of what you want, and now I provide both interfaces.
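When an SQL aggregate does suffice, the per-region summary can be computed entirely in the database; a sketch (untested, and it assumes the regions table carries an id column, which the version above does not have):

    sql <- "SELECT regions.id AS region,
                   sum(count) AS total, avg(count) AS mean,
                   min(count) AS lo,    max(count) AS hi
            FROM count_table JOIN regions
              ON regions.chr = count_table.chr
             AND count_table.location BETWEEN regions.start AND regions.end
            GROUP BY regions.id"
    dbGetQuery(con, sql)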

As an aside, I implemented the method where I split using the bounds of each region determined via an SQL query (i.e. I got the number of rows corresponding to each region in the resulting data.frame) and then used an apply to partition those rows into the resulting list (where each element is a data.frame). This was to determine whether or not I was paying a cost for the fact that split() assumes my data is unsorted according to the factor. The results were not encouraging: the time spent sorting during the split seems to pale in comparison to the construction of all the region data.frames. In any case, for now the running time is more or less acceptable.

thanks again, jim