dbSendPreparedQuery in RSQLite

Thu, Apr 17, 2008 11:04 AM

Thank you Sean, this worked great. Now the cost of going to the  
database pales in comparison to the split that I need to do in the end  
to properly divide up the returned data.frame into a list where each  
member is a region. It is unfortunate because the data.frame returned  
to R is sorted in region order.

res <-

So then in order to return the object in the appropriate list fashion  
I do:

split(res, res$region)

This splitting dominates the cost of the entire function.  
Unfortunately, the splitting doesn't use the fact that it is sorted  
according to the factor. This I think I will handle by going to the  
database and getting the number of rows in a region and then using  
that vector to partition up the data (unless of course anyone has any  
smarter ideas)

Thanks again, I really appreciate the detailed example it helped  
tremendously in sorting out the indexing.

jim

On Apr 16, 2008, at 12:21 PM, Sean Davis wrote:

On Wed, Apr 16, 2008 at 1:35 PM, James Bullard  
<bullard at berkeley.edu> wrote:

Okay, well I thought that what I was doing was not easily done with  
one
query but since I am wrong more often than not I'll ask and see if  
anyone
can think of a way to combine it into one query.

I have a set of regions:

regions <- cbind(chr = c(1,2,3), start = c(1,230,4950), end =  
c(293, 320,
19200))

In the db I have the following table:

CREATE TABLE count_table (chr, location, count);

chr location count
1       10              1
1       290             10
2       245             3
3       10              2

The query that I am using is:

apply(regions, 1, function(region) {
       paste("SELECT * FROM count_table WHERE chr =", region[1], "AND
location >=", region[2],  "AND location <=", region[3], ";")
})

Which gives me back the following (which is exactly what I want):
[[1]]
 chr location count__1
1   1       10        1
2   1      290       10

[[2]]
 chr location count__1
1   2      245        3

[[3]]
NULL data frame with 0 rows


So I don't see any obvious way to pass in multiple ranges, but  
maybe I
missing something with using the IN operator, thanks in advance,  
sorry for
the long email.

counts <-  
data 
.frame(chr=c(1,1,2,3),location=c(10,290,245,10),count=c(1,10,3,2))
dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
dbGetQuery(con,'select * from count_table')

 chr location count__1
1   1       10        1
2   1      290       10
3   2      245        3
4   3       10        2

regions <- data.frame(chr = c(1,2,3), start = c(1,230,4950), end =  
c(293, 320, 19200))
dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE)
dbGetQuery(con,'select * from regions')

 chr start end__1
1   1     1    293
2   2   230    320
3   3  4950  19200

dbGetQuery(con,'select count_table.* from count_table join regions  
on regions.chr=count_table.chr and count_table.location between  
regions.start and regions.end__1')

 chr location count__1
1   1       10        1
2   1      290       10
3   2      245        3

So, the principle of a temp table works.  Let's try some benchmarking.
Create a random count table of length 100k rows, write it to a
database and create some indexes to speed querying.

counts <-  
data 
.frame 
(chr 
= 
sample 
(1 
: 
22,100000 
,replace 
= 
TRUE 
),location 
= 
sample 
(1 
: 
100000000,100000 
,replace=TRUE),count=sample(1:10,100000,replace=TRUE))
counts[1:5,]

 chr location count
1  14 91737974     5
2   5 61059218     2
3   2 23944824     4
4   3 82907389     9
5  22 94658940     1

dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)

[1] TRUE

dbGetQuery(con,'create index ct_chr on count_table(chr)')

NULL

dbGetQuery(con,'create index ct_location on count_table(location)')

NULL

So, to try a benchmark, I wrote a little function that creates random
regions (n of them, see function below).  In the tmp table approach,
the data.frame is written to a table in the database and a join is
used.  In the second, the apply() approach is used.  The function
below returns the system.time for each approach.  In my quick
benchmarking, the temp table approach is about 5 times faster
(including writing the temp table) with a count_table of length 100k
rows.  I would expect that it might be a very much larger difference
as the size of the count_table increases, as the indexes on
count_table must be scanned for each of the apply queries and only
once for the temp table/join approach.

benchRegions <- function(n) {
  # n is the number of regions.
  # returns a list with 2 sets of system times, one for
  #  the temp table approach, and one for the apply approach
  res <- list()
  starts=sample(1:100000000,n,replace=TRUE)
  ends=starts+sample(1:1000,n,replace=TRUE)
  regions <- data.frame(chr=sample(1:22,n,replace=TRUE),
                        start=starts,
                        end=ends)
  res[['tmptab']] <-
system 
.time 
({dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE);
dbGetQuery(con,'select count_table.* from count_table join regions on
regions.chr=count_table.chr and count_table.location between
regions.start and regions.end__1')})
  res[['apply']] <- system.time(apply(regions, 1, function(region)
{dbGetQuery(con,paste("SELECT * FROM count_table WHERE chr =",
region[1], "AND location >=", region[2],  "AND location <=",
region[3], ";"))
}))
  return(res)
}

Hope that helps.

Sean

On Apr 16, 2008, at 8:47 AM, Sean Davis wrote:

On Tue, Apr 15, 2008 at 11:45 PM, James Bullard <bullard at berkeley.edu

wrote:

Yes, in this example that would certainly work. However, I have a  
large
number of queries in a tight loop and it would be nice to avoid the

overhead

of the query parsing and compiling on each call through the loop.

A couple of typical tricks for avoidance of loops is to either:

1) Use an "in" query
2) Use a temporary table (CREATE TEMPORARY TABLE .... or
dbWriteTable()) and then do a join

If you have hundreds or thousands of queries, it is quite possible
that #2 could be much faster.  Benchmarking would be useful here,  
but
it is definitely worth a try if you are too slow with looping.

Sean

thanks again, jim




On Apr 15, 2008, at 7:21 PM, Sean Davis wrote:

On Tue, Apr 15, 2008 at 8:31 PM, James Bullard <bullard at berkeley.edu

wrote:

How do we send prepared queries in RSQLite. We are interested in
something along the lines of

ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr  
= ?",
data.frame("integer"))
for( i in 1:2)
rs = dbGetPreparedQuery(ps, i)

In C it would look something along the lines of
http://www.mail-archive.com/sqlite-users at sqlite.org/msg31854.html

It looks like this is not supported for select statements,  
though I
could easily be wrong.  In this particular case, you could rewrite
your select as:

"SELECT * from table1 where chr in (1,2)"

Sean

dbSendPreparedQuery in RSQLite

Thread (10 messages)