dbSendPreparedQuery in RSQLite
Thank you Sean, this worked great. Now the cost of going to the database pales in comparison to the split that I need to do in the end to properly divide up the returned data.frame into a list where each member is a region. It is unfortunate because the data.frame returned to R is sorted in region order. res <-
chr location count region 1 10 1 1 1 290 10 1 2 245 3 2 3 10 2 3
So then in order to return the object in the appropriate list fashion I do: split(res, res$region) This splitting dominates the cost of the entire function. Unfortunately, the splitting doesn't use the fact that it is sorted according to the factor. This I think I will handle by going to the database and getting the number of rows in a region and then using that vector to partition up the data (unless of course anyone has any smarter ideas) Thanks again, I really appreciate the detailed example it helped tremendously in sorting out the indexing. jim
On Apr 16, 2008, at 12:21 PM, Sean Davis wrote:
On Wed, Apr 16, 2008 at 1:35 PM, James Bullard <bullard at berkeley.edu> wrote:
Okay, well I thought that what I was doing was not easily done with
one
query but since I am wrong more often than not I'll ask and see if
anyone
can think of a way to combine it into one query.
I have a set of regions:
regions <- cbind(chr = c(1,2,3), start = c(1,230,4950), end =
c(293, 320,
19200))
In the db I have the following table:
CREATE TABLE count_table (chr, location, count);
chr location count
1 10 1
1 290 10
2 245 3
3 10 2
The query that I am using is:
apply(regions, 1, function(region) {
paste("SELECT * FROM count_table WHERE chr =", region[1], "AND
location >=", region[2], "AND location <=", region[3], ";")
})
Which gives me back the following (which is exactly what I want):
[[1]]
chr location count__1
1 1 10 1
2 1 290 10
[[2]]
chr location count__1
1 2 245 3
[[3]]
NULL data frame with 0 rows
So I don't see any obvious way to pass in multiple ranges, but
maybe I
missing something with using the IN operator, thanks in advance,
sorry for
the long email.
counts <- data .frame(chr=c(1,1,2,3),location=c(10,290,245,10),count=c(1,10,3,2)) dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE) dbGetQuery(con,'select * from count_table')
chr location count__1 1 1 10 1 2 1 290 10 3 2 245 3 4 3 10 2
regions <- data.frame(chr = c(1,2,3), start = c(1,230,4950), end = c(293, 320, 19200)) dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE) dbGetQuery(con,'select * from regions')
chr start end__1 1 1 1 293 2 2 230 320 3 3 4950 19200
dbGetQuery(con,'select count_table.* from count_table join regions on regions.chr=count_table.chr and count_table.location between regions.start and regions.end__1')
chr location count__1 1 1 10 1 2 1 290 10 3 2 245 3 So, the principle of a temp table works. Let's try some benchmarking. Create a random count table of length 100k rows, write it to a database and create some indexes to speed querying.
counts <- data .frame (chr = sample (1 : 22,100000 ,replace = TRUE ),location = sample (1 : 100000000,100000 ,replace=TRUE),count=sample(1:10,100000,replace=TRUE)) counts[1:5,]
chr location count 1 14 91737974 5 2 5 61059218 2 3 2 23944824 4 4 3 82907389 9 5 22 94658940 1
dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
[1] TRUE
dbGetQuery(con,'create index ct_chr on count_table(chr)')
NULL
dbGetQuery(con,'create index ct_location on count_table(location)')
NULL
So, to try a benchmark, I wrote a little function that creates random
regions (n of them, see function below). In the tmp table approach,
the data.frame is written to a table in the database and a join is
used. In the second, the apply() approach is used. The function
below returns the system.time for each approach. In my quick
benchmarking, the temp table approach is about 5 times faster
(including writing the temp table) with a count_table of length 100k
rows. I would expect that it might be a very much larger difference
as the size of the count_table increases, as the indexes on
count_table must be scanned for each of the apply queries and only
once for the temp table/join approach.
benchRegions <- function(n) {
# n is the number of regions.
# returns a list with 2 sets of system times, one for
# the temp table approach, and one for the apply approach
res <- list()
starts=sample(1:100000000,n,replace=TRUE)
ends=starts+sample(1:1000,n,replace=TRUE)
regions <- data.frame(chr=sample(1:22,n,replace=TRUE),
start=starts,
end=ends)
res[['tmptab']] <-
system
.time
({dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE);
dbGetQuery(con,'select count_table.* from count_table join regions on
regions.chr=count_table.chr and count_table.location between
regions.start and regions.end__1')})
res[['apply']] <- system.time(apply(regions, 1, function(region)
{dbGetQuery(con,paste("SELECT * FROM count_table WHERE chr =",
region[1], "AND location >=", region[2], "AND location <=",
region[3], ";"))
}))
return(res)
}
Hope that helps.
Sean
On Apr 16, 2008, at 8:47 AM, Sean Davis wrote:
On Tue, Apr 15, 2008 at 11:45 PM, James Bullard <bullard at berkeley.edu
wrote:
Yes, in this example that would certainly work. However, I have a large number of queries in a tight loop and it would be nice to avoid the
overhead
of the query parsing and compiling on each call through the loop.
A couple of typical tricks for avoidance of loops is to either: 1) Use an "in" query 2) Use a temporary table (CREATE TEMPORARY TABLE .... or dbWriteTable()) and then do a join If you have hundreds or thousands of queries, it is quite possible that #2 could be much faster. Benchmarking would be useful here, but it is definitely worth a try if you are too slow with looping. Sean
thanks again, jim On Apr 15, 2008, at 7:21 PM, Sean Davis wrote:
On Tue, Apr 15, 2008 at 8:31 PM, James Bullard <bullard at berkeley.edu
wrote:
How do we send prepared queries in RSQLite. We are interested in
something along the lines of
ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr
= ?",
data.frame("integer"))
for( i in 1:2)
rs = dbGetPreparedQuery(ps, i)
In C it would look something along the lines of
http://www.mail-archive.com/sqlite-users at sqlite.org/msg31854.html
It looks like this is not supported for select statements, though I could easily be wrong. In this particular case, you could rewrite your select as: "SELECT * from table1 where chr in (1,2)" Sean