How do we send prepared queries in RSQLite? We are interested in something along the lines of:
ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr = ?",
data.frame("integer"))
for( i in 1:2)
rs = dbGetPreparedQuery(ps, i)
In C it would look something along the lines of
http://www.mail-archive.com/sqlite-users at sqlite.org/msg31854.html
Thanks
Jim
dbSendPreparedQuery in RSQLite
10 messages · James Bullard, Seth Falcon, Sean Davis
On Tue, Apr 15, 2008 at 8:31 PM, James Bullard <bullard at berkeley.edu> wrote:
How do we send prepared queries in RSQLite. We are interested in
something along the lines of
ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr = ?",
data.frame("integer"))
for( i in 1:2)
rs = dbGetPreparedQuery(ps, i)
In C it would look something along the lines of
http://www.mail-archive.com/sqlite-users at sqlite.org/msg31854.html
It looks like this is not supported for select statements, though I could easily be wrong. In this particular case, you could rewrite your select as: "SELECT * from table1 where chr in (1,2)"
Sean
Yes, in this example that would certainly work. However, I have a large number of queries in a tight loop, and it would be nice to avoid the overhead of parsing and compiling the query on each pass through the loop. I also believe prepared queries are not supported for SELECT statements, and I wonder whether that would be a hard addition. Also, I wonder whether the syntax below is correct; compared with other settings it seems a little off (see the link in the original email).
Thanks again,
Jim
* On 2008-04-15 at 20:45 -0700 James Bullard wrote:
Yes, in this example that would certainly work. However, I have a large number of queries in a tight loop and it would be nice to avoid the overhead of the query parsing and compiling on each call through the loop. I believe it is not supported on select statements as well and wonder if this would be a hard addition,
You are correct, currently prepared queries are not supported for SELECT (I have no memory of why this is the case). I think that supporting prepared queries for SELECT should be possible, though not trivial. Are you sure you can't avoid your loop by sending a large query and batching through it using fetch on the result set returned?
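For example, something along these lines (a rough, untested sketch; it assumes an open connection con and the table1 from your example) would send one big query and pull the rows back in batches:

library(RSQLite)
rs <- dbSendQuery(con, "SELECT * FROM table1 WHERE chr IN (1, 2)")
while (!dbHasCompleted(rs)) {
    chunk <- fetch(rs, n = 1000)   # next batch of up to 1000 rows
    # ... process 'chunk' here ...
}
dbClearResult(rs)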
also, I wonder if this is the correct syntax below. From other settings it seems a little off (see the link in the original email).
By below I think you mean:
ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr = ?",
data.frame("integer"))
for( i in 1:2)
rs = dbGetPreparedQuery(ps, i)
Almost. As above, this is only supported for non-SELECT statements, so you might do (untested):
df = data.frame(x=1:10, y=letters[1:10])
sql = "INSERT INTO foo (a, b) VALUES (?, ?)"
rs = dbSendPreparedQuery(db, sql, df)
So you pass a data.frame and the loop is internal.
You can also name the parameters so you could have:
sql = "INSERT INTO foo (a, b) VALUES (:y, :x)"
rs = dbSendPreparedQuery(db, sql, df)
And y, x are matched based on names(df).
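To make that concrete, an end-to-end sketch (also untested; the table name and columns are just illustrative) might look like:

library(RSQLite)
db <- dbConnect(SQLite(), dbname = ":memory:")
dbGetQuery(db, "CREATE TABLE foo (a, b)")

df <- data.frame(x = 1:10, y = letters[1:10])

# positional form: the columns of df are bound to the ?'s in order
rs <- dbSendPreparedQuery(db, "INSERT INTO foo (a, b) VALUES (?, ?)", df)
dbClearResult(rs)

# named form: :y and :x are matched against names(df), regardless of column order
rs <- dbSendPreparedQuery(db, "INSERT INTO foo (a, b) VALUES (:y, :x)", df)
dbClearResult(rs)

dbGetQuery(db, "SELECT count(*) FROM foo")   # both inserts ran, so count(*) is 20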
+ seth
Seth Falcon | http://userprimary.net/user/
On Tue, Apr 15, 2008 at 11:45 PM, James Bullard <bullard at berkeley.edu> wrote:
Yes, in this example that would certainly work. However, I have a large number of queries in a tight loop and it would be nice to avoid the overhead of the query parsing and compiling on each call through the loop.
A couple of typical tricks for avoiding loops:
1) Use an "IN" query.
2) Use a temporary table (CREATE TEMPORARY TABLE ... or dbWriteTable()) and then do a join.
If you have hundreds or thousands of queries, it is quite possible that #2 could be much faster. Benchmarking would be useful here, but it is definitely worth a try if looping is too slow for you.
Sean
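For #1, a quick sketch of what I mean (assuming a vector of chromosomes and an open connection con):

chrs <- c(1, 2, 5)
sql <- paste("SELECT * FROM table1 WHERE chr IN (",
             paste(chrs, collapse = ","), ")")
res <- dbGetQuery(con, sql)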
Okay, well, I thought that what I am doing could not easily be done with one query, but since I am wrong more often than not, I'll ask and see if anyone can think of a way to combine it into one query.
I have a set of regions:
regions <- cbind(chr = c(1,2,3), start = c(1,230,4950), end = c(293, 320, 19200))
In the db I have the following table:
CREATE TABLE count_table (chr, location, count);
chr  location  count
  1        10      1
  1       290     10
  2       245      3
  3        10      2
The query loop that I am using is:
apply(regions, 1, function(region) {
    dbGetQuery(con, paste("SELECT * FROM count_table WHERE chr =", region[1],
                          "AND location >=", region[2],
                          "AND location <=", region[3], ";"))
})
Which gives me back the following (which is exactly what I want):
[[1]]
  chr location count__1
1   1       10        1
2   1      290       10

[[2]]
  chr location count__1
1   2      245        3

[[3]]
NULL data frame with 0 rows
So I don't see any obvious way to pass in multiple ranges, but maybe I am missing something about using the IN operator. Thanks in advance, and sorry for the long email.
Jim
On Wed, Apr 16, 2008 at 1:35 PM, James Bullard <bullard at berkeley.edu> wrote:
So I don't see any obvious way to pass in multiple ranges, but maybe I am missing something about using the IN operator. Thanks in advance, and sorry for the long email.
counts <- data.frame(chr=c(1,1,2,3), location=c(10,290,245,10), count=c(1,10,3,2))
dbWriteTable(con, 'count_table', counts, row.names=FALSE, overwrite=TRUE)
dbGetQuery(con, 'select * from count_table')

  chr location count__1
1   1       10        1
2   1      290       10
3   2      245        3
4   3       10        2

regions <- data.frame(chr = c(1,2,3), start = c(1,230,4950), end = c(293, 320, 19200))
dbWriteTable(con, 'regions', regions, row.names=FALSE, overwrite=TRUE)
dbGetQuery(con, 'select * from regions')

  chr start end__1
1   1     1    293
2   2   230    320
3   3  4950  19200

dbGetQuery(con, 'select count_table.* from count_table join regions on
                 regions.chr=count_table.chr and count_table.location between
                 regions.start and regions.end__1')

  chr location count__1
1   1       10        1
2   1      290       10
3   2      245        3

So, the principle of a temp table works. Let's try some benchmarking. Create a random count table of length 100k rows, write it to a database and create some indexes to speed querying.
counts <- data.frame(chr=sample(1:22, 100000, replace=TRUE),
                     location=sample(1:100000000, 100000, replace=TRUE),
                     count=sample(1:10, 100000, replace=TRUE))
counts[1:5,]

  chr location count
1  14 91737974     5
2   5 61059218     2
3   2 23944824     4
4   3 82907389     9
5  22 94658940     1

dbWriteTable(con, 'count_table', counts, row.names=FALSE, overwrite=TRUE)
[1] TRUE
dbGetQuery(con, 'create index ct_chr on count_table(chr)')
NULL
dbGetQuery(con, 'create index ct_location on count_table(location)')
NULL
So, to try a benchmark, I wrote a little function that creates n random regions (see below). In the temp-table approach, the data.frame is written to a table in the database and a join is used. In the second, the apply() approach is used. The function returns the system.time for each approach. In my quick benchmarking, the temp-table approach is about 5 times faster (including writing the temp table) with a count_table of 100k rows. I would expect the difference to get much larger as the size of count_table increases, since the indexes on count_table must be scanned for each of the apply queries but only once for the temp table/join approach.
benchRegions <- function(n) {
    # n is the number of regions.
    # Returns a list with 2 sets of system times, one for
    # the temp table approach and one for the apply approach.
    res <- list()
    starts <- sample(1:100000000, n, replace = TRUE)
    ends <- starts + sample(1:1000, n, replace = TRUE)
    regions <- data.frame(chr = sample(1:22, n, replace = TRUE),
                          start = starts,
                          end = ends)
    res[['tmptab']] <- system.time({
        dbWriteTable(con, 'regions', regions, row.names = FALSE, overwrite = TRUE)
        dbGetQuery(con, paste('select count_table.* from count_table join regions on',
                              'regions.chr=count_table.chr and count_table.location between',
                              'regions.start and regions.end__1'))
    })
    res[['apply']] <- system.time(apply(regions, 1, function(region) {
        dbGetQuery(con, paste("SELECT * FROM count_table WHERE chr =", region[1],
                              "AND location >=", region[2],
                              "AND location <=", region[3], ";"))
    }))
    return(res)
}
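To run the comparison you would just call it with the number of regions you care about; no timings shown here since they depend on your machine and the size of count_table:

times <- benchRegions(500)
times[['tmptab']]   # time for the temp table/join approach
times[['apply']]    # time for the per-region apply() approach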
Hope that helps.
Sean
Thank you Sean, this worked great. Now the cost of going to the database pales in comparison to the split that I need to do at the end to properly divide the returned data.frame into a list where each member is a region. It is unfortunate, because the data.frame returned to R is already sorted in region order; res looks like this:

  chr location count region
    1       10     1      1
    1      290    10      1
    2      245     3      2
    3       10     2      3
So then, in order to return the object in the appropriate list fashion, I do: split(res, res$region). This splitting dominates the cost of the entire function. Unfortunately, the splitting doesn't take advantage of the fact that the data is already sorted according to the factor. I think I will handle this by going to the database, getting the number of rows in each region, and then using that vector to partition up the data (unless of course anyone has any smarter ideas).
Thanks again, I really appreciate the detailed example; it helped tremendously in sorting out the indexing.
Jim
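Something like the following is what I have in mind (just a sketch; it assumes res is already ordered by region and that nrows is the vector of per-region row counts pulled from the database):

ends <- cumsum(nrows)                 # last row of each region's block in res
starts <- c(1, head(ends, -1) + 1)    # first row of each region's block
byRegion <- lapply(seq_along(nrows), function(i) {
    if (nrows[i] == 0)                # region with no counts -> empty data.frame
        return(res[0, , drop = FALSE])
    res[starts[i]:ends[i], , drop = FALSE]
})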
On Thu, Apr 17, 2008 at 2:04 PM, James Bullard <bullard at berkeley.edu> wrote:
So then, in order to return the object in the appropriate list fashion, I do: split(res, res$region). This splitting dominates the cost of the entire function. Unfortunately, the splitting doesn't take advantage of the fact that the data is already sorted according to the factor. I think I will handle this by going to the database, getting the number of rows in each region, and then using that vector to partition up the data (unless of course anyone has any smarter ideas).
What is the final information that you want to have? Do you actually need the detail of the counts per region, or are you doing some aggregate of the data in the end? In other words, what do you do with the data after the split?
Thanks again, I really appreciate the detailed example; it helped tremendously in sorting out the indexing.
Great. That is what email lists are all about--we all learn from them!
On Apr 17, 2008, at 4:19 PM, Sean Davis wrote:
What is the final information that you want to have? Do you actually need the detail of the counts per region, or are you doing some aggregate of the data in the end? In other words, what do you do with the data after the split?
Certainly the downstream analysis usually corresponds to some summary measure of a region. I am not too familiar with SQLite, but it does appear that I have the normal sum, max, min, and avg functions, which will be nice to use when applicable; however, there is clearly a desire for access to the data within each region when these functions won't suffice (fitting models, calculating variances and quantiles). That being said, your point is well taken: often the data.frame representation is more of what you want, and now I provide both interfaces.

As an aside, I implemented the method where I split using the bounds of each region determined via an SQL query (i.e., I got the number of rows corresponding to each region in the resulting data.frame). I then used an apply to partition these regions into the resulting list (where each element is a data.frame). This was to determine whether I was paying a cost for the fact that split assumes my data is unsorted according to the factor -- the results were not encouraging: the time spent sorting during the split pales in comparison to the construction of all the region data.frames. In any case, for now the running time is more or less acceptable.

Thanks again,
Jim
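For the cases where the built-in aggregates do suffice, a single grouped query over the same join should work. A sketch only (it reuses Sean's count_table and regions tables and the __1-suffixed column names that dbWriteTable produced in his example):

dbGetQuery(con, paste(
    "SELECT regions.chr, regions.start, regions.end__1,",
    "       sum(count_table.count__1) AS total,",
    "       avg(count_table.count__1) AS mean,",
    "       max(count_table.count__1) AS maximum",
    "FROM regions LEFT JOIN count_table",
    "  ON count_table.chr = regions.chr",
    " AND count_table.location BETWEEN regions.start AND regions.end__1",
    "GROUP BY regions.chr, regions.start, regions.end__1"))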