concurrent reading/writing in "chunks" with RSQLite (need some help troubleshooting)

4 messages · Jonathan Greenberg, Seth Falcon, Sean Davis +1 more

Mon, Jul 5, 2010 12:36 PM #

Keith and Seth (and r-sig-db'ers):

I somewhat understand the difference between both of your solutions --
Seth, I'm not a database person (quickly learning tho) but does your
code hold the output in memory before writing it all at the end (via
dbCommit), or is each iteration's output being stored in some "waiting
room" (on disk) where it gets written at the very end?  The key (for
my purposes) is to avoid keeping more than a chunk's worth of input
AND output in memory at any one time.

Along the lines of Keith's solution, I noticed the COUNT query can be
VERY slow -- there is no rapid way of getting the number of rows in a
database table?

Thanks again!

--j

On Tue, Jun 29, 2010 at 9:50 PM, kMan <kchamberln at gmail.com> wrote:

Hi Seth,

This particular list requires considerable stretching on my part, so I would not be surprised if I missed a goal implied by the code style used or off-list correspondence.

If I understood the original code correctly, though, each iteration through the loop required (2) queries, a read and a write, where (I've assumed) the result sets are handled one at a time anyway. The code I proposed does the same thing, but overall there is (1) additional query prior to the loop to get the number of rows. The rest is just case processing b/c dbGetQuery() seemed to do funny things when LIMIT extended beyond the end of the table, and it wouldn't remember its last read location (e.g. didn't work like scan()). In addition, it works with dbWriteTable() and gets around the need for creating the table first, and filling it in later.

I used the WHERE condition for lack of a better way to seek to a particular record. The fetch() & fetch_n, were those intended to get the read queries to start at the last read location (e.g. like serial reads using scan())? Eliminating need to condition match or generate a db index first would speed things up, and I'd very much like to learn that trick. I'd think it would also make for an excellent R-wiki post.

Sincerely,
KeithC.

-----Original Message-----
From: Seth Falcon [mailto:seth at userprimary.net]
Sent: Tuesday, June 29, 2010 9:37 PM
To: kMan
Cc: Jonathan Greenberg; r-sig-db at stat.math.ethz.ch
Subject: Re: R-sig-DB Digest, Vol 68, Issue 7

On Tue, Jun 29, 2010 at 8:19 PM, kMan <kchamberln at gmail.com> wrote:

Dear Jonathan,

I'm trying to cycle through a database, reading a chunk of data from one

table in the DB, processing this data chunk, >and then writing the output to
another table in the database. ?I'm having to set the fetch(n=) option
because the >database is too large to read the entire thing into memory at
once. ?Here's my sample code:
[snip]

There is no need to open separate connections for reading vs. writing.

While it is true that you can read and write from a single connection,
you can only have one result set open at a time and so if you want to
iterate through a result set and then write, you do, I believe, need
two connections.


I suspect that the code you propose will be considerably slower as you
are issuing multiple queries along with a WHERE constraint.

+ seth

Your code was corrupting the connection between R/driver, I think, and I
don't know why. But when I close a connection in R and my OS won't let me
delete the file, there is a different kind of problem. The code below works
on my system. I wish I knew enough to explain wtf happened with your code.
------------------------------------------------------------------------
##
## Prelim/Declarations
##
require(RSQLite)
mysqldb <- "test.sqlite"
m <- dbDriver("SQLite")
tbl.names <- c("TABLEA", "TABLEB")
q1<-paste("SELECT COUNT(*) FROM", tbl.names[1])

##
## Surrogate data (if not allready done)
##
(somedata=data.frame(data1=seq(1:10),data2=seq(1:10)*3))
con<-dbConnect(m, dbname=mysqldb)
dbWriteTable(con,"TABLEA",somedata, overwrite=TRUE)
dbDisconnect(con)

##
## Process TABLEA in parts, write to TABLEB
##
con<-dbConnect(m, dbname=mysqldb)
n.lines<-dbGetQuery(con, q1) #NOTE class is data.frame
chunk.size<-3
start.index<-seq(from = 1, to = n.lines[1,1], by = 3)
limit<-c(diff(start.index), n.lines[1,1]-sum(diff(start.index)))
i<-1
for(i in 1:length(limit)){
?if(i<length(limit)) { q2<-paste("SELECT * FROM TABLEA WHERE row_names >=",

? ? ?start.index[i], "LIMIT", limit[i])
?} else { q2<-paste("SELECT * FROM TABLEA WHERE row_names =",
? ? ?start.index[i], "LIMIT", limit[i])
?}
?dat<-dbGetQuery(con, q2)
?dbWriteTable(con, tbl.names[2], dat, row.names=FALSE, append=TRUE)
}
dbGetQuery(con, "SELECT * FROM TABLEB")
dbDisconnect(con)
------------------------------------------------------------------------

Sincerely,
KeithC.



--
Seth Falcon | @sfalcon | http://userprimary.net/

Seth Falcon

Mon, Jul 5, 2010 1:09 PM #

Jonathan,

On Mon, Jul 5, 2010 at 12:36 PM, Jonathan Greenberg

<greenberg at ucdavis.edu> wrote:

It seems at this point that your simplified example is no longer
sufficient for understanding what you want to achieve.  I'm pretty
sure that data within a transaction will be written to disk (for a
disk-based db) before commit is called.  This is something that will
be determined by SQLite itself -- you can go read up on how things
work.  Or you can do some experiments with your data and see how it
looks.

Depending on the manipulations you actually want to perform before
inserting the data, you might be able to keep everything in the db.
You could use SQLite extension functions (see RSQLite.extfuns) to
expand what is possible at the SQL level

No, there is no notion of number of rows in a table that I am aware
of.  Note that COUNT is a way of aggregating the number of rows
returned from a query and in general there is no way for the db to
know how many rows will match a query.

I expect Keith's solution to be quite slow.  It is executing a
separate query each time through the loop and making a linear scan
through the table to do the offset.

+ seth

Sean Davis

Mon, Jul 5, 2010 3:23 PM #

On Mon, Jul 5, 2010 at 4:09 PM, Seth Falcon <seth at userprimary.net> wrote:

One way of dealing with this situation using RDBMs is to use a set of
triggers (an insert and delete pair) to keep the count in a separate table.
 I'm sure there are some examples/discussion online for creating triggers in
SQLite.

Sean

_______________________________________________
R-sig-DB mailing list -- R Special Interest Group
R-sig-DB at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-db

kMan

Mon, Jul 5, 2010 11:19 PM #

Dear Jonathan,

I do not know of a quick way to query the number of records, the first time.
After that, it should be a simpler matter. I suggest looking into "CREATE
INDEX" as part of your dbGetQuery() call. With clever use of indexes you may
work around needing to create duplicate data in some cases. 

I have not actually used it for this particular case yet, so I do not have a
working example. 

Sincerely,
KeithC.


-----Original Message-----
From: jgrn307 at gmail.com [mailto:jgrn307 at gmail.com] On Behalf Of Jonathan
Greenberg
Sent: Monday, July 05, 2010 1:37 PM
To: kMan
Cc: Seth Falcon; r-sig-db at stat.math.ethz.ch
Subject: Re: concurrent reading/writing in "chunks" with RSQLite (need some
help troubleshooting)

Keith and Seth (and r-sig-db'ers):

I somewhat understand the difference between both of your solutions -- Seth,
I'm not a database person (quickly learning tho) but does your code hold the
output in memory before writing it all at the end (via dbCommit), or is each
iteration's output being stored in some "waiting room" (on disk) where it
gets written at the very end?  The key (for my purposes) is to avoid keeping
more than a chunk's worth of input AND output in memory at any one time.

Along the lines of Keith's solution, I noticed the COUNT query can be VERY
slow -- there is no rapid way of getting the number of rows in a database
table?

Thanks again!

--j

On Tue, Jun 29, 2010 at 9:50 PM, kMan <kchamberln at gmail.com> wrote:

would not be surprised if I missed a goal implied by the code style used or
off-list correspondence.

through the loop required (2) queries, a read and a write, where (I've
assumed) the result sets are handled one at a time anyway. The code I
proposed does the same thing, but overall there is (1) additional query
prior to the loop to get the number of rows. The rest is just case
processing b/c dbGetQuery() seemed to do funny things when LIMIT extended
beyond the end of the table, and it wouldn't remember its last read location
(e.g. didn't work like scan()). In addition, it works with dbWriteTable()
and gets around the need for creating the table first, and filling it in
later.

particular record. The fetch() & fetch_n, were those intended to get the
read queries to start at the last read location (e.g. like serial reads
using scan())? Eliminating need to condition match or generate a db index
first would speed things up, and I'd very much like to learn that trick. I'd
think it would also make for an excellent R-wiki post.

happened with your code.