RSQLite and transparent compression
Well, the technical question is really whether you can do SQLite operations on a compressed database. Otherwise, all you can do is externally compress the database and then decompress it every time you want to access it, which may be tedious and slow. SQLite is used a lot on devices with very limited resources, so it is entirely possible that there is some compression option, which is why I suggest you read the documentation (argh!). Finally, 10-20GB for a text file is not that big. If you do not have enough RAM, you must be working on a constrained system.

Kasper
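The compress-externally workflow Kasper describes can be sketched in R. This is a minimal, self-contained sketch assuming the R.utils package is available for gzip()/gunzip(); the file names and the mtcars stand-in data are illustrative only:

```r
library(DBI)
library(RSQLite)
library(R.utils)  # for gzip()/gunzip()

# Build a small example database to stand in for the real one
db <- file.path(tempdir(), "example.sqlite")
con <- dbConnect(SQLite(), db)
dbWriteTable(con, "mtcars", mtcars)
dbDisconnect(con)

# Compress externally; the .sqlite file is replaced by example.sqlite.gz
gzip(db, remove = TRUE)

# To query again, decompress first, then connect as usual
gunzip(paste0(db, ".gz"), remove = TRUE)
con <- dbConnect(SQLite(), db)
n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM mtcars")$n
dbDisconnect(con)
```

The obvious downside, as noted above, is that the whole database must be decompressed to disk before any query can run, which is exactly the tedium being discussed.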
On Tue, Aug 6, 2013 at 12:35 AM, Grant Farnsworth <gvfarns at gmail.com> wrote:
On Tue, Aug 6, 2013 at 12:02 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
What do you mean by large? You are aware you can have an in-memory version of a SQLite database (whether that helps depends on the size of course)? If you operate on a disk-based database, fast I/O helps a lot, perhaps even copying the database to a local drive. I don't know anything about compression though, but in general I have found the sqlite.org website and its mailing list to be super helpful.
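The in-memory option mentioned above is just a special database name in RSQLite. A minimal sketch (the iris table is illustrative; the data vanishes when the connection closes):

```r
library(DBI)
library(RSQLite)

# ":memory:" keeps the entire database in RAM -- fast, but limited by
# available memory and discarded when the connection is closed
mem <- dbConnect(SQLite(), ":memory:")
dbWriteTable(mem, "iris", iris)
res <- dbGetQuery(mem,
  "SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species")
dbDisconnect(mem)
```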
Not outrageously large. I'd say 10-20GB each as text delimited files. Still, it's too large to put in RAM and work with. This is why I use SQLite.

I get these files as gzipped delimited text files, then I read them a million lines or so at a time using scan(), do some basic clean up, and stuff them into a big SQLite database. When I want to use the data, I just subset the stuff I need, which fits comfortably into RAM. If the datasets were small enough, I'd just store them in an R data file...then I wouldn't have to worry about type conversions or variable name issues.

I guess it just seems wasteful to have these huge files sitting around (or move them across networks) when the raw data was compressed and I know the SQLite databases would compress nicely as well. That's why I'm specifically looking for a compression solution. I'd be open to other approaches, of course. For example, I could imagine ways to append the data into a dataframe in an .rda or .rds file and then subset it later without ever having to load the whole thing into RAM if I used some of the big data packages, but besides the file size I'm pretty happy with the SQLite solution---it just seemed like transparent zipping might be available and I was surprised to find that it wasn't.

By the way, speed isn't a critical issue. It's not super time-sensitive work and the network to my file server is plenty fast. It just seems like I might have missed an obvious way to save the space and time that lack of compression causes.
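The chunked load described above (gzipped text in, SQLite out, subset later) can be sketched with base connections and dbWriteTable(append = TRUE). This is a self-contained toy version: the gzipped CSV is generated from mtcars, the tiny chunk_size stands in for the million-line chunks mentioned, and readLines() replaces scan() for simplicity:

```r
library(DBI)
library(RSQLite)

# A small gzipped CSV standing in for the large delimited files
csv_gz <- file.path(tempdir(), "big.csv.gz")
gz <- gzfile(csv_gz, "w")
write.csv(mtcars, gz, row.names = FALSE)
close(gz)

# Fresh database for the demo
dbpath <- file.path(tempdir(), "big.sqlite")
if (file.exists(dbpath)) file.remove(dbpath)
db <- dbConnect(SQLite(), dbpath)

chunk_size <- 10  # in practice: a million lines or so

# Stream the gzipped file in chunks, appending each to the database
con <- gzfile(csv_gz, "r")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  # basic clean-up of `chunk` would go here
  dbWriteTable(db, "cars", chunk, append = TRUE)
}
close(con)

# Later: pull only the subset that fits comfortably in RAM
subset_df <- dbGetQuery(db, "SELECT * FROM cars WHERE cyl = 4")
total <- dbGetQuery(db, "SELECT COUNT(*) AS n FROM cars")$n
dbDisconnect(db)
```

Note that gzfile() gives transparent decompression on the *input* side only; the resulting .sqlite file itself remains uncompressed, which is the gap being complained about.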
_______________________________________________
R-sig-DB mailing list -- R Special Interest Group
R-sig-DB at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-db