RSQLite indexing

Tue, Oct 23, 2007 1:09 PM

On Oct 22, 2007, at 6:07 PM, Kasper Daniel Hansen wrote:

On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:

I am trying to use RSQLite for storing data and  I need to create
indexes on
two variables in the table. It appears from searching the web that
the CREATE
INDEX operation in SQLite is relatively slow for large files, and
this has been
my experience as well.

The two index variables are crossed. One has about 350,000 levels
[yes, it's
genetic association data]. The other will have about 4000 levels
eventually,
but is up to about 100 now.   When the data were entered they were
already ordered by this second index variable.

Creating the index took about an hour on the 100-level, presorted
variable and about 12 hours on the 350,000-level unsorted
variable.  I'm looking for advice on how to reduce this. Specifically
1/ would it be faster if the variable with more levels was the
presorted one?
2/ would it be faster or slower if the index were created before
adding all the data?
3/ are there any options that can be set to speed up the indexing?

The SQLite database will not be the primary archive for the data,
so optimizations that are risky in the case of power loss or
hardware failure are still acceptable.  Since Bioconductor seems to
use SQLite a lot I'm hoping there is some simple solution.

I have not used RSQLite, but have some experience doing this thing
for big sqlite databases using the command line client. Every
database in sqlite has a number of parameters associated with it. You
want to make sure that cache_size is at _most_ 2000 (yes, I know this
is totally counterintuitive as it tells sqlite to use as little
memory as possible). You also tell it to be non-synchronous. In
sqlite the commands are
sqlite> pragma default_cache_size = 2000
sqlite> pragma_synchronous = off
You can test the setting of these parameters by just doing a
sqlite> pragma default_cache_size
As far as I remember, cache size can only be set when you create the
database. I have no idea how RSQlite handles it.

When I asked about this problem on the sqlite mailing list, the
sqlite-creator said that this was a "locality of reference problem"
and that it was being "worked on". And that I could search the
archives for more info (which did not help me back then).

I don't know whether or not sorting helps.

Another thing to do is to check in what amount sqlite sits idle while
doing I/O. It is probably impossible to avoid some idleness with such
a thing, but it should of course be kept to a minimum.

It is true that some of the other databases are probably much faster
at creating indices. But in the post-index analysis, sqlite is a
really fast database, probably amongst the fastest there is. It does
not do a good job a converting your queries into smart queries, but
if you are doing something simple, it is blazingly fast with the
right user options.

Let me just emphasize (based on Seth's email) that the  
default_cache_size settings only is for an indexing command. For  
actual operations on the database using the index, like a select  
statement, you would want to increase the cache_size to something  
bigger.

My comments are based on experience with a database with 315 *10^6  
rows, and the importance of this is quite dependent on database size.

Kasper

RSQLite indexing

Thread (8 messages)