I am trying to use RSQLite for storing data and I need to create indexes on
two variables in the table. It appears from searching the web that the CREATE
INDEX operation in SQLite is relatively slow for large files, and this has been
my experience as well.
The two index variables are crossed. One has about 350,000 levels [yes, it's
genetic association data]. The other will have about 4000 levels eventually,
but is up to about 100 now. When the data were entered they were already ordered by this second index variable.
Creating the index took about an hour on the 100-level, presorted variable and about 12 hours on the 350,000-level unsorted variable. I'm looking for advice on how to reduce this. Specifically:
1/ would it be faster if the variable with more levels was the presorted one?
2/ would it be faster or slower if the index were created before adding all the data?
3/ are there any options that can be set to speed up the indexing?
The SQLite database will not be the primary archive for the data, so optimizations that are risky in the case of power loss or hardware failure are still acceptable. Since Bioconductor seems to use SQLite a lot I'm hoping there is some simple solution.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
RSQLite indexing
8 messages · Thomas Lumley, Hadley Wickham, Jeffrey Horner +3 more
On 10/22/07, Thomas Lumley <tlumley at u.washington.edu> wrote:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...] Creating the index took about an hour on the 100-level, presorted variable and about 12 hours on the 350,000-level unsorted variable. I'm looking for advice on how to reduce this.
How big is your dataset? SQLite can be slow, but I didn't think it was that slow.
1/ would it be faster if the variable with more levels was the presorted one?
I didn't think this would matter (and can imagine situations where it would be worse). Is there a particular reason you think this might help?
2/ would it be faster or slower if the index were created before adding all the data?
It's generally much faster to create the index after loading all the data.
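The load-then-index pattern can be sketched with SQLite directly. This uses Python's stdlib sqlite3 module so the example is self-contained and runnable; with RSQLite the same SQL statements would be sent via dbGetQuery. The table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for a real database
con.execute("CREATE TABLE geno (snp TEXT, sample_id TEXT, value REAL)")

# Bulk-load first, index afterwards: SQLite can then build the index
# in one pass instead of updating the index B-tree on every INSERT.
rows = [(f"rs{i}", f"s{i % 100}", 0.0) for i in range(10000)]
with con:
    con.executemany("INSERT INTO geno VALUES (?, ?, ?)", rows)
con.execute("CREATE INDEX idx_snp ON geno(snp)")

# Confirm the index exists.
names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(names)  # ['idx_snp']
```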
3/ are there any options that can be set to speed up the indexing?
Have you tried 'vacuum'ing your database prior to indexing? Will you be indexing on both columns simultaneously? If so, you might try creating a single composite index covering both columns.
Hadley
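A single index in the sense Hadley suggests would be a composite index over both columns. A sketch (again via Python's stdlib sqlite3, with hypothetical table and column names); note that a composite index on (a, b) also serves lookups on a alone, via the leftmost-prefix rule, but not lookups on b alone:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE geno (snp TEXT, sample_id TEXT, value REAL)")
con.executemany("INSERT INTO geno VALUES (?, ?, ?)",
                [("rs1", "s1", 0.1), ("rs1", "s2", 0.2), ("rs2", "s1", 0.3)])

# One composite index covers queries constraining both columns, and
# queries on snp alone (leftmost prefix), but not sample_id alone.
con.execute("CREATE INDEX idx_snp_sample ON geno(snp, sample_id)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT value FROM geno WHERE snp = 'rs1' AND sample_id = 's2'"
).fetchall()
print(plan)  # the plan should report a search USING INDEX idx_snp_sample
```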
On Mon, 22 Oct 2007, hadley wickham wrote:
On 10/22/07, Thomas Lumley <tlumley at u.washington.edu> wrote:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...]
How big is your dataset? SQLite can be slow, but I didn't think it was that slow.
350000x100 entries on 12 variables
1/ would it be faster if the variable with more levels was the presorted one?
I didn't think this would matter (and can imagine situations where it would be worse). Is there a particular reason you think this might help?
Mainly that it would be nice if it did. Preordering might help more for the variable with the larger number of levels, because with fewer levels there are more valid places to insert a record into the index. I could just experiment, but since it takes so long I thought it was worth asking whether anyone had actual knowledge.
2/ would it be faster or slower if the index were created before adding all the data?
It's generally much faster to create the index after loading all the data.
Yes. The reason I thought it might not be in this case is that the indexing operation uses remarkably little CPU (and comes nowhere near saturating the disk bandwidth), so I hoped there might be some saving in doing both at once.
3/ are there any options that can be set to speed up the indexing?
Have you tried 'vacuum'ing your database prior to indexing?
Yes. But it has never had anything deleted from it, so it's not surprising that it didn't help.
Will you be indexing on both columns simultaneously? If so, you might try creating a single index.
No, indexing will be on one or the other column, not on both at once.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
Thomas Lumley wrote on 10/22/2007 04:54 PM:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...] Since Bioconductor seems to use SQLite a lot I'm hoping there is some simple solution.
Well then, while some may think it overkill to run a MySQL database server for one to a handful of clients, the benefits of using typed columns (rather than strings, which might have something to do with the slow indexing) plus all the other goodies could save you the extra day or two spent speeding up SQLite indexing. A project here at Vanderbilt stores (mostly read-only) data in MySQL 5.0 tables in MyISAM format. The tables have 3 to 5 million records each. Index creation for a varchar(9) column with 1,559,100 levels takes roughly 1.5 minutes, for example.
Jeff
Thomas Lumley wrote:
How big is your dataset? SQLite can be slow, but I didn't think it was that slow.
350000x100 entries on 12 variables
I had a similar problem a while back and took the opportunity to test various RDBMS. I found that a product called TimesTen from Oracle was the fastest, followed by Microsoft SQL Server, followed by Oracle 10 Enterprise with bitmap indices, followed by Postgres. RSQLite had relatively poor performance, and I had trouble persuading MySQL to optimize the query at all. I did not try any column-oriented database systems like Sybase IQ or Vertica, which should do better for a read-oriented workload.
On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...] The SQLite database will not be the primary archive for the data, so optimizations that are risky in the case of power loss or hardware failure are still acceptable.
I have not used RSQLite, but I have some experience doing this kind of thing for big SQLite databases using the command-line client. Every database in SQLite has a number of parameters associated with it. You want to make sure that cache_size is at _most_ 2000 (yes, I know this is totally counterintuitive, as it tells SQLite to use as little memory as possible). You also want to tell it to be non-synchronous. In SQLite the commands are

sqlite> pragma default_cache_size = 2000;
sqlite> pragma synchronous = off;

You can check the setting of these parameters by just doing

sqlite> pragma default_cache_size;

As far as I remember, the cache size can only be set when you create the database. I have no idea how RSQLite handles it. When I asked about this problem on the sqlite mailing list, SQLite's creator said that this was a "locality of reference problem" and that it was being "worked on", and that I could search the archives for more info (which did not help me back then). I don't know whether or not sorting helps.

Another thing to do is to check how much SQLite sits idle while doing I/O. It is probably impossible to avoid some idleness with such a task, but it should of course be kept to a minimum.

It is true that some of the other databases are probably much faster at creating indices. But in the post-index analysis, SQLite is a really fast database, probably amongst the fastest there is. It does not do a good job of converting your queries into smart queries, but if you are doing something simple, it is blazingly fast with the right user options.

Kasper
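Kasper's settings can be tried like this (sketched with Python's stdlib sqlite3; RSQLite would send the same PRAGMA statements through dbGetQuery). One caveat: in current SQLite, cache_size is a per-connection setting that can be changed at any time, and default_cache_size is deprecated, so the "only at creation time" restriction no longer applies:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for a real database

# Trade durability for speed; acceptable here since SQLite is not the
# primary archive for the data.
con.execute("PRAGMA synchronous = OFF")
con.execute("PRAGMA cache_size = 2000")  # in pages; Kasper's suggested cap

# Read the settings back to verify them.
sync = con.execute("PRAGMA synchronous").fetchone()[0]
cache = con.execute("PRAGMA cache_size").fetchone()[0]
print(sync, cache)  # 0 2000
```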
Jeffrey Horner <jeff.horner at vanderbilt.edu> writes:
Thomas Lumley wrote on 10/22/2007 04:54 PM:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. It appears from searching the web that the CREATE INDEX operation in SQLite is relatively slow for large files, and this has been my experience as well.
What is your schema? In particular, are things that are integers or floats being stored that way in SQLite?

I believe the annotation data packages via AnnotationDbi are using cache_size=64000 and synchronous=0, and that this was determined by a handful of experiments on typical annotation dbs.

Columns with few levels may not benefit from an index. See this thread: http://thread.gmane.org/gmane.comp.db.sqlite.general/23683/focus=23693 But your column with many levels shouldn't suffer from this problem :-)

+ seth
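Seth's point about storage types can be checked directly: declared column types give SQLite columns a type affinity, so a value inserted into an INTEGER column is stored and compared as an integer rather than as a string. A minimal sketch (Python stdlib sqlite3; the schema is hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Declared types set column affinity: INTEGER coerces numeric-looking
# text to integers; TEXT coerces numbers to strings.
con.execute("CREATE TABLE t (pos INTEGER, chrom TEXT)")
con.execute("INSERT INTO t VALUES ('12345', 1)")

row = con.execute("SELECT typeof(pos), typeof(chrom) FROM t").fetchone()
print(row)  # ('integer', 'text')
```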
Seth Falcon | seth at userprimary.net | blog: http://userprimary.net/user/
On Oct 22, 2007, at 6:07 PM, Kasper Daniel Hansen wrote:
On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...]
I have not used RSQLite, but have some experience doing this thing for big sqlite databases using the command line client. [...]
Let me just emphasize (based on Seth's email) that the default_cache_size setting is only for the indexing command. For actual operations on the database using the index, like a SELECT statement, you would want to increase the cache_size to something bigger. My comments are based on experience with a database of 315 million rows, and the importance of this is quite dependent on database size.
Kasper
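Kasper's two-phase advice (a small cache while building the index, a larger cache for the subsequent query-heavy phase) can be sketched as follows. This again uses Python's stdlib sqlite3 as a runnable stand-in for RSQLite, with made-up names; cache_size is per-connection and can be changed between statements:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for a real database
con.execute("CREATE TABLE t (k INTEGER, v REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, i * 0.5) for i in range(1000)])

# Phase 1: small cache while building the index.
con.execute("PRAGMA cache_size = 2000")
con.execute("CREATE INDEX idx_k ON t(k)")

# Phase 2: larger cache for SELECT-heavy use of the index.
con.execute("PRAGMA cache_size = 64000")
n = con.execute("SELECT count(*) FROM t WHERE k < 10").fetchone()[0]
print(n)  # 10
```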