I am trying to use RSQLite for storing data and I need to create indexes on
two variables in the table. It appears from searching the web that the CREATE
INDEX operation in SQLite is relatively slow for large files, and this has been
my experience as well.
The two index variables are crossed. One has about 350,000 levels [yes, it's
genetic association data]. The other will have about 4000 levels eventually,
but is up to about 100 now. When the data were entered they were already ordered by this second index variable.
Creating the index took about an hour on the 100-level, presorted variable and about 12 hours on the 350,000-level unsorted variable. I'm looking for advice on how to reduce this. Specifically:
1/ would it be faster if the variable with more levels was the presorted one?
2/ would it be faster or slower if the index were created before adding all the data?
3/ are there any options that can be set to speed up the indexing?
The SQLite database will not be the primary archive for the data, so optimizations that are risky in the case of power loss or hardware failure are still acceptable. Since Bioconductor seems to use SQLite a lot I'm hoping there is some simple solution.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
RSQLite indexing
8 messages · Thomas Lumley, Hadley Wickham, Jeffrey Horner +3 more
On 10/22/07, Thomas Lumley <tlumley at u.washington.edu> wrote:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...] Creating the index took about an hour on the 100-level, presorted variable and about 12 hours on the 350,000-level unsorted variable. I'm looking for advice on how to reduce this.
How big is your dataset? SQLite can be slow, but I didn't think it was that slow.
1/ would it be faster if the variable with more levels was the presorted one?
I didn't think this would matter (and can imagine situations where it would be worse). Is there a particular reason you think this might help?
2/ would it be faster or slower if the index were created before adding all the data?
It's generally much faster to create the index after loading all the data.
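The load-then-index pattern can be sketched with SQLite directly. This uses Python's stdlib sqlite3 module so the example is self-contained and runnable; with RSQLite the same SQL statements would be sent via dbGetQuery. The table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for a real database
con.execute("CREATE TABLE geno (snp TEXT, sample_id TEXT, value REAL)")

# Bulk-load first, index afterwards: SQLite can then build the index
# in one pass instead of updating the index B-tree on every INSERT.
rows = [(f"rs{i}", f"s{i % 100}", 0.0) for i in range(10000)]
with con:
    con.executemany("INSERT INTO geno VALUES (?, ?, ?)", rows)
con.execute("CREATE INDEX idx_snp ON geno(snp)")

# Confirm the index exists.
names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(names)  # ['idx_snp']
```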
3/ are there any options that can be set to speed up the indexing?
Have you tried 'vacuum'ing your database prior to indexing? Will you be indexing on both columns simultaneously? If so, you might try creating a single composite index covering both columns.
Hadley
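A single index in the sense Hadley suggests would be a composite index over both columns. A sketch (again via Python's stdlib sqlite3, with hypothetical table and column names); note that a composite index on (a, b) also serves lookups on a alone, via the leftmost-prefix rule, but not lookups on b alone:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE geno (snp TEXT, sample_id TEXT, value REAL)")
con.executemany("INSERT INTO geno VALUES (?, ?, ?)",
                [("rs1", "s1", 0.1), ("rs1", "s2", 0.2), ("rs2", "s1", 0.3)])

# One composite index covers queries constraining both columns, and
# queries on snp alone (leftmost prefix), but not sample_id alone.
con.execute("CREATE INDEX idx_snp_sample ON geno(snp, sample_id)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT value FROM geno WHERE snp = 'rs1' AND sample_id = 's2'"
).fetchall()
print(plan)  # the plan should report a search USING INDEX idx_snp_sample
```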
On Mon, 22 Oct 2007, hadley wickham wrote:
On 10/22/07, Thomas Lumley <tlumley at u.washington.edu> wrote:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...]
How big is your dataset? SQLite can be slow, but I didn't think it was that slow.
350000x100 entries on 12 variables
1/ would it be faster if the variable with more levels was the presorted one?
I didn't think this would matter (and can imagine situations where it would be worse). Is there a particular reason you think this might help?
Mainly that it would be nice if it did. Preordering might help more for the variable with the larger number of levels, because with fewer levels there are more valid places to insert a record into the index. I could just experiment, but since it takes so long I thought it was worth asking whether anyone had actual knowledge.
2/ would it be faster or slower if the index were created before adding all the data?
It's generally much faster to create the index after loading all the data.
Yes. The reason I thought it might not be in this case is that the indexing operation uses remarkably little CPU (and comes nowhere near saturating the disk bandwidth), so I hoped there might be some saving in doing both at once.
3/ are there any options that can be set to speed up the indexing?
Have you tried 'vacuum'ing your database prior to indexing?
Yes. But it has never had anything deleted from it, so it's not surprising that it didn't help.
Will you be indexing on both columns simultaneously? If so, you might try creating a single index.
No, indexing will be on one or the other column, not on both at once.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
Thomas Lumley wrote on 10/22/2007 04:54 PM:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...] Since Bioconductor seems to use SQLite a lot I'm hoping there is some simple solution.
Well then, while some may think it overkill to run a MySQL database server for one to a handful of clients, the benefits of using typed columns (rather than strings, which might have something to do with the slow indexing) plus all the other goodies could save you the extra day or two spent speeding up SQLite indexing. A project here at Vanderbilt stores (mostly read-only) data in MySQL 5.0 tables in MyISAM format. The tables have 3 to 5 million records each. Index creation for a varchar(9) column with 1,559,100 levels takes roughly 1.5 minutes, for example.
Jeff
Thomas Lumley wrote:
How big is your dataset? SQLite can be slow, but I didn't think it was that slow.
350000x100 entries on 12 variables
I had a similar problem a while back and took the opportunity to test various RDBMS. I found that a product called TimesTen from Oracle was the fastest, followed by Microsoft SQL Server, followed by Oracle 10 Enterprise with bitmap indices, followed by Postgres. RSQLite had relatively poor performance, and I had trouble persuading MySQL to optimize the query at all. I did not try any column-oriented database systems like Sybase IQ or Vertica, which should do better for a read-oriented workload.
On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...] The SQLite database will not be the primary archive for the data, so optimizations that are risky in the case of power loss or hardware failure are still acceptable.
I have not used RSQLite, but I have some experience doing this kind of thing for big SQLite databases using the command-line client. Every database in SQLite has a number of parameters associated with it. You want to make sure that cache_size is at _most_ 2000 (yes, I know this is totally counterintuitive, as it tells SQLite to use as little memory as possible). You also want to tell it to be non-synchronous. In SQLite the commands are

sqlite> pragma default_cache_size = 2000;
sqlite> pragma synchronous = off;

You can check the setting of these parameters by just doing

sqlite> pragma default_cache_size;

As far as I remember, the cache size can only be set when you create the database. I have no idea how RSQLite handles it. When I asked about this problem on the sqlite mailing list, SQLite's creator said that this was a "locality of reference problem" and that it was being "worked on", and that I could search the archives for more info (which did not help me back then). I don't know whether or not sorting helps.

Another thing to do is to check how much SQLite sits idle while doing I/O. It is probably impossible to avoid some idleness with such a task, but it should of course be kept to a minimum.

It is true that some of the other databases are probably much faster at creating indices. But in the post-index analysis, SQLite is a really fast database, probably amongst the fastest there is. It does not do a good job of converting your queries into smart queries, but if you are doing something simple, it is blazingly fast with the right user options.

Kasper
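Kasper's settings can be tried like this (sketched with Python's stdlib sqlite3; RSQLite would send the same PRAGMA statements through dbGetQuery). One caveat: in current SQLite, cache_size is a per-connection setting that can be changed at any time, and default_cache_size is deprecated, so the "only at creation time" restriction no longer applies:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for a real database

# Trade durability for speed; acceptable here since SQLite is not the
# primary archive for the data.
con.execute("PRAGMA synchronous = OFF")
con.execute("PRAGMA cache_size = 2000")  # in pages; Kasper's suggested cap

# Read the settings back to verify them.
sync = con.execute("PRAGMA synchronous").fetchone()[0]
cache = con.execute("PRAGMA cache_size").fetchone()[0]
print(sync, cache)  # 0 2000
```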
Jeffrey Horner <jeff.horner at vanderbilt.edu> writes:
Thomas Lumley wrote on 10/22/2007 04:54 PM:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. It appears from searching the web that the CREATE INDEX operation in SQLite is relatively slow for large files, and this has been my experience as well.
What is your schema? In particular, are things that are integers or floats being stored that way in SQLite?

I believe the annotation data packages via AnnotationDbi are using cache_size=64000 and synchronous=0, and that this was determined by a handful of experiments on typical annotation dbs.

Columns with few levels may not benefit from an index. See this thread: http://thread.gmane.org/gmane.comp.db.sqlite.general/23683/focus=23693 But your column with many levels shouldn't suffer from this problem :-)

+ seth
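Seth's point about storage types can be checked directly: declared column types give SQLite columns a type affinity, so a value inserted into an INTEGER column is stored and compared as an integer rather than as a string. A minimal sketch (Python stdlib sqlite3; the schema is hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Declared types set column affinity: INTEGER coerces numeric-looking
# text to integers; TEXT coerces numbers to strings.
con.execute("CREATE TABLE t (pos INTEGER, chrom TEXT)")
con.execute("INSERT INTO t VALUES ('12345', 1)")

row = con.execute("SELECT typeof(pos), typeof(chrom) FROM t").fetchone()
print(row)  # ('integer', 'text')
```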
Seth Falcon | seth at userprimary.net | blog: http://userprimary.net/user/
On Oct 22, 2007, at 6:07 PM, Kasper Daniel Hansen wrote:
On Oct 22, 2007, at 2:54 PM, Thomas Lumley wrote:
I am trying to use RSQLite for storing data and I need to create indexes on two variables in the table. [...]
I have not used RSQLite, but have some experience doing this thing for big sqlite databases using the command line client. [...]
Let me just emphasize (based on Seth's email) that the default_cache_size setting is only for the indexing command. For actual operations on the database using the index, like a SELECT statement, you would want to increase the cache_size to something bigger. My comments are based on experience with a database of 315 million rows, and the importance of this is quite dependent on database size.
Kasper
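Kasper's two-phase advice (a small cache while building the index, a larger cache for the subsequent query-heavy phase) can be sketched as follows. This again uses Python's stdlib sqlite3 as a runnable stand-in for RSQLite, with made-up names; cache_size is per-connection and can be changed between statements:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for a real database
con.execute("CREATE TABLE t (k INTEGER, v REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, i * 0.5) for i in range(1000)])

# Phase 1: small cache while building the index.
con.execute("PRAGMA cache_size = 2000")
con.execute("CREATE INDEX idx_k ON t(k)")

# Phase 2: larger cache for SELECT-heavy use of the index.
con.execute("PRAGMA cache_size = 64000")
n = con.execute("SELECT count(*) FROM t WHERE k < 10").fetchone()[0]
print(n)  # 10
```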