
Advantages of using SQLite for data import in comparison to csv files

7 messages · Juliet Jacobson, Gabor Grothendieck, Magnus Torfason +1 more

#
Hello everybody out there using R,

I'm using R for the analysis of biological data and write the results
up using LaTeX, both on a notebook running Linux.
I've already tried two options for the import of my data:
1. Import from a SQLite database
2. Import from individual csv files edited with sed, awk and sort.
Both methods actually work very well, since I don't need advanced
features like multi-user network access to the data.
My data sets are tables with up to 20 columns and 1000 rows, containing
mostly numerical values and strings. I might also have to handle
microarray data, but I'm not sure about that yet. In addition, I need
to organise tags for a collection of photos, but that data is of course
not analysed with R.
I'm now beginning work on a larger project and have to decide
whether it is better to use SQLite or csv files for handling my data.
I fear it might get difficult to switch between the two systems after
having accumulated the data, adapted software for backups and revision
control, written makefiles, etc.
Could any of you give me a hint on the additional benefits of
importing data into R from a SQLite database, compared to the simpler
way of organising the data in csv files? Is it, for example, possible to
select values from a column within a certain range from a csv file using awk?
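(As an aside on that last question: awk handles such range selections directly. A minimal sketch, assuming a hypothetical comma-separated file `data.csv` whose third column holds the numeric values of interest:)

```shell
# build a small example file (hypothetical data), then keep only
# the rows whose 3rd comma-separated field lies in [0.5, 1.5]
printf 'a,x,0.2\nb,y,0.7\nc,z,1.4\nd,w,2.0\n' > data.csv
awk -F, '$3 >= 0.5 && $3 <= 1.5' data.csv
```

Here `-F,` sets the field separator to a comma, and the bare condition prints every matching line (rows `b,y,0.7` and `c,z,1.4` above).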

Thanks in advance,
Juliet Jacobson
#
You could look at read.csv.sql in sqldf (http://sqldf.googlecode.com) as well.
On Wed, Jan 13, 2010 at 2:00 PM, Juliet Jacobson <julietjacobson at aim.com> wrote:
#
Thanks for your answer. I hadn't found this possibility by web search.
Since sqldf also allows the import of tables from csv files, complex
SELECT queries and even joins on tables, I have the impression that
there aren't any reasons for using a SQLite database to organise the
data for R.
But then why has the R driver for data import from a SQLite database
been written?
Gabor Grothendieck wrote:
#
sqldf is layered on top of RSQLite.  It's not a separate facility.
On Thu, Jan 14, 2010 at 2:12 AM, Juliet Jacobson <julietjacobson at aim.com> wrote:
#
RSQLite is actually much more than a "driver for data import". It 
basically allows any manipulation of SQLite databases, including 
reading, writing, or updating and munging data within the database.


Regarding the original question of data import:

I use csv and sqlite files interchangeably to store my data, and 
converting between one and the other is generally trivial (read one 
format into a data.frame and then write it into the other).

For me, the key determinant is whether a given data set is so big that 
reading it into a data.frame with read.csv() every time I need it is a 
pain. I usually keep all my original data in csv files and write 
routines to write the contents of huge files into sqlite databases.
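(That csv-to-sqlite step can also be done outside R entirely, with the sqlite3 command-line shell. A minimal sketch with made-up file and table names; recent sqlite3 versions create the table automatically, taking the column names from the csv header row:)

```shell
# build a toy csv (hypothetical contents), then import it into an
# in-memory database and count the imported rows
printf 'id,v\nA,0.5\nB,1.2\nA,0.9\n' > huge.csv
sqlite3 :memory: <<'SQL'
.mode csv
.import huge.csv results
SELECT COUNT(*) FROM results;
SQL
```

The `SELECT COUNT(*)` should report 3, since the header row becomes the column names rather than data.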

sqldf is a great package as well, but it definitely does not obviate the 
need to actually be able to manipulate database files. For one thing, 
you cannot index a csv file or a data.frame. If you have to repeatedly 
select subsets of your large data set, creating an index on the relevant 
column in the sqlite table is an absolute life saver.
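(To make that concrete, here is what the index looks like at the SQL level, using the sqlite3 shell and invented table/column names; the same statements can equally be sent from R via RSQLite:)

```shell
# create a toy table, index the column used for repeated subsetting,
# then run the kind of query the index accelerates
sqlite3 :memory: <<'SQL'
CREATE TABLE results (id TEXT, value REAL);
INSERT INTO results VALUES ('A', 0.5), ('B', 1.2), ('A', 0.9);
-- the index turns repeated "WHERE id = ..." subsets into fast lookups
CREATE INDEX idx_results_id ON results(id);
SELECT value FROM results WHERE id = 'A';
SQL
```

On a table this size the index is pointless, of course; the payoff appears when the same `WHERE` clause is run repeatedly against millions of rows.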

(However, with a 1000x20 data set, you will probably not have to worry 
about the speed of selecting from the table. Unless you need to use 
bigger data sets, the simplest way is probably to just use csv files, 
read the contents into a data.frame with read.csv, and then use sqldf on 
the data.frame if you need to do complicated subsetting.)

Best,
Magnus
On 1/14/2010 2:12 AM, Juliet Jacobson wrote:
#
Thanks for your answer.
I've thought of the possibility of an index when using a SQLite database, but 
as you mentioned, I don't really benefit from it with rather small 
data sets. 
What I consider a problem when using csv files is the occurrence of 
data redundancy: when I want to print a table in a document, it is often not 
necessary to insert all columns of the table. You might choose to leave out 
certain parameters that aren't of interest to the reader. I feared that 
generating a second table with less information would be unavoidable 
in this case before including it in LaTeX with Sweave. However, after having 
tried to get started with sqldf, this powerful addition to R's data analysis 
and manipulation abilities is probably sufficient for my purposes.

Best regards,
Juliet
#
Just to comment on this bit:
This is one reason the data.table package was created.  It is very similar 
to a data.frame, with the addition of things like keys.  If you consider an 
index in sqlite a life saver then data.table might be up your street.

It's really simple to do this in R.

For example, let's say you have a 10,000,000 row table of 2 columns: 
id (a factor, stored as integer) and v (double). So this is a tiny example at 
just 115MB, i.e. (4+8)*1e7/1024^2. This is many times smaller than the RAM 
most desktop users have available to R on their 32-bit machines, so you can 
run this example yourself.

We'll start off with data.frame and then show how data.table compares.

> n = 1e7
> DF = data.frame(id = factor(sample(LETTERS, n, replace = TRUE)), v = rnorm(n))
> head(DF)
  id           v
1  P -0.57695495
2  A -0.33388390
3  P  0.05277701
4  K  0.49610573
5  F  0.31031501
6  C  1.26861215
> dim(DF)
[1] 10000000        2
> system.time(ans1 <- DF[DF$id == "G", ])   # data.frame subset: full vector scan
   user  system elapsed
   1.23    0.17    1.41
> DT = data.table(DF)
> system.time(setkey(DT, id))   # sort the table by id and mark it as the key
   user  system elapsed
   1.36    0.17    1.53
> object.size(DF)
120001608 bytes
> object.size(DT)
120001640 bytes   # so a DT is the same structure as a data.frame, taking the
same space.  data.frame is efficiently stored; you can't get lower than the
115MB calculation for this dataset.  data.table doesn't do any magic with
regard to storage.
# Now comes the fun:
> tables()
     NAME       NROW  MB COLS KEY
[1,] DT   10,000,000 115 id,v id
Total: 115MB
It's not difficult to use the key; it's much like using rownames:

> system.time(ans2 <- DT["G", mult = "all"])   # keyed lookup: binary search
   user  system elapsed
   0.09    0.00    0.09
> system.time(ans2 <- DT["G", mult = "all"])   # repeat, to show it's consistent
   user  system elapsed
    0.1     0.0     0.1
> identical(ans1$v, ans2$v)
[1] TRUE
The mult="all" is needed because otherwise you'd just get the first row of 
the "G" group (the default for mult is "first" when multiple rows match the 
key value).

It's very important to realise that in a data.table query you still have the 
option to do table scans too, e.g.:

> system.time(ans3 <- DT[DT$id == "G", ])   # ignores the key: full scan again
   user  system elapsed
   1.39    0.12    1.52
So it's up to the user of data.table. The user has the option to use 
data.table badly (by not using the key), just like you can use SQL badly 
(by not using an index).

Ok, who cares? 1 second saved. Why all the effort? Two reasons. Firstly, 
try doubling the rows in DF and the time doubles to 3 seconds; double the 
rows in the DT, though, and that time stays constant at 0.1 seconds. Keep on 
doubling and the savings get bigger and bigger. Secondly, if you're 
repeating that 'query' inside a loop then the time saving adds up. There 
are other advantages of data.table, but it seemed appropriate just to mention 
the keys this time.

HTH


"Magnus Torfason" <zulutime.net at gmail.com> wrote in message 
news:4B4F9C70.2050804 at gmail.com...