
RMySQL - Bulk loading data and creating FK links

I'm talking about ease of use too.  The first line of the Details section in 
?"[.data.table" says :
   "Builds on base R functionality to reduce 2 types of time :
       1. programming time (easier to write, read, debug and maintain)
       2. compute time"

Once again, I am merely saying that the user has choices, and the best 
choice (and there are many choices including plyr, and lots of other great 
packages and base methods) depends on the task and the real goal.   This 
choice is not restricted to compute time only, as you seem to suggest.  In 
fact I listed programming time first (i.e. ease of use).

To answer your points :

This is the SQL code you posted and I used in the comparison. Notice it's 
quite long,  repeats the text "var1,var2,var3" 4 times, contains two 
'select's and a 'using'.
user  system elapsed
 103.13    2.17  106.23

Isolating the series of operations you described :
user  system elapsed
  39.00    0.63   39.62

So that's roughly 40% of the time. What's happening in the remaining 66 secs?

Here's a repeat of the equivalent in data.table :
user  system elapsed
   0.90    0.13    1.03
user  system elapsed
   3.92    0.78    4.71

I looked at the news section, but I didn't find the benchmarks quickly or 
easily.  The links I saw took me to the FAQs.



"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message 
news:971536df1001280855i1d5f7c03v46f7a3e58ff93948 at mail.gmail.com...
I think one would only be concerned about such internals if one were
primarily interested in performance; otherwise, one would be more
interested in ease of specification and part of that ease is having it
independent of implementation and separating implementation from
specification activities.  An example of separation of specification
and implementation is that, by simply specifying a disk-based database 
rather than an in-memory one, SQL can perform queries that take 
more space than memory can hold.  The query itself need not be modified.
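To make that separation concrete, here is a minimal sketch of the same idea 
using Python's built-in sqlite3 module (an illustration, not code from this 
thread; the table and column names are made up). The only thing that changes 
between an in-memory and a disk-based database is the connect string; the 
query text itself is untouched.

```python
import sqlite3

def total_by_group(db_path):
    # db_path is the only implementation detail: ":memory:" gives an
    # in-memory database, any filename gives a disk-backed one.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE t (grp TEXT, val REAL)")
    con.executemany("INSERT INTO t VALUES (?, ?)",
                    [("a", 1.0), ("a", 2.0), ("b", 3.0)])
    # The specification (the query) is identical in both cases.
    rows = con.execute(
        "SELECT grp, SUM(val) FROM t GROUP BY grp ORDER BY grp"
    ).fetchall()
    con.close()
    return rows

print(total_by_group(":memory:"))   # [('a', 3.0), ('b', 3.0)]
```

Passing a filename instead of ":memory:" moves the work to disk without 
touching the SELECT, which is the separation being described.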

I think the viewpoint you are discussing is primarily one of
performance whereas the viewpoint I was discussing is primarily ease
of use and that accounts for the difference.

I believe your performance comparison is comparing a sequence of
operations that include building a database, transferring data to it,
performing the operation, reading it back in and destroying the
database to an internal manipulation.  I would expect the internal
manipulation, particularly one done primarily in C code as is the case
with data.table, to be faster although some benchmarks of the database
approach found that it compared surprisingly well to straight R code
-- some users of sqldf found that for an 8000 row data frame sqldf
actually ran faster than aggregate and also faster than tapply.  The
News section on the sqldf home page provides links to their
benchmarks.  Thus if R is fast enough then it's likely that the
database approach is fast enough too, since it's even faster.
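The sequence of operations being compared can be spelled out step by step. 
The sketch below is again an illustration in Python's sqlite3 with made-up 
table and column names, not the actual sqldf benchmark; it shows the five 
steps whose combined time is what gets weighed against a purely in-memory 
manipulation.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "work.db")

con = sqlite3.connect(path)                       # 1. build a database
con.execute("CREATE TABLE d (grp TEXT, val REAL)")
con.executemany("INSERT INTO d VALUES (?, ?)",    # 2. transfer the data to it
                [("a", 1.0), ("b", 2.0), ("a", 3.0)])
result = con.execute(                             # 3. perform the operation
    "SELECT grp, SUM(val) FROM d GROUP BY grp ORDER BY grp"
).fetchall()                                      # 4. read the result back in
con.close()
os.remove(path)                                   # 5. destroy the database

print(result)   # [('a', 4.0), ('b', 2.0)]
```

Timing all five steps together is what the comparison above measures, as 
opposed to timing only step 3.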

On Thu, Jan 28, 2010 at 8:52 AM, Matthew Dowle <mdowle at mdowle.plus.com> 
wrote: