
SQL speed vs R

5 messages · Paul Gilbert, Hadley Wickham, Tim Keitt +2 more

#
(snipped from the R-help thread "SQL vs R")

I'm not sure if anyone else has been following the R-help chatter on 
this, but, in addition to much unnecessary acrimony, I am surprised by 
an implied question:
...
...
...
(I am assuming it is the original extract that is considered to be slow. 
If the statement really means that working with a very large dataset is 
much slower than working with a small subset, then no one should be 
surprised. The SQL equivalent to the extracted dataframe would be to 
make a temporary table and work with that.)

I think of R and SQL as mostly complementary. The example query is 
simple in SQL, and relatively ugly in R. But I would also expect it to 
be much faster with an SQL engine; after all, that is what SQL engines 
are designed to do. So the implied suggestion that it might be faster in 
R really surprises me. (Implied by the original question and also 
somewhat endorsed by some responses.)
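
To make this concrete, here is a purely hypothetical example of the
kind of query I have in mind (the actual query from the R-help thread
is snipped above, so the table and column names here are made up):

    ## In SQL, something like:
    ##   SELECT region, AVG(price) AS mean_price
    ##   FROM sales WHERE year = 2010 GROUP BY region;
    ## and a base R equivalent on a data frame called 'sales':
    sub <- sales[sales$year == 2010, ]
    aggregate(price ~ region, data = sub, FUN = mean)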

The things that would seem to me to be obvious candidates for examination:
  - Is the SQL server running out of memory and starting to swap?
  - Are the important fields indexed?

So my questions for R-sig-db followers are:

1/ Has anyone ever had the experience that an R "query" of a dataframe 
is faster than an SQL database for a non-trivial SQL style query on a 
modestly large dataset?  If so, what are the parameters that one should 
think about when considering if a "pure R" solution might be faster?

2/ Assuming no in 1/, is there any programming language that would be 
preferred to an SQL engine for doing SQL style queries? (Please leave 
aside the debate over your favourite flavour of SQL engine, and the 
possibility that the database would be better in some non-SQL 
structure.) If yes, I would really appreciate some explanation. I think 
of the fact that general programming languages need to do other things 
besides SQL queries as something like a constraint, and there is a 
theorem that a constrained optimization can never be better than an 
unconstrained one.

3/ In addition to my two candidates for examination, above, what are the 
most obvious things one should consider when finding that queries are 
slow? (I do realize that people spend a lot of time learning how to 
optimize databases, I'm really just wondering what the "big" things are.)

Thanks,
Paul
#
I think your expectation is wrong. SQL engines are not designed to do
fast in-memory analysis. They are designed to be:

* fast at adding new rows (in an ACID manner)
* fast at working with data on disk

My expectation (after working on dplyr) is that if you can fit the
data in memory, R should be faster. If it's not faster, it's a sign
that the underlying R code is poorly written.
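
For example, with a made-up data frame, an SQL-style filter / group-by
/ aggregate done entirely in memory with dplyr looks like this
(illustrative sketch only):

    library(dplyr)

    ## made-up data: a million rows held entirely in memory
    sales <- data.frame(
      region = sample(c("east", "west", "north", "south"), 1e6, replace = TRUE),
      year   = sample(2000:2014, 1e6, replace = TRUE),
      price  = runif(1e6, 1, 100)
    )

    ## an SQL-style query expressed in R
    sales %>%
      filter(year == 2010) %>%
      group_by(region) %>%
      summarise(mean_price = mean(price), n = n())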

The right indexes can make an RDBMS competitive with naive loops on
in-memory data, but I rarely find subsetting to be a bottleneck.
Indexes help a little with counts, but don't help at all for any other
summaries.
If it fits in memory, R _should_ be faster.
Not a programming language, but I think it's useful to be aware of
columnar databases (e.g. MonetDB). These are column based (like R),
not row based (like most RDBMSs), which means that they're slower at
adding new rows, but faster at doing analytic summaries.
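
To illustrate what "column based (like R)" means: a data frame is
stored as a list of column vectors, so an analytic summary only reads
the columns it needs, which is the same access pattern a column store
is optimised for (continuing the made-up 'sales' data frame from the
sketch above):

    ## a data frame is a list of separate column vectors
    is.list(sales)       # TRUE
    ## summarising one column never touches the others; a row-oriented
    ## store would have to read every row in full to get at 'price'
    mean(sales$price)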

Hadley
#
On Wed, May 7, 2014 at 7:08 AM, Hadley Wickham <h.wickham at gmail.com> wrote:

All very good points. Just remember that the R in RDBMS stands for
relational. They facilitate very large and complex joins, as opposed to
simple filtering or aggregation. Column stores are not always ideal for
complex joins but very good for large scans.

THK

#
The place where I have seen a large difference is when you have to
"join" a large number of tables to get results.  The problem with doing
this in R is the restriction of how much can fit into memory and the
time that it takes to download the data across a data connection if the
data is remote from the machine that you are running R on.  As an
aside, we have a number of stored procedures on the SQL machine that
will call R locally to do some of the processing.  You have to look at
the tradeoff of the time to get the data vs. the time needed to
aggregate the results.  An SQL server is much better at doing joins
across ten different tables and being able to scale that across
multiple processors if they are available.  The resulting table is
typically much smaller when it comes time to download it to your local
R machine.  There is a big difference if these tables start to exceed
the memory that you have on your local R machine.

I would agree that if I can fit the data into memory, there are packages
(data.table & dplyr) that will make the accessing/aggregation of data
pretty fast, but if I can do some of that on a SQL machine, the amount of
data that I have to download and locally process may be a lot smaller.
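
A self-contained sketch of that pattern: do the join and aggregation
on the database side and pull back only the small summary. RSQLite is
used here purely so the example runs on its own; in practice the
connection would point at the remote server, and the table and column
names below are made up:

    library(DBI)

    ## stand-in for a remote SQL server
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "orders",
                 data.frame(order_id = 1:3, cust_id = c(1, 2, 1),
                            amount = c(10, 20, 30)))
    dbWriteTable(con, "customers",
                 data.frame(cust_id = 1:2, region = c("east", "west")))

    ## the join and GROUP BY run in the database; only the small
    ## summary table crosses the connection into R
    summary_df <- dbGetQuery(con, "
      SELECT c.region, SUM(o.amount) AS total
      FROM orders o JOIN customers c ON o.cust_id = c.cust_id
      GROUP BY c.region")

    summary_df
    dbDisconnect(con)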


Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
On Wed, May 7, 2014 at 10:08 AM, Tim Keitt <tkeitt at utexas.edu> wrote:

#
To throw a cat amongst the pigeons...

Postgres and some other backends can use R as a built-in language on
the server side, e.g. http://www.joeconway.com/plr/
I'm afraid I have not used any of these in a serious project.
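
For what it is worth, a small sketch of what PL/R usage looks like,
adapted from the examples in the PL/R documentation (the connection
details are hypothetical, and it assumes PostgreSQL with the plr
extension installed):

    library(DBI)

    ## hypothetical connection to a PostgreSQL server with PL/R
    con <- dbConnect(RPostgreSQL::PostgreSQL(),
                     dbname = "mydb", host = "dbhost")

    ## define an R function that runs inside the database server
    dbExecute(con, "
      CREATE OR REPLACE FUNCTION r_max(integer, integer) RETURNS integer AS '
        if (arg1 > arg2) return(arg1) else return(arg2)
      ' LANGUAGE 'plr' STRICT;")

    ## call it from SQL like any other function
    dbGetQuery(con, "SELECT r_max(5, 9);")
    dbDisconnect(con)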

Kind regards,
Sean
On 7 May 2014 15:30, jim holtman <jholtman at gmail.com> wrote: