On R performance

I've been working on an academic project on R performance for the last
couple of years, which has involved writing an interpreter for R from
scratch and a JIT for R vector operations.

With the recent comments on Julia, I thought I'd share some thoughts
from my experience since they differ substantially from the common
speculation on R performance.

I went into the project thinking that R would be slow for the commonly
cited reasons: NAs, call-by-value, immutable values, ability to
dynamically add/remove variables from environments, etc. But this is
largely *not* true. It does require being somewhat clever, but most of
the cost of these features can be either eliminated or moved to
uncommon cases that won't affect most code. And there's plenty of room
for innovation here. The history of JavaScript runtimes over the last
decade has shown that dramatic performance improvements are possible
even for difficult languages.
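
A toy sketch of what "moving the cost to uncommon cases" can look like
for call-by-value semantics. It is written in Python for brevity and is
not taken from any R implementation; Vec, bind, and assign_elem are
hypothetical names invented for the illustration. The idea is that an
element assignment only needs to copy when the value is shared by more
than one binding, so the common unshared case can update in place.

```python
class Vec:
    """Toy stand-in for an R vector: a payload plus a count of bindings."""
    copies = 0  # how many real copies have been made (for the demo only)

    def __init__(self, data):
        self.data = list(data)
        self.refs = 1  # one variable is bound to this value


def bind(v):
    """Model `b <- a`: share the payload and note the extra binding."""
    v.refs += 1
    return v


def assign_elem(v, i, x):
    """Model `v[i] <- x` under value semantics; returns the vector to rebind."""
    if v.refs > 1:
        # Uncommon case: the value is shared, so copy before writing.
        v.refs -= 1
        v = Vec(v.data)
        Vec.copies += 1
    # Common case: sole owner, so update in place with no copy.
    v.data[i] = x
    return v


a = Vec([1, 2, 3])
a = assign_elem(a, 0, 10)   # unshared: no copy
b = bind(a)                 # b and a now share one payload
b = assign_elem(b, 0, 99)   # shared: exactly one copy, a is untouched
print(a.data, b.data, Vec.copies)  # [10, 2, 3] [99, 2, 3] 1
```

The program still sees pure value semantics (a is unaffected by the
write through b), but the copy is paid for only when sharing actually
occurs.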

This is good news. I think we can keep essentially everything that
people like about R and still achieve great performance.

So why is R performance poor now? I think the fundamental reason is
related to software engineering: R is nearly impossible to experiment
with, so no one tries out new performance techniques on it. There are
two main issues here:

1) The R Language Definition doesn't get enough love. I could point
out plenty of specific problems, omissions, etc., but I think the
high-level problem is that the Language Definition currently conflates
three things: (a) the actual language definition, (b) the definition of
what is more properly the standard library, and (c) the implementation.
This conflation hides how simple the R/S language actually is and, by
assuming that the current implementation is the only implementation,
obscures performance improvements that could be made by changing the
implementation.

2) The R core implementation (e.g. everything in src/main) is too big.
There are ~900 functions listed in names.c. This has got to be simply
unmanageable. If one were to change the SEXP representation, how many
internal functions would have to be checked and updated? This is a
severe hindrance to improving performance.

I see little value in debating changes to the language semantics until
we've addressed this low-hanging fruit and at least tried to make the
current R/S semantics run fast.

Justin