-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-
models-bounces at r-project.org] On Behalf Of Paul Johnson
Sent: Wednesday, September 23, 2009 6:43 PM
To: r-sig-mixed-models at r-project.org
Subject: [R-sig-ME] I'm sorry, and here is what I mean to ask about
speed
I'm sorry I made Doug mad, and I'm sorry to have led the discussion off
into such a strange, disagreeable place.
Now that I understand your answers, I believe I can ask the question
in a non-naive way. This version should not provoke the harsh words
that my awkward original question triggered.
New Non-Naive version of the Speed Question
Do you have a copy of HLM6 on your system? Maybe you could help me by
running the same model in both R (with any of the packages such as
lme4 or nlme) and HLM6, and letting us all know whether you get
similar estimates and how long each one takes to run.
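Lacking access to HLM6 myself, here is roughly what I have in mind for
the R side: a minimal sketch using the sleepstudy data that ships with
lme4 as a stand-in for a two-level dataset (substitute your own data
and formula), timing the same model in lmer() and in nlme's lme():

```r
library(lme4)    # for lmer() and the sleepstudy example data
library(nlme)    # for lme(), as a second point of comparison

## A simple two-level model: random intercept and slope for Days by
## Subject, fit by REML (the default in both packages).
t_lmer <- system.time(
  fm_lmer <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
)
t_lme <- system.time(
  fm_lme <- lme(Reaction ~ Days, random = ~ Days | Subject,
                data = sleepstudy)
)

## Compare fixed-effect estimates and elapsed times.
print(fixef(fm_lmer))
print(fixef(fm_lme))
print(rbind(lmer = t_lmer, lme = t_lme))
```

If the fixed effects and variance components agree to several decimal
places, any remaining timing difference is about implementation, not
about the answer.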
Here's why I ask.
...
Perhaps you think there are good reasons why R estimation takes longer.
E.g.:
1. HLM's programmers have full access to, and can benefit from, the
optimizations in lmer and other open programs, but they don't share
their own optimizations in return.
2. lmer and other R routines make the calculations in a better, more
accurate way, so we should not worry that they take longer.
That was my first guess: in the original mail I said I thought HLM
was using PQL whereas lmer uses the Laplace approximation or adaptive
Gauss-Hermite quadrature. But Doug's comment indicated that I was
mistaken to expect a difference there, because REML is the default in
lmer, it is also what HLM does, and no quadrature or integral
approximation is involved in a linear mixed model (Gaussian response).
Here's Doug's comment:
<Doug Bates>
But you don't need to speculate about what lmer does. It is Open
Source so you can check for yourself.
However, this does bring up another point which is the need to compare
apples with apples when you are benchmarking software. If the data
import and model specification stages in HLM6 create the necessary
matrix structures for performing the iterative fit then does the time
to fit the model consist solely of the optimization and summary
stages? Using an expression like
system.time(fm1 <- lmer(...))
is assessing the time to take the original data, which could be in a
very general form, create all those internal structures and perform
the optimization.
You should bear in mind that a lot of that construction of the model
structures is written in R exactly so that it is capable of fitting
very general model specifications. The code in HLM6 is, I imagine,
compiled code, which is possible because it targets a very specific
task, and compiled code is always going to be much faster than
interpreted code.
</Doug Bates>
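To separate the setup cost from the optimization cost along the lines
Doug describes, one possibility (a sketch, assuming a version of lme4
that exports its modular fitting functions lFormula, mkLmerDevfun,
optimizeLmer, and mkMerMod) is to time each stage on its own:

```r
library(lme4)

## Stage 1: parse the formula and build the model matrices and
## random-effects structures from the raw data frame.
t_setup <- system.time(
  parsed <- lFormula(Reaction ~ Days + (Days | Subject),
                     data = sleepstudy)
)

## Stage 2: construct the deviance function and run only the
## optimizer -- closer to what a specialized program like HLM6
## may be reporting as its "fit" time.
t_fit <- system.time({
  devfun <- do.call(mkLmerDevfun, parsed)
  opt    <- optimizeLmer(devfun)
})

## Stage 3: package the result into a fitted-model object.
fm <- mkMerMod(environment(devfun), opt, parsed$reTrms, fr = parsed$fr)

print(rbind(setup = t_setup, fit = t_fit))
```

That makes the benchmark closer to apples-with-apples: if most of the
elapsed time sits in stage 1, the comparison with HLM6's reported fit
time is not measuring the same thing.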
So part of the speed difference will be that R is an interpreted
language, whereas HLM6 is compiled.
The other part is the construction and handling of the model matrix,
which is a tough one to compare, as lmer() can handle more general models.
Will your colleague only be fitting models that are within the
specifications of HLM6, or will your colleague have some datasets with
structure that HLM6 cannot handle, so that the data must be shoe-horned
into HLM6 with compromises that would not be needed in lmer()?
If the former, then some clever programming (potentially both in R and
in C) can yield a specialized version of lmer() that is comparable in
speed to HLM6. (I've done such modifications to several functions over
the years, so I have stopped believing that compiled code is always
faster than R; after all, much of R is compiled C.)
If the latter, then the flexibility of the interpreted-language version,
and its implementation speed (i.e., versus recoding and recompiling a
specialized C program to fit each new scenario), generally beats the
compiled-language version.
On the other hand, perhaps you are (like me) surprised by this
difference and want to help me figure out its cause. If you have
ideas about that, maybe we can work together (I don't suck at C!).
I have a fair amount of experience profiling C programs, and this
summer I helped optimize a fairly large C++-based R package.
So far, I have a simple observer's interest in this question. I
advise people on whether it is worth spending their scarce resources
on a commercial package like HLM6, and one of the factors that
matters to them is how "fast" the programs are. I personally don't
see an urgent reason to buy HLM just because it can estimate a model
in 1 second while an open source approach requires 50 seconds.
When I need to fit hundreds or thousands of models, I overcome the
speed deficit of the interpreted language by using a compute cluster,
which is far cheaper than the cost of my time, or other programmers'
time, to code and compile specialty software for the great variety of
problems the interpreted language can already handle.
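Even on a single multi-core machine the many-models case can be spread
across processors. A sketch using the parallel package's mclapply()
(at the time of this thread the equivalent lived in the multicore and
snow packages), fitting the same random-intercept model to each of a
list of simulated datasets:

```r
library(lme4)
library(parallel)  # mclapply(); on Windows use parLapply() with a
                   # cluster object instead, since mc.cores > 1 forks

## Simulate a list of small two-level datasets: 20 groups of 10.
make_data <- function(seed) {
  set.seed(seed)
  g <- factor(rep(1:20, each = 10))
  data.frame(g = g,
             x = rnorm(200),
             y = 2 + rnorm(20)[g] + 0.5 * rnorm(200))
}
datasets <- lapply(1:100, make_data)

## Fit the same random-intercept model to each dataset, two fits
## running at a time, and collect the fixed-effect estimates.
fits <- mclapply(datasets,
                 function(d) fixef(lmer(y ~ x + (1 | g), data = d)),
                 mc.cores = 2)
estimates <- do.call(rbind, fits)
print(colMeans(estimates))
```

The per-model overhead of the interpreted setup code is paid in
parallel, which is usually a better trade than writing and maintaining
a specialized compiled fitter.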
All that said, the learning curve for the S language is somewhat
steep and a bit long. If you just have to stuff some data into
something right away and get some numbers out, the $500 or so to
purchase HLM6 may be cheaper than learning R. But if you're in it
for the long haul, learning how to drive this Race caR is sweet.
Steven McKinney, Ph.D.
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre
email: smckinney -at- bccrc +dot+ ca
tel: 604-675-8000 x7561
BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C.
V5Z 1L3
Canada
But I'm not the one making the decision. If I can make the R
version run almost as fast as HLM6, or provide reasons why people
might benefit from using a program that takes longer, then I can do my
job of advising the users.
I am sorry if this question appears impertinent or insulting. I do not
mean it as a criticism.
--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas