
Teaching Mixed Effects

My thanks to Andrew for starting this thread and to Ben for his
responses.  I will add more responses below.

I'm sorry that my responses have been delayed - it happens that this
is the first week of our spring semester and I have been tied up
getting courses underway.
On Tue, Jan 20, 2009 at 8:59 AM, Bolker,Benjamin Michael <bolker at ufl.edu> wrote:
I have not verified the results from the current mcmcsamp, even for
simple Gaussian models.  They seem reasonable for these models but I
need to look at them much more closely before I could advise trusting
those results.

The problem with designing an mcmcsamp method is that the variances of
the random effects can legitimately be zero and often have a
non-negligible probability of assuming the value of zero during the
MCMC iterations.  However, most methods of sampling from the
distribution of a variance are based on sampling from the distribution
of a multiplier of the current value.  If the current value is zero,
you end up stuck there.
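A toy illustration of the problem (a minimal Python sketch with an arbitrary log-normal multiplier, not the actual mcmcsamp code): once a multiplicative proposal reaches zero, every subsequent proposal is also zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def multiplicative_step(sigma2, rng):
    # Propose a new variance as the current value times a positive
    # multiplier (here log-normal).  If sigma2 == 0, every proposal
    # is also 0, so the chain can never leave the boundary.
    return sigma2 * np.exp(rng.normal(scale=0.5))

sigma2 = 0.0
draws = [multiplicative_step(sigma2, rng) for _ in range(1000)]
print(all(d == 0.0 for d in draws))  # True: the sampler is stuck at zero
```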
In a designed experiment you can do this.  In an observational study
you can't expect to end up with balanced data.  I also tell my
students that assuming a balanced design will always produce balanced
data is contrary to Murphy's Law.  Balance is fragile.  I think it is
unwise to depend on balance being achieved.
Exactly.  Adaptive Gauss-Hermite quadrature uses a quadrature formula
centered at the conditional modes of the random effects and scaled by
an approximation to the conditional standard deviations.  (That's
where the term "adaptive" comes from.)  The quadrature formula depends
on the number of quadrature points.  Generally we use an odd number of
points so one of the evaluations is at the conditional mode.  Thus you
could use a 3-point evaluation or a 5-point evaluation.  The simplest
formula, the 1-point Gauss-Hermite evaluation, is exactly the Laplace
approximation.

It turns out that the Laplace approximation is the only feasible such
method (at least the only one that I can imagine) when you have more
than one grouping factor for the random effects.  Even with just one
grouping factor, if you have vector-valued random effects (meaning
that you have more than one random effect associated with each
level of the grouping factor) then the complexity of AGHQ is the
number of quadrature points raised to the dimension of the
vector-valued random effect.  Thus if you have a random intercept and
a random slope for each subject, say, and you choose a 7-point formula
then you must do 49 evaluations of the deviance residuals for each
AGHQ evaluation.
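Here is a minimal one-dimensional sketch of the idea (Python, with arbitrary values for a fixed effect beta and standard deviation sigma; the Newton step and the scaling are a simplified stand-in, not what lmer actually does).  The 1-point rule reproduces the Laplace approximation, and more points sharpen the answer.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Integrand: Bernoulli likelihood (y = 1) with a Gaussian random intercept,
# g(u) = P(y = 1 | u) * N(u; 0, sigma^2).  We want the integral of g(u) du.
beta, sigma = 0.5, 1.2          # arbitrary illustrative values

def h(u):                       # log g(u)
    return (np.log(expit(beta + u)) - 0.5 * u**2 / sigma**2
            - np.log(sigma * np.sqrt(2 * np.pi)))

def aghq(k):
    # Locate the conditional mode by Newton's method.
    u = 0.0
    for _ in range(50):
        p = expit(beta + u)
        grad = (1 - p) - u / sigma**2
        hess = -p * (1 - p) - 1 / sigma**2
        u -= grad / hess
    s = 1.0 / np.sqrt(-hess)    # approximate conditional std. deviation
    # "Adaptive" rule: Gauss-Hermite nodes centred at the mode, scaled by s.
    x, w = np.polynomial.hermite.hermgauss(k)
    nodes = u + np.sqrt(2) * s * x
    return np.sqrt(2) * s * np.sum(w * np.exp(x**2 + h(nodes)))

# Brute-force reference value on a fine grid.
grid = np.linspace(-10.0, 10.0, 200001)
truth = np.sum(np.exp(h(grid))) * (grid[1] - grid[0])

print(abs(aghq(1) - truth))     # 1-point rule = Laplace: small error
print(abs(aghq(9) - truth))     # 9-point AGHQ: much smaller error
```

With two vector-valued random effects per level the nodes would form a k-by-k grid, which is exactly the k^d explosion described above.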

Oliver Schabenberger's paper on SAS PROC GLIMMIX, which Ben mentions
below, discusses the problem of proliferation of the number of
evaluations required by AGHQ.
I think you mean SAS PROC GLIMMIX, not SAS PROC MIXED.

Regarding the comparison of methods provided by different software,
remember that details of the implementation can be important.  The
term "adaptive Gauss-Hermite quadrature" describes a technical
approach but there can be many variations on the implementation, with
important consequences for precision, speed and accuracy.  It is a
gross oversimplification to imagine that such a technique is
implemented by handing a paper with some formulas to a "programmer"
and declaring the job done.  Comparing implementations, including
looking at the intermediate steps, is important but only possible for
open source implementations.
I'm afraid my response will be rather long-winded, for which I
apologize.  I feel that as statisticians we have done a disservice to
those who use statistics by taking complex problems (tests of
hypotheses for terms in complicated model structures) for which there
is an enormous simplification in the central special case (linear
dependence of the mean on the model parameters, "spherical" Gaussian
distribution) and describing certain computational shortcuts that can
be used only in this special case.  For the most part we don't even
hint at the general problem or even discuss the approach used in the
special case.  We jump right in to a particular form of a summary of a
computational shortcut, the analysis of variance table.

I am very pleased to see you describe the situation as a comparison of
two nested models.  That is indeed how we should approach the problem.
 We have a general model and a specific model that is a special case
of the general model.  Obviously the general model will fit the data
at least as well as the more specialized model.  We wish to determine
if the fit is sufficiently better to justify using the more general
model.   To do so we must decide how to judge the extent to which the
general model fits better and how much more complex it is, so we can
form some kind of cost/benefit criterion.  Then we must assess the
value of this test statistic by comparing it to a reference
distribution.  In the special case of a Gaussian linear model it all
works out beautifully and we can summarize the results very compactly
with t statistics or F statistics and their degrees of freedom.  But
this case is special.  It is misleading to believe that things will
simplify like this in more general cases.

Consider the question of how we measure the comparative complexity of
two models. Typically we can measure the difference in the complexity
of the models in terms of the number of additional parameters in the
general model. In the central special case (Gaussian linear models)
there is a geometric representation of the model where the general
model corresponds to a linear subspace of the sample space and the
specific model corresponds to a subspace contained in the general
model.  The spherical Gaussian assumption (i.e. Gaussian distribution
with variance-covariance of sigma^2 I, for which the contours of
constant probability density are spheres) links the probability model
to the Euclidean geometry of the space.  From those two assumptions
the methods of testing a general linear hypothesis can be derived
geometrically.  Gosset derived a statistic that is equivalent to the t
statistic using analytic means but the current form of the t statistic
and its relation to degrees of freedom came from Fisher and were based
on his geometric insight (see the Wikipedia article on Gosset).  And,
of course, Fisher was able to generalize the t distribution to the F
distribution, again based on geometric principles.

In the first chapter of our 1988 book "Nonlinear Regression Analysis
and Its Applications" Don Watts and I illustrate the geometric
approach to the t and F statistics for linear models.  Fisher's genius
was to see that questions about comparative model fits, which are
related to distances in the geometric representation, can be
transformed into questions about angles, related to the ratios of
distances or, equivalently, squared distances of orthogonal components
of a response vector.  The use of the ratio allows one scale factor
(variance component in statistical terminology) to be canceled out.
That is, the null distribution of an F ratio can be expressed without
needing to specify the unknown error variance.
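A quick numerical illustration of that cancellation (a Python sketch with simulated data): rescaling the response, and hence the unknown error scale, leaves the F ratio for comparing two nested linear models unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # made-up illustrative data

def f_stat(y, X_full, X_null):
    # Residual sums of squares from least-squares fits of the two models.
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0])**2)
    rss_full, rss_null = rss(X_full), rss(X_null)
    q = X_full.shape[1] - X_null.shape[1]            # extra parameters
    return ((rss_null - rss_full) / q) / (rss_full / (n - X_full.shape[1]))

X_full = np.column_stack([np.ones(n), x])  # general model: intercept + slope
X_null = np.ones((n, 1))                   # special case: intercept only

f1 = f_stat(y, X_full, X_null)
f2 = f_stat(10.0 * y, X_full, X_null)      # rescale the response
print(np.isclose(f1, f2))                  # True: the scale factor cancels
```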

Regrettably, the "use a ratio to eliminate a variance component" trick
only works once.  The first variance component is free but not
subsequent ones.  If you have a perfectly balanced, orthogonal design
then you can apply the trick multiple times by isolating certain
orthogonal submodels and applying the trick within the submodel.  That
is, you can use estimates from different "error strata" in ratios for
different terms.  However, that approach based on certain mean squares
and expected mean squares doesn't generalize well.  The
computationally tractable approach to estimation of parameters in
mixed models is maximum likelihood or REML estimation.

The reason for my long-winded explanation is your saying " This is the
crux of the Bates et al argument: Likelihood Ratio Tests, Wald tests
etc all need to assume a distribution and some degrees of freedom.",
which is a natural statement given the way that we teach analysis of
variance.  We teach the "what" (i.e. create a table of sums of
squares, degrees of freedom, mean squares, F ratios, p-values) and not
the "why".  If you only see the "what" then it is natural to assume
that there are some magical properties associated with sums of squares
and degrees of freedom and all we need to do is to figure out which
sums of squares and which degrees of freedom to use.  The magical
properties are actually the simplified geometric representation
(orthogonal linear subspaces, Euclidean geometry) that is unique to
the Gaussian linear model.  The beauty of that model is that, no
matter how complicated the representation of a test as a formula may
be, the geometric representation is always the same, as the ratio of
the normalized squared lengths of two orthogonal components of the
response vector.

When we step away from that Gaussian linear model the simplifications
all break down.  I spent the early part of my career thinking of what
parts of the Gaussian linear model can be transferred over to the
Gaussian nonlinear model and what that would mean for inference.  This
is why I concentrated so much on the geometry of models based on the
spherical Gaussian distribution.  Generalized linear models retain the
linear predictor, transformed through an inverse link function to the
appropriate scale for the mean, but allow for distributions other than
the Gaussian.  They require another way of thinking.  I think of mixed
models as being based on two random vectors, the response vector and
the unobserved random effects.  In the case of a Gaussian linear mixed
model the conditional distribution of the response, given the random
effects, is a spherical Gaussian and the unconditional distribution of
the random effects is multivariate Gaussian but our inferences require
the marginal distribution of the response.  For a linear mixed model
this is Gaussian but with more than one variance component.  Getting
rid of just one variance component won't do, yet the t and F
derivations depend strongly on just having one variance component that
can be removed by considering a ratio.

If we want to perform a hypothesis test related to a fixed-effects
term in a mixed model (and, for the moment, I will not go into the
question of whether statistical inferences should always be phrased as
the result of hypothesis tests) I would claim we should start at the
beginning, which is considering two models for the data at hand, one
model being a special case of the other.  We need to decide how we
measure the quality of the fit of the general model relative to the
more specific model and how we measure the additional cost of the
general model.  Then we need to formulate a test statistic.  If we are
incredibly lucky, the null distribution of this test statistic will be
well-defined (that is, it will not depend on the values of other,
unknown parameters) and we can evaluate probabilities associated with
it.  That does happen in the case of the Gaussian linear model.  I
personally don't think it will be possible to provide a
general approach that isolates the effect of a fixed-effect term in a
linear mixed model using a statistic that does not depend on the
values of other parameters.  I would be delighted if someone can do it
but I think there is too much that goes right in the case of the
Gaussian linear model to expect that the same incredible
simplifications will apply to other models.

I don't feel that the holy grail of inference in mixed effects models
should be a magical formula for degrees of freedom to be associated
with some ratio that looks like a t or an F statistic (despite the
strongly held beliefs of those in the First Church of the
Kenward-Roger Approximation). Certainly there has been a lot of
statistical research related to approximating a difficult distribution
by a more common distribution but I view this approach as belonging to
an earlier era.  It is the approach embodied in software like SAS
whose purpose often seems to be to evaluate difficult formulas and
provide reams of output including every number that could possibly be
of interest.  I think we should use the power of current and future
computers to interact with and explore data and models for the data.
MCMC is one way to do this.  In nonlinear regression Don and I
advocated profiling the sum of squares function with respect to the
values of individual parameters as another way of assessing the actual
behavior of the model versus trying to formulate an approximation.
I'm sure that creative people will come up with many other ways to use
the power of computers to this end.  The point is to explore the
actual behavior of the model/data combination, not to do just one fit,
calculate a bunch of summary statistics, apply approximate
distributions to get p-values and go home.
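A minimal sketch of the profiling idea in the simplest possible setting (Python, a linear model with made-up data, where the intercept can be profiled out in closed form): fix one parameter at a grid of values, re-optimize the rest, and examine the profiled sum of squares directly rather than trusting a quadratic approximation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
x = rng.uniform(0, 1, size=n)
y = 1.0 + 2.0 * x + 0.3 * rng.normal(size=n)   # illustrative data

def profile_sse(beta1):
    # Fix the slope at beta1 and minimize the sum of squares over the
    # intercept (closed form: the mean of the partial residuals).
    beta0 = np.mean(y - beta1 * x)
    return np.sum((y - beta0 - beta1 * x)**2)

grid = np.linspace(0.5, 3.5, 61)
prof = np.array([profile_sse(b) for b in grid])

# The minimizer of the profiled SSE is the least-squares slope estimate;
# the *shape* of prof around it is what profiling lets you inspect.
b_hat = np.polyfit(x, y, 1)[0]
print(abs(grid[np.argmin(prof)] - b_hat) < 0.06)  # True (grid step 0.05)
```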

If we want to generalize methods of inference we should consider the
whole chain of reasoning that leads us to the result rather than
concentrating only on the last step, which is "now that I have the
value of a statistic how do I convert it to a p-value?" or, even more
specifically, "I have calculated something that looks like a t-ratio
so I am going to assume that its distribution is indeed a
t-distribution which leaves me with only one question and that is on
how many degrees of freedom".

I appreciate that this is inconvenient to those applying such models
to their data.  Philosophical discussions of the fundamentals of
statistical inference are all well and good but when the referees on
your paper say you have to provide a p-value for a particular term or
test, it is a practical matter, not an academic, theoretical debate.
Those with sufficient energy and skill plus a stellar reputation as a
researcher may be able to convince editors that p-values are not the
"be all and end all" of data analysis - Reinhold Kleigl has been able
to do this in some cases - but that is definitely not the path of
least resistance.  The sad reality is that p-values have taken on a
role as the coin of the realm in science that is far beyond what any
statistician would have imagined.  (Apparently the default
"significance level" of 5%, which is considered in some disciplines to
be carved in stone, resulted from a casual comment by Fisher to the
effect that he might regard an outcome that would be expected to occur
less than, say, 1 time in 20 as "significant".)

It is unhelpful of me not to provide p-values in the lmer summaries
but I develop the software out of interest in doing it as well as I
possibly can and not because someone assigns me a task to compute
something.  I really don't know of a good way of assigning p-values to
t-ratios or F-ratios so I don't.  I still report the ratio of the
estimate divided by its standard error, and even call it a t-ratio,
because I think it is informative.
I think that was data on artificial contraception use obtained as part
of a fertility survey in Bangladesh.

I feel that the likelihood ratio is a perfectly reasonable way of
comparing two model fits where one is a special case of the other.  In
fact, if the models have been fit by maximum likelihood, the
likelihood ratio would, I think, be the first candidate for a test
statistic.  The problem with likelihood ratio tests is not the
likelihood ratio, per se -- it is converting the likelihood ratio to a
p-value.  You need to be able to evaluate the distribution of the
likelihood ratio under the null hypothesis.  The chi-square
approximation to the distribution is exactly that - an approximation -
and its validity depends on not testing at the boundary and on having
a large sample, in some sense of the sample size.  If I were really
interested in evaluating a p-value for the likelihood ratio I would
probably try a parametric bootstrap to get a reference distribution.
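A sketch of what I mean, in the simplest nested-Gaussian setting (Python, with made-up data; a real mixed-model version would refit the lmer models to each simulated response): simulate from the fitted null model, refit both models, and use the simulated likelihood-ratio statistics as the reference distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 2.0 + 0.3 * x + rng.normal(size=n)   # illustrative data

X1 = np.column_stack([np.ones(n), x])    # general model: intercept + slope
X0 = np.ones((n, 1))                     # special case: intercept only

def lrt(y):
    # Gaussian ML likelihood-ratio statistic: n * log(RSS_null / RSS_full)
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0])**2)
    return n * np.log(rss(X0) / rss(X1))

obs = lrt(y)

# Parametric bootstrap: simulate from the *fitted null model* and compute
# the statistic for each simulated response.
b0 = np.linalg.lstsq(X0, y, rcond=None)[0]
s0 = np.sqrt(np.sum((y - X0 @ b0)**2) / n)   # ML scale estimate under null
ref = np.array([lrt(X0 @ b0 + rng.normal(scale=s0, size=n))
                for _ in range(2000)])
p_boot = np.mean(ref >= obs)                 # bootstrap p-value
print("observed LR:", round(obs, 3), " bootstrap p:", p_boot)
```

The same recipe avoids the chi-square approximation entirely, which is the point: the reference distribution is generated, not assumed.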
Actually the MLE or the REML estimate of a variance component can
indeed be zero.  The residual variance (i.e. the variance of the "per
observation" noise term) is only zero for artificial data but the
estimates of other variance components can be exactly zero.

I think of such situations as an indication that I should simplify the
model by eliminating such a term.  I have spent most of my career at
the University of Wisconsin-Madison in a statistics department founded
by George Box who famously said "All models are wrong; some models are
useful."  I don't expect a model to be correct, I am only interested
in whether the terms in the model are useful in explaining the
observed data.