understanding log-likelihood/model fit

On Wed, Aug 20, 2008 at 2:24 PM, Andrew Robinson
I'll take a swing at these.

On Wed, Aug 20, 2008 at 02:01:29PM +0100, Daniel Ezra Johnson wrote:
Everyone agrees about what happens here:

Nsubj <- 10
Ngrp <- 2
NsubjRep <- 5
set.seed(123)
test1s <- data.frame(subject = rep(seq(Nsubj * Ngrp), each = NsubjRep),
       response=500+c(rep(-100,Nsubj * NsubjRep),rep(100,Nsubj *
NsubjRep))+rnorm(Nsubj * Ngrp * NsubjRep, 0, 10),
       fixed=(rep(c("A","B"),each=Nsubj * NsubjRep)))
null1 <- lmer(response~(1|subject),test1s)
fixed1 <- lmer(response~fixed+(1|subject),test1s)

I still have two questions which I'll try to restate. I should note
that I have attempted to understand the mathematical details of ML
mixed effect model fitting and it's a bit beyond me. But I hope that
someone can provide an answer I can understand.

Question 1: When you have an "outer" fixed effect and a "subject"
random effect in the same model, specifically why does the model
(apparently) converge in such a way that the fixed effect is maximized
and the random effect minimized? (Not so much why should it, as why
does it? This is the 'fixed1' case.)
I approach this from a slightly different point of view than does
Andrew.  I prefer to think of the process of estimating the fixed
effects parameters and the random effects as a penalized estimation
problem where the penalty is applied to the random effects only.  The
magnitude of the penalty depends on the (unconditional) variance
covariance matrix of the random effects.  When the variances are small
there is a large penalty.  When the variances are large there is a
small penalty on the size of the random effects.  The measure of model
complexity, which is related to the determinant of the conditional
variance of the random effects, given the data, has the opposite
behavior.  When the variance of the random effects is small the model
is considered simpler.  The simplest possible model on this scale is
one without any random effects at all, corresponding to a variance of
zero.  The larger the variance of the random effects the more complex
the model.

The maximum likelihood criterion seeks a balance between model
complexity and fidelity to the data.  Simple models fit the data less
well than do complex models.

The point to notice, however, is that the penalty on model complexity
applies to the random effects only, not to the fixed effects.  From
this point of view if the model can explain a certain pattern in the
data either with fixed-effects parameters or with random effects there
is an advantage in explaining it with the fixed effects.
A common way for hierarchical models to be fit using ML is by
profiling out the fixed effects, estimating the random effects, and
then using GLS to estimate the fixed effects conditional on the random
effects.  So, any explanatory capacity that the fixed effects offer is
deployed before the random effects are invoked.

Likewise a popular way to applying ReML is to fit the fixed effects
using OLS, then estimate the random effects from the residuals.
Again, the net effect is that any explanatory capacity that the fixed
effects offer is deployed before the random effects are invoked.

Question 2: Take the fixed1 model from Question 1 and compare it to
the null1 model, which has a random subject effect but no fixed
effect. The predicted values of the two models -- the ones from
fitted(), which include the ranefs -- are virtually the same. So why
does fixed1 have a lower deviance, why is it preferred to null1 in a
likelihood ratio test? (Again, I'm not asking why it's a better model.
I'm asking questions about the software, estimation procedure, and/or
theory of likelihood applied to such cases.)
Because the likelihood involves a penalty on the size of the variance
of the random effects.  Obtaining a similar fit (fidelity to the data)
with a much lower variance on the random effects (the penalty on model
complexity) is advantageous under the likelihood criterion.
The deviance is computed from the log likelihood of the data,
conditional on the model.  The LL of the null model is maximized by
making the variance components big enough to cover the variation of
the data.  But, this means that the likelihood is being spread thinly,
as it were.  Eg ...

dnorm(0, sd=1)
[1] 0.3989423
dnorm(0, sd=2)
[1] 0.1994711

On the other hand, fixed1 uses fixed effects to explain a lot of that
variation, so that when the time comes to estimate the random effects,
they are smaller, and the LL is higher, because it doesn't have to
stretch so far.

Cheers,

Andrew

--
Andrew Robinson
Department of Mathematics and Statistics            Tel: +61-3-8344-6410
University of Melbourne, VIC 3010 Australia         Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

understanding log-likelihood/model fit

Thread (13 messages)