
understanding log-likelihood/model fit

13 messages · John Maindonald, Daniel Ezra Johnson, Steven McKinney +3 more

#
Dear All,

I'm sure this is a simple question, but I haven't been able to find an
answer to it that I can understand. I'd like an answer that is pitched
somewhere between the full mathematics on the one hand, and an
oversimplified overview on the other.

One way of putting the question is, what is the difference, really,
between a fixed and a random effect (as they are fit by lmer)?
Another way of putting it involves the following example.

Suppose we have observations of a response from many subjects.
The overall average response is 500.
The subjects fall into two groups.
Half have an effect of +100 and half an effect of -100.

test1 <- data.frame(subject=rep(1:200,each=500),
	response=500+c(rep(-100,50000),rep(100,50000))+rnorm(100000,0,10),
	fixed=(rep(c("A","B"),each=50000)))

The following model treats subject as a random effect:
Linear mixed model fit by REML
Formula: response ~ (1 | subject)
   Data: test1
    AIC    BIC  logLik deviance REMLdev
 746923 746951 -373458   746923  746917
Random effects:
 Groups   Name        Variance Std.Dev.
 subject  (Intercept) 10041.81 100.209
 Residual               100.46  10.023
Number of obs: 100000, groups: subject, 200
Fixed effects:
            Estimate Std. Error t value
(Intercept)  500.000      7.086   70.56
The following model keeps the subject effect and adds the fixed effect:

Linear mixed model fit by REML
Formula: response ~ fixed + (1 | subject)
   Data: test1
    AIC    BIC  logLik deviance REMLdev
 743977 744015 -371984   743960  743969
Random effects:
 Groups   Name        Variance  Std.Dev.
 subject  (Intercept)  0.016654 0.12905
 Residual             99.642120 9.98209
Number of obs: 100000, groups: subject, 200
Fixed effects:
             Estimate Std. Error t value
(Intercept) 400.11806    0.04647    8610
fixedB      199.87485    0.06572    3041

The result is what one would expect, intuitively.
In the model "null" there is a large subject variance.
In the model "fixed" there is virtually no subject variance.
In both models the residuals are the same.
The logLik of the model with the fixed effect is closer to zero (by about 1500).
Therefore, we say the model with the fixed effect fits better.

This makes sense. Instead of 100 subject effects near +100 and 100
near -100, we have virtually no subject effects and the fixed effect
accounts for all the between-subject variance.

The question: why? Why does the model with the fixed effect fit better?
Why does the smaller (zero) random effect plus the fixed effect
translate into an improvement in log-likelihood?

It's not anything to do with the residuals. The models make the same
predictions:
[1] 400.2807 400.2807 400.2807 400.2807 400.2807 600.2013 600.2013 600.2013
 [9] 600.2013 600.2013
[1] 400.1282 400.1282 400.1282 400.1282 400.1282 599.9839 599.9839 599.9839
 [9] 599.9839 599.9839

And I don't think it has anything to do with the extreme non-normality
of the random effects in "null" as opposed to "fixed".

So what's the difference?

What, in terms of model fitting, makes it preferable to account for
the between-subject variation with a fixed effect (as in "fixed")
rather than with a random effect (as in "null")?

Thanks for your help,
Daniel
#
What you observe has everything to do with
'the extreme non-normality of the random effects in "null" as opposed  
to "fixed"'.

The distribution at the subject level is nowhere near normal.
The fixed effects account for a rather severe departure from normality.
For this reason, the between-subjects variance component estimate
is radically different between the two models, as are the subjects'
random effects.  Try

znull <- ranef(null, drop=TRUE)
zfix <- ranef(fixed, drop=TRUE)
plot(zfix$subject ~ znull$subject)

The residuals for the two models are similar, but not at all the same.
The residuals from the null model are surprisingly close to normal.
[Actually, a criticism of the Pinheiro and Bates book is that it relies
far too heavily on plots of residuals for its diagnostic checking.]

Think about how the results might be used.  The analyses that you
give seem to imply that the existence of the two groups is hidden
from you.  You want to generalize to other subjects (the usual
situation), so that fitting subject as a fixed effect would defeat the
whole aim of the enterprise.  So, unless you learn of the existence
of the two groups from initial exploratory analysis of the data (you
darn well should), you have to do the null analysis.

At this late stage you might examine the random subject effects
and belatedly conclude that you have been laboring under a gross
misapprehension.  Or you might charge blindly ahead and use the
model for predictive purposes.  In the latter case, the accuracy will
be terrible (high SE), but, hey, you can make predictions on what to
expect from another subject or group of subjects.

There are strong reasons for trying to ensure that models (1) are
somewhat near correct, and (2) model aspects of the data
that reflect the intended use of the results.

Now try
test2 <- data.frame(subject=rep(1:200,each=500),
	response=500+rep(rnorm(200), each=500)+rnorm(100000,0,10),
	fixed=(rep(c("A","B"),each=50000)))

lmer(response~ (1|subject), data=test2)
lmer(response~ fixed+(1|subject), data=test2)

Wrestling with such simulated data is a good way to gain
understanding and tune one's intuitive processes.

By the way, I believe you should be fitting by maximum likelihood
(ML), for comparing models with different fixed effects.

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
On 20 Aug 2008, at 6:47 AM, Daniel Ezra Johnson wrote:
#
Thanks for your reply.
I thought so too, but it's not correct. The log-likelihood of the
mixed-effects model does not appear to depend on how nearly normal, or
not, the random effect BLUPs are.

That was what led to my original question. When we have a random
effect (subject) and a between-group fixed effect (an "outer" effect),
logically they are competing to account for the same variance.

The mixed model fitting appears to "prefer" the model with the fixed
effect, even though the model with no fixed effect appears to fit the
data equally well (judging by residuals and the fact that the BLUPs
are exactly what they 'should be', no shrinkage here).

I am comfortable with this result, indeed my work depends on it, but I
want to understand better why it comes out this way. This has nothing
to do with practical research design questions.

D
#
On 20 Aug 2008, at 10:19 AM, Daniel Ezra Johnson wrote:
Let's see your example.  If you change the data, however, you will
change the log-likelihood.

John Maindonald
#
On Wed, Aug 20, 2008 at 10:02:16AM +1000, John Maindonald wrote:
John, can you expand on this point?  

Warm wishes,

Andrew
#
I prefer smaller data sets so I can
more readily look at all components;
here is my modification of the modeling
(along with a random seed so we
all get the same data)

You can change the parameter values
for big data sets.


Nsubj <- 10
Ngrp <- 2
NsubjRep <- 5

set.seed(123)

test1s <- data.frame(subject = rep(seq(Nsubj * Ngrp), each = NsubjRep),
        response=500+c(rep(-100,Nsubj * NsubjRep),rep(100,Nsubj * NsubjRep))+rnorm(Nsubj * Ngrp * NsubjRep, 0, 10),
        fixed=(rep(c("A","B"),each=Nsubj * NsubjRep)))

null1 <- lmer(response~(1|subject),test1s)
fixed1 <- lmer(response~fixed+(1|subject),test1s)
summary(null1)
summary(fixed1)

quartz() ## substitute your favourite plot device 
par(mfrow = c(2, 1))
hist(ranef(null1)[[1]][,1])
hist(ranef(fixed1)[[1]][,1])

quartz() ## substitute your favourite plot device 
par(mfrow = c(2, 1))
qqnorm(ranef(null1)[[1]][,1])
qqline(ranef(null1)[[1]][,1])
qqnorm(ranef(fixed1)[[1]][,1])
qqline(ranef(fixed1)[[1]][,1])

### -----

Look at the histogram of BLUPs for model null1.
The strong bimodal structure is a clue that
there may be an underlying fixed effect
(and in fact we know that this is the case,
the beauty of simulations).

The QQ plot of the null1 model also
hints that subjects are not just some
selection from a normal distribution.

The fixed effect (see summary(fixed1)
output) shows a huge t value and the
random effect estimates are near zero
(you'll see this message during the
fitting process too)

Warning message:
In .local(x, ..., value) :
  Estimated variance for factor 'subject' is effectively zero

All of this suggests that a fixed effect (covariate 'fixed')
is appropriate to explain the structure
in the data, and subject need not be modeled
as a random effect.  A linear model without
random effects would suffice here.


Now to run John Maindonald's model
with subjects being drawn from a
normal distribution:

### -----

set.seed(123)
test2s <- data.frame(subject = rep(seq(Nsubj * Ngrp), each = NsubjRep),
        response=500 + rep(rnorm(Nsubj * Ngrp), each = NsubjRep) + rnorm(Nsubj * Ngrp * NsubjRep, 0, 10),
        fixed=(rep(c("A","B"),each=Nsubj * NsubjRep)))

null2 <- lmer(response~(1|subject),test2s)
fixed2 <- lmer(response~fixed+(1|subject),test2s)
summary(null2)
summary(fixed2)

quartz() ## substitute your favourite plot device 
par(mfrow = c(2, 1))
hist(ranef(null2)[[1]][,1])
hist(ranef(fixed2)[[1]][,1])

quartz() ## substitute your favourite plot device 
par(mfrow = c(2, 1))
qqnorm(ranef(null2)[[1]][,1])
qqline(ranef(null2)[[1]][,1])
qqnorm(ranef(fixed2)[[1]][,1])
qqline(ranef(fixed2)[[1]][,1])

### -----

Now the histogram of null2 BLUPs looks
like a unimodal density, and the
QQ plot suggests BLUPs are closer
to normal.  This is much more in line
with the situation that random effects 
components were designed to handle.

The fixed effect in fixed2 (see
summary(fixed2) output) does not
contribute to the model fit, as indicated
by the small t-value, and the random effect 
estimates are not all small numbers near 
zero.  A random effects model with subject
as a random effect is appropriate
here.


So this is a partial look into the
difference between fixed and random
effects.  

A covariate that is more 'degenerate', 
i.e. takes on a few distinct values, so 
has a 'spiky' distribution, is often better
modeled as a fixed effect.

When a covariate really can be imagined
to be a draw from some large non-degenerate
or non-spiky distribution, and you are
not really interested in the individual
values it can assume, but you want to
'model out' its effect (so its effect doesn't
cause some other covariate estimate to be
dragged off course) you can save
model degrees of freedom by modeling
that covariate as a random effect. 

Comparing likelihoods and other
statistics derived therefrom will
guide you in the right direction,
as these simulations illustrate.
The likelihoods do 'depend' on
the structure in the data.  
Statistics work!
 


HTH

Steven McKinney

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: smckinney +at+ bccrc +dot+ ca

tel: 604-675-8000 x7561

BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C. 
V5Z 1L3
Canada




-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org on behalf of John Maindonald
Sent: Tue 8/19/2008 5:25 PM
To: Daniel Ezra Johnson
Cc: r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] understanding log-likelihood/model fit
#
Everyone agrees about what happens here:

Nsubj <- 10
Ngrp <- 2
NsubjRep <- 5
set.seed(123)
test1s <- data.frame(subject = rep(seq(Nsubj * Ngrp), each = NsubjRep),
       response=500+c(rep(-100,Nsubj * NsubjRep),rep(100,Nsubj *
NsubjRep))+rnorm(Nsubj * Ngrp * NsubjRep, 0, 10),
       fixed=(rep(c("A","B"),each=Nsubj * NsubjRep)))
null1 <- lmer(response~(1|subject),test1s)
fixed1 <- lmer(response~fixed+(1|subject),test1s)

I still have two questions which I'll try to restate. I should note
that I have attempted to understand the mathematical details of ML
mixed effect model fitting and it's a bit beyond me. But I hope that
someone can provide an answer I can understand.

Question 1: When you have an "outer" fixed effect and a "subject"
random effect in the same model, specifically why does the model
(apparently) converge in such a way that the fixed effect is maximized
and the random effect minimized? (Not so much why should it, as why
does it? This is the 'fixed1' case.)

Question 2: Take the fixed1 model from Question 1 and compare it to
the null1 model, which has a random subject effect but no fixed
effect. The predicted values of the two models -- the ones from
fitted(), which include the ranefs -- are virtually the same. So why
does fixed1 have a lower deviance, why is it preferred to null1 in a
likelihood ratio test? (Again, I'm not asking why it's a better model.
I'm asking questions about the software, estimation procedure, and/or
theory of likelihood applied to such cases.)

D
#
I'll take a swing at these.
On Wed, Aug 20, 2008 at 02:01:29PM +0100, Daniel Ezra Johnson wrote:
A common way for hierarchical models to be fit using ML is by
profiling out the fixed effects, estimating the random effects, and
then using GLS to estimate the fixed effects conditional on the random
effects.  So, any explanatory capacity that the fixed effects offer is
deployed before the random effects are invoked.  

Likewise, a popular way of applying REML is to fit the fixed effects
using OLS, then estimate the random effects from the residuals.
Again, the net effect is that any explanatory capacity that the fixed
effects offer is deployed before the random effects are invoked.
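
The two-step idea can be made concrete with a schematic base-R sketch (an illustration of the idea only, not lmer's actual algorithm; the object names are invented for the example):

```r
## Two-step sketch: OLS for the fixed effects, then variance components
## from the residuals (illustration only, not lmer's actual algorithm)
set.seed(1)
nsubj <- 20; nrep <- 5
subject <- rep(1:nsubj, each = nrep)
grp <- rep(c("A", "B"), each = nsubj / 2 * nrep)  # subjects 1-10 in A, 11-20 in B
y <- 500 + ifelse(grp == "A", -100, 100) + rnorm(nsubj * nrep, 0, 10)

## Step 1: fit the fixed effects by OLS
res <- resid(lm(y ~ grp))

## Step 2: estimate the variance components from the residuals
within_var  <- mean(tapply(res, subject, var))  # approximates the residual variance
between_var <- max(0, var(tapply(res, subject, mean)) - within_var / nrep)

round(c(within = within_var, between = between_var), 1)
```

Because the fixed effect has already absorbed the group shift in step 1, the between-subject component comes out near zero; repeat step 1 with lm(y ~ 1) and the same calculation gives a between-subject component near 100^2 instead, mirroring Daniel's two fits.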
The deviance is computed from the log likelihood of the data,
conditional on the model.  The LL of the null model is maximized by
making the variance components big enough to cover the variation of
the data.  But, this means that the likelihood is being spread thinly,
as it were.  E.g., compare dnorm(0, 0, 1) with dnorm(0, 0, 2): doubling
the standard deviation halves the density at its center.
[1] 0.3989423
[1] 0.1994711

On the other hand, fixed1 uses fixed effects to explain a lot of that
variation, so that when the time comes to estimate the random effects,
they are smaller, and the LL is higher, because it doesn't have to
stretch so far.
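
The "spread thinly" effect can be seen directly in base R (an illustrative sketch: the same residual-scale variation, evaluated under a matched and an inflated normal density):

```r
## The same variation evaluated under a narrow (matched) and a wide density
set.seed(1)
e <- rnorm(1000, 0, 10)

ll_narrow <- sum(dnorm(e, 0, 10,  log = TRUE))  # sd matched to the variation
ll_wide   <- sum(dnorm(e, 0, 100, log = TRUE))  # sd ten times too large

c(narrow = ll_narrow, wide = ll_wide)
```

The wide density covers every observation, but at a much lower height everywhere, so the summed log-likelihood is far lower; this is the sense in which the null model's large subject variance costs likelihood.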

Cheers,

Andrew
#
On Wed, Aug 20, 2008 at 2:24 PM, Andrew Robinson
<A.Robinson at ms.unimelb.edu.au> wrote:
I approach this from a slightly different point of view than does
Andrew.  I prefer to think of the process of estimating the fixed
effects parameters and the random effects as a penalized estimation
problem where the penalty is applied to the random effects only.  The
magnitude of the penalty depends on the (unconditional) variance
covariance matrix of the random effects.  When the variances are small
there is a large penalty.  When the variances are large there is a
small penalty on the size of the random effects.  The measure of model
complexity, which is related to the determinant of the conditional
variance of the random effects, given the data, has the opposite
behavior.  When the variance of the random effects is small the model
is considered simpler.  The simplest possible model on this scale is
one without any random effects at all, corresponding to a variance of
zero.  The larger the variance of the random effects the more complex
the model.

The maximum likelihood criterion seeks a balance between model
complexity and fidelity to the data.  Simple models fit the data less
well than do complex models.

The point to notice, however, is that the penalty on model complexity
applies to the random effects only, not to the fixed effects.  From
this point of view if the model can explain a certain pattern in the
data either with fixed-effects parameters or with random effects there
is an advantage in explaining it with the fixed effects, because the
likelihood involves a penalty on the size of the variance of the
random effects.  Obtaining a similar fit (fidelity to the data) with a
much lower variance on the random effects (and hence a lower penalty
on model complexity) is advantageous under the likelihood criterion.
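
A back-of-envelope illustration of that asymmetry (my numbers, not anything computed by lmer): suppose a +/-100 shift between two groups of 100 subjects can be attributed either to a fixed-effect coefficient or to the subject random effects. The residual fit is essentially identical either way, so compare only the penalty term, which attaches to the random effects alone:

```r
## Penalty comparison for two accounts of the same +/-100 group shift
nsubj <- 200
shift <- 100

u_fixed  <- rep(0, nsubj)                            # fixed effect absorbs the shift
u_random <- rep(c(-shift, shift), each = nsubj / 2)  # random effects absorb it

c(penalty_fixed = sum(u_fixed^2), penalty_random = sum(u_random^2))
```

The real criterion penalizes the scaled random effects u and also carries the log-determinant (complexity) term, but the sketch shows the point: the fixed-effects account pays no penalty at all.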
#
I find that a very insightful way to think about results from
Daniel's simulated data.  In the case of his null1 model,
failure to adjust for the fixed effect led to a large (huge
relative to the component in fixed1) between subjects
component of variance.  There was, then, a small penalty
for random subject effects whose distribution looked
nothing like normal, and the residuals looked remarkably
normal.  It is an extreme example of a case where the
residuals have scant diagnostic usefulness.

My concern with some of the Pinheiro and Bates plots is
that they look too soon at the residuals, before doing the
grosser checks that are needed to confirm that the model is
half-way correct; and because the points are not joined up
according to subject (or whatever), within-subject systematic
departures from the model are not evident.  I seem to recall finding
an aberrant patient when I looked at the Dialyzer data
some considerable time ago, and/or an aberrant Orange
tree.  In a quick look, I was not immediately able to
recover the state of thinking that led me to this conclusion.
Irrespective of the application to these data, the point that
residuals are one part only of the relevant diagnostic
information, and may indeed not convey much at all of
the relevant information, stands.

John Maindonald
On 21 Aug 2008, at 5:53 AM, Douglas Bates wrote:
#
On 20/08/2008, at 11:01 PM, Daniel Ezra Johnson wrote:
Your generated data has only a fixed effect, so the estimated
random effect is zero. This is the same as if you generated data of
the form y = x: it would be expected to fit with an intercept close
to zero. Someone else supplied an example of adding a random effect,
which, of course, will result in a fitted random effect greater
than zero.
The residual variance is slightly lower for the second model,
explaining the better fit. Knowing about the fixed effects helps, but
not much.

The fitted are similar, but there is reasonable variation. It will  
always be better (in terms of likelihood and as a rule) to fit the  
correct model. What is nice is that the model with only a random  
effect gives sensible results, so in many situations I don't need to  
know why the clusters vary and departures from normality don't seem to  
matter much.

Ken
#
Thank you all for your help.

I'm now referring back to the discussion in Chapter 2 of Pinheiro and
Bates and understanding this much better.
Well, a little better.

In the figures on pp. 73-74, the middle panels (log-residual norm)
seem to illustrate what Douglas Bates has described here as fidelity
to the data.  And the bottom panels (log-determinant ratio) seem to
illustrate the measure of model complexity.
In these charts, as you move all the way to the right, in the limit,
the values of Delta and theta are maximized, which I believe means the
random effect variance goes to zero (with respect to the residual
variance).

As you move to the left, your model complexity gets worse, but your
model fidelity improves for a time, and that's where you get the
maximum log-likelihood (top panel).

If theta going to infinity represents zero random effects, could you
say that theta going to zero represents random effects that are no
longer distinguishable from fixed effects?

D
#
On Thu, Aug 21, 2008 at 7:35 AM, Daniel Ezra Johnson
<danielezrajohnson at gmail.com> wrote:
Yes, in the sense that the residual sum of squares is what you would
obtain from a fixed-effects fit where the model matrix is the [X,Z]
where X is the model matrix for the fixed-effects parameters and Z is
the model matrix for the random effects.  The parameter estimates for
this model may not be well defined because the model could be
overparameterized.  However, the residual sum of squares for this
model is well-defined, as are the predictions and the residuals.

You could produce a similar plot for an lmer model fit.  Consider the
Dyestuff data in current versions of the lme4 package. The REML
criterion is similar to the maximum likelihood but it involves another
term so let's just consider the deviance and not the REML deviance.
  0:     327.33176: 0.730297
  1:     327.33082: 0.772944
  2:     327.32706: 0.753271
  3:     327.32706: 0.752556
  4:     327.32706: 0.752581
Linear mixed model fit by maximum likelihood
Formula: Yield ~ 1 + (1 | Batch)
   Data: Dyestuff
   AIC   BIC logLik deviance REMLdev
 333.3 337.5 -163.7    327.3   319.7
Random effects:
 Groups   Name        Variance Std.Dev.
 Batch    (Intercept) 1388.1   37.258
 Residual             2451.3   49.511
Number of obs: 30, groups: Batch, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1527.50      17.69   86.33

The verbose output gives the profiled deviance at each iteration as a
function of a single parameter, which is the ratio of the standard
deviation of the random effects to the residual standard deviation.
[1] 0.7525196

This parameter is stored in the ST slot.  The slot is a list with
elements corresponding to the random effects terms in the model.  In
this model there is only one random effects term, (1|Batch).  In
general the elements of the list are square matrices where the number
of rows and columns corresponds to the number of random effects from
that term for each level of the grouping factor.  In this case there
is only one random effect per level of the factor.
[[1]]
            (Intercept)
(Intercept)   0.7525135

The deviance slot contains the deviance itself and several of its components.
ML         REML         ldL2        ldRX2      sigmaML    sigmaREML
  327.327060   319.725920     8.059354     2.057972    49.510754    50.357153
       pwrss         disc         usqr         wrss
73539.442307 62669.199627 10870.242680 62669.199627

For a linear mixed model, the discrepancy (disc) at the current
parameter values is equal to the weighted residual sum of squares
(wrss).  The random effects, b, are linearly transformed to u, for
which the unconditional distribution is a "spherical"
Gaussian distribution.  (i.e. the variance-covariance matrix is
\sigma^2 times the q by q identity matrix, where \sigma^2 is the same
scalar variance as in the definition of the conditional distribution
of Y|B).  The penalty on the size of the random effects is therefore
the squared length of u (usqr).  The penalized, weighted residual sum
of squares is the sum of wrss and usqr.  The conditional estimate of
\sigma^2 is pwrss/n and sigmaML is its square root.
[1] TRUE
n
30
[1] TRUE


The log-determinant of the conditional variance-covariance of the
random effects, given the data (i.e. B|Y) is, up to the scale factor
of \sigma, ldL2.  The profiled deviance is calculated as shown in
slide 132 of http://www.stat.wisc.edu/~bates/PotsdamGLMM/LMMD.pdf

The actual code in the C function lmm_update_fixef_u in lme4/src/lmer.c is

	d[ML_POS] = d[ldL2_POS] +
	    dn * (1 + log(d[pwrss_POS]) + log(2 * PI / dn));

or, in R,
[1] 327.3271
[1] TRUE
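
For the record, the same formula can be evaluated directly from the printed slot values (dn is the number of observations):

```r
## Profiled ML deviance recomputed from the slot values printed earlier
ldL2  <- 8.059354
pwrss <- 73539.442307
dn    <- 30

ml_dev <- ldL2 + dn * (1 + log(pwrss) + log(2 * pi / dn))
ml_dev  # approximately 327.3271, the ML entry of the deviance slot
```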

Next I was going to show how to change the value of the ST parameter
and update the deviance slot but in the process I discovered that one
of the C functions (the aforementioned lmm_update_fixef_u) cannot be
called from R.  After changing that and making a new release I can
show how to create a similar figure in the new formulation.