Missing values in lmer vs. HLM - R-SIG-mixed-models

Wed, Jul 1, 2015 5:56 PM #

All,

I have a paper in which we are using a within-person model using multi-level modeling. I ran the models in lmer in R, although we had a substantial portion of people for whom at least one observation is still missing. My understanding is that the default is to drop that person entirely (e.g., na.action=na.omit) ?.is that correct? My understanding was that the HLM software (e.g., by SSI) and most other multi-level modeling programs can still run the models based on the remaining observations (e.g., you may have 4 out of 5 observations per person and still be able to run the model).

I would love to know if it is possible to do that in lmer or if some solution is present. For example, is it possible to use FIML in lmer? Advice for handling this situation would be appreciated, as I?m new to lmer!

Best,


Tom Carpenter, Ph.D.
Assistant Professor, Psychology
Seattle Pacific University
3307 3rd Ave W. Suite 107,
Seattle, WA, 98119
tcarpenter at spu.edu<mailto:tcarpenter at spu.edu>
Office: (206) 281-2916
Fax: (206) 281-2695

Douglas Bates

Sat, Jul 4, 2015 9:09 AM #

I think we would need to know more about the structure of the data and the
models that you wish to fit to it before we could be of any assistance.

To be honest, your question doesn't make sense in the context of lmer.  The
data for lmer must be in the "long form".  That is, each observation
corresponds to a row in the data frame.  If one subject has 5 observations
there will be 5 rows for that subject.  If another has only two
observations there will be two rows.  To me you are describing unbalanced
data, not missing data.  In most cases it is more confusing than
illuminating to think of data in the "wide form", with one row for each
subject and multiple columns for the observations, when working with R.

There is no difficulty with working with unbalanced data in lmer.

On Sat, Jul 4, 2015 at 10:42 AM Carpenter, Tom <tcarpenter at spu.edu> wrote:

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Douglas Bates

Sat, Jul 4, 2015 9:18 AM #

By the way, most of us don't know the acronym FIML.  I have a suspicion
that it is one of the many "maximum likelihood" estimators defined in the
multilevel modeling literature.  To a statistician these expressions are
nonsensical.  Once you define the probability model there is only one
possible definition of likelihood and hence only one criterion for the
maximum likelihood estimators to optimize.  Creating a different criterion
and saying that the optimizers of this criterion are the "<whatever>
maximum likelihood" estimators is false advertising.

Having said all this I will admit that the original sin, the "REML"
criterion, was committed by statisticians.  In retrospect I wish that we
had not incorporated that criterion into the nlme and lme4 packages but, at
the time we wrote them, our work would have been dismissed as wrong if our
answers did not agree with SAS PROC MIXED, etc.  So we opted for
bug-for-bug compatibility with existing software.

On Sat, Jul 4, 2015 at 11:09 AM Douglas Bates <bates at stat.wisc.edu> wrote:

I think we would need to know more about the structure of the data and the
models that you wish to fit to it before we could be of any assistance.

To be honest, your question doesn't make sense in the context of lmer.
The data for lmer must be in the "long form".  That is, each observation
corresponds to a row in the data frame.  If one subject has 5 observations
there will be 5 rows for that subject.  If another has only two
observations there will be two rows.  To me you are describing unbalanced
data, not missing data.  In most cases it is more confusing than
illuminating to think of data in the "wide form", with one row for each
subject and multiple columns for the observations, when working with R.

There is no difficulty with working with unbalanced data in lmer.

On Sat, Jul 4, 2015 at 10:42 AM Carpenter, Tom <tcarpenter at spu.edu> wrote:

All,

I have a paper in which we are using a within-person model using
multi-level modeling. I ran the models in lmer in R, although we had a
substantial portion of people for whom at least one observation is still
missing. My understanding is that the default is to drop that person
entirely (e.g., na.action=na.omit) ?.is that correct? My understanding was
that the HLM software (e.g., by SSI) and most other multi-level modeling
programs can still run the models based on the remaining observations
(e.g., you may have 4 out of 5 observations per person and still be able to
run the model).

I would love to know if it is possible to do that in lmer or if some
solution is present. For example, is it possible to use FIML in lmer?
Advice for handling this situation would be appreciated, as I?m new to lmer!

Best,


Tom Carpenter, Ph.D.
Assistant Professor, Psychology
Seattle Pacific University
3307 3rd Ave W. Suite 107,
Seattle, WA, 98119
tcarpenter at spu.edu<mailto:tcarpenter at spu.edu>
Office: (206) 281-2916
Fax: (206) 281-2695













        [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Carpenter, Tom

Sat, Jul 4, 2015 9:43 AM #

Very, very helpful! Thanks.

As a side note, are there good resources you might direct an applied user toward for understanding issues with REML? I had a reviewer recently complain that I had NOT used it in lmer...

Tom Carpenter, Ph.D.
Assistant Professor, Psychology
Seattle Pacific University
3307 3rd Ave W. Suite 107,
Seattle, WA, 98119
tcarpenter at spu.edu<mailto:tcarpenter at spu.edu>
Office: (206) 281-2916
Mobile: (206) 276-1541
Fax: (206) 281-2695

On Jul 4, 2015, at 9:18 AM, Douglas Bates <bates at stat.wisc.edu<mailto:bates at stat.wisc.edu>> wrote:

By the way, most of us don't know the acronym FIML.  I have a suspicion that it is one of the many "maximum likelihood" estimators defined in the multilevel modeling literature.  To a statistician these expressions are nonsensical.  Once you define the probability model there is only one possible definition of likelihood and hence only one criterion for the maximum likelihood estimators to optimize.  Creating a different criterion and saying that the optimizers of this criterion are the "<whatever> maximum likelihood" estimators is false advertising.

Having said all this I will admit that the original sin, the "REML" criterion, was committed by statisticians.  In retrospect I wish that we had not incorporated that criterion into the nlme and lme4 packages but, at the time we wrote them, our work would have been dismissed as wrong if our answers did not agree with SAS PROC MIXED, etc.  So we opted for bug-for-bug compatibility with existing software.

On Sat, Jul 4, 2015 at 11:09 AM Douglas Bates <bates at stat.wisc.edu<mailto:bates at stat.wisc.edu>> wrote:

I think we would need to know more about the structure of the data and the models that you wish to fit to it before we could be of any assistance.

To be honest, your question doesn't make sense in the context of lmer.  The data for lmer must be in the "long form".  That is, each observation corresponds to a row in the data frame.  If one subject has 5 observations there will be 5 rows for that subject.  If another has only two observations there will be two rows.  To me you are describing unbalanced data, not missing data.  In most cases it is more confusing than illuminating to think of data in the "wide form", with one row for each subject and multiple columns for the observations, when working with R.

There is no difficulty with working with unbalanced data in lmer.

On Sat, Jul 4, 2015 at 10:42 AM Carpenter, Tom <tcarpenter at spu.edu<mailto:tcarpenter at spu.edu>> wrote:

All,

I have a paper in which we are using a within-person model using multi-level modeling. I ran the models in lmer in R, although we had a substantial portion of people for whom at least one observation is still missing. My understanding is that the default is to drop that person entirely (e.g., na.action=na.omit) ....is that correct? My understanding was that the HLM software (e.g., by SSI) and most other multi-level modeling programs can still run the models based on the remaining observations (e.g., you may have 4 out of 5 observations per person and still be able to run the model).

I would love to know if it is possible to do that in lmer or if some solution is present. For example, is it possible to use FIML in lmer? Advice for handling this situation would be appreciated, as I'm new to lmer!

Best,


Tom Carpenter, Ph.D.
Assistant Professor, Psychology
Seattle Pacific University
3307 3rd Ave W. Suite 107,
Seattle, WA, 98119
tcarpenter at spu.edu<mailto:tcarpenter at spu.edu><mailto:tcarpenter at spu.edu<mailto:tcarpenter at spu.edu>>
Office: (206) 281-2916
Fax: (206) 281-2695














_______________________________________________
R-sig-mixed-models at r-project.org<mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Karl Ove Hufthammer

Sat, Jul 4, 2015 12:21 PM #

Den 04. juli 2015 18:18, Douglas Bates skreiv:

Hm. I find this statement surprising. I was under the impression REML is 
*preferred* to ML in many situations (e.g. in simple random intercept 
models with few observations for each random intercept), and that *ML 
estimation* may result in severe bias. Do you consider maximising the 
REML criterion as a bug?

Karl Ove Hufthammer

Phillip Alday

Sat, Jul 4, 2015 9:14 PM #

On Sat, 2015-07-04 at 21:21 +0200, Karl Ove Hufthammer wrote:

This was my question as well. My understanding was that REML, like
Bessel's correction for the sample variance, was motivated by bias in
the maximum-likelihood estimator for small numbers of observations. The
corrected estimator is in both cases no longer the MLE, so that the ML
part is bit of a misnomer, but if you take "residualized" expansion of
RE instead of "restricted", then REML seems more like a function of ML
and not a "type" of ML.

IIRC, the default in MixedModels.jl is now ML -- have you changed your
opinion about the utility of REML? Is there some type of weird
paradoxical situation with REML like with Bessel's correction -- the
variance estimates are no longer biased, but the s.d. estimates are? 

Or is the original sin the use of the name REML when REML is no longer
*the* maximum likelihood?

Best,
Phillip Alday

landon hurley

Sat, Jul 4, 2015 9:35 PM #

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 07/05/2015 12:14 AM, Phillip Alday wrote:

I had assumed that he would have responded by now, but it is a holiday
in the US. The position Bates is taking is explained (I think) in his
2010 report
lme4: Mixed effects modelling with R in Section 5.5 `The REML
Criterion', roughly page 123-124 in the pdf [0]. It's a short read, but
the most relevant bit I think is:

best,

landon
[0]:
www.researchgate.net/publictopics.PublicPostFileLoader.html?id=53326f19d5a3f206348b45af&key=6a85e53326f199010f

https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

- -- 
Violence is the last refuge of the incompetent.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (GNU/Linux)

iQIcBAEBCgAGBQJVmLQMAAoJEDeph/0fVJWs21AP/RPMNrQDdkd67eQWfb9jhbTZ
VIfZDNAt088n8XednUzk6BQlVUirb8akqdVq8YqtdaCotdP5dxXjRr30hO72FeRj
rvLWcZXtDVuNwOFZA44Aw2YplwD1sU+G+vMLOsPD4BrBRfByY5FkkX2lTliQjbVK
eYNi57977w/AE7y48OTprwBdkNkWjTQTrnKAsglBtOlnC+x8TdD8J0WfaZCfI1CP
iUN6298pdSNWm+vAWaFCeq6Wig8o2kYGONd1RBDzGidbcy5CiQebuJYZ8cU5zl4V
X34alL6cX+9qDLWbi4jYz1/3lG1U4NsCKhs6fM7imxOFV9XXtsZTr16xmHbkc6B+
daNAbln/SHKoCpuvGiO+IL/H8y8W1DLyJk7sRlb7tOuDHViBCOPLwPZj71aJFFab
qY40JvxdV0rputOeg5OtTyqROxCwtgqKWDaGcli1Dpcrca2aE7qNpPdMjxmnJN+j
RDaCkHflJuwHifkupg7qSfLa2zbBf4KirfGnOsoeicZPF4s7BmDLU28fMlhAqEly
Dbg3niPaW4/X3X69yrkw5XG46uftRGlSUdl/eMWWBUVy3o5oAkAgxpQl/49Md0CX
SyYyv+Iaa9NPtgI828NYfw59xsW98hTnNNtG+D0V4nN/1d29M/KM6LspxV9Nu64b
XNno0S7U16mSu1KhvIwY
=7uyi
-----END PGP SIGNATURE-----

Douglas Bates

Sun, Jul 5, 2015 8:21 AM #

My apologies for making such a statement then not following up.  As has
been mentioned, this is a holiday weekend in the U.S.

The section that Landon quoted does get at the point of my comment.

The usual justification for REML is that REML estimators of variance
components are less biased than are the maximum likelihood estimators
(mle).  On the surface this seems to be a convincing argument, for who
would want to use a "biased" estimator?

But why should we be concerned with the estimator of the variance?  Why not
the estimator of the standard deviation, or the logarithm of the standard
deviation?  The distribution of variance estimators are highly skewed in
most cases.  Consider the simplest case of estimating the variance from an
i.i.d. sample from a Gaussian distribution.  The distribution of the
estimator is a Chi-squared distribution, which is highly skewed.  The
distribution of the estimator of ? is less skewed.  The distribution of the
estimator of log(?) is more-or-less symmetric.

The important point here is that "bias" relates to the expected value of
the estimator.  The argument for REML is based on the expected value of a
quantity with a highly skewed distribution, but we know that this is a poor
measure of location for such a distribution.  That's why it is more
informative to consider median salaries instead of average salaries.  The
fact that the average wealth of members of LeBron James's high school
basketball team is very high doesn't make them all rich.

Mle's have an invariance property in that the mle of ? is the square root
of the mle of ??; the mle of log(??) is the logarithm of the mle of ??,
etc.  Unbiased estimators aren't invariant under transformation.  The
square root of an unbiased estimator of ?? is not an unbiased estimator of
?.

If an unbiased estimator were so important then we should probably consider
the estimate of log(??), not ?? itself.  The reason for our being fixated
on ?? is more computational than practical.  When using hand calculations
it is easiest to estimate ?? then derive an estimate of ? from that.  These
considerations are less convincing when using computers.

In summary, the case for REML is less convincing than it seems at first
glance.  It is a consequence of a certain type of mathematical exposition,
where your assumptions are never questioned.  You only care about going
from "if" to "then".  In mathematical statistics you say, "assuming that
the model is correct, these are the consequences" and that is all there is
to it.  The way that the game is actually played is that, when you get to
the end of the proof and discover that you need some conditions to make it
work, you go back to the beginning and add those conditions.  It helps if
you call this case the "regular" case or the "normal" case or some other
word with favorable connotations.

So if you want to characterize the "best" estimator you do it by peeling
off properties related to the first moment, the second moment, etc. For the
first moment you say that the expected value of the estimator must be equal
to the parameter being estimated and you call that the "unbiased" case.
Technically this is just a mathematical property but the connotation of the
word gives it much more heft than the mathematical property.  In
mathematical statistics it is irrelevant to question why it is this
particular estimator or this particular scale that is of interest - the
only objective is to prove the theorem and publish the result.

(The folklore in our department is that George Box's famous statement about
"all models are wrong" originated in a thesis defense where the candidate
began by stating that "Assuming that the model is correct" and George
interrupted to say "But all models are wrong".  It wasn't a good day for
the candidate.  I'm sorry to say that I don't know if this story is
accurate as I never took the opportunity to ask him.)

On Sat, Jul 4, 2015 at 11:36 PM landon hurley <ljrhurley at gmail.com> wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 07/05/2015 12:14 AM, Phillip Alday wrote:

On Sat, 2015-07-04 at 21:21 +0200, Karl Ove Hufthammer wrote:

Den 04. juli 2015 18:18, Douglas Bates skreiv:

Having said all this I will admit that the original sin, the
"REML" criterion, was committed by statisticians.  In retrospect
I wish that we had not incorporated that criterion into the nlme
and lme4 packages but, at the time we wrote them, our work would
have been dismissed as wrong if our answers did not agree with
SAS PROC MIXED, etc.  So we opted for bug-for-bug compatibility
with existing software.

Hm. I find this statement surprising. I was under the impression
REML is *preferred* to ML in many situations (e.g. in simple
random intercept models with few observations for each random
intercept), and that *ML estimation* may result in severe bias. Do
you consider maximising the REML criterion as a bug?

This was my question as well. My understanding was that REML, like
Bessel's correction for the sample variance, was motivated by bias in
the maximum-likelihood estimator for small numbers of observations.
The corrected estimator is in both cases no longer the MLE, so that
the ML part is bit of a misnomer, but if you take "residualized"
expansion of RE instead of "restricted", then REML seems more like a
function of ML and not a "type" of ML.

IIRC, the default in MixedModels.jl is now ML -- have you changed
your opinion about the utility of REML? Is there some type of weird
paradoxical situation with REML like with Bessel's correction -- the
 variance estimates are no longer biased, but the s.d. estimates
are?


Or is the original sin the use of the name REML when REML is no
longer *the* maximum likelihood?

I had assumed that he would have responded by now, but it is a holiday
in the US. The position Bates is taking is explained (I think) in his
2010 report
lme4: Mixed effects modelling with R in Section 5.5 `The REML
Criterion', roughly page 123-124 in the pdf [0]. It's a short read, but
the most relevant bit I think is:

The argument for preferring ?_R to ?_L as an estimate of ?**2 is
that the numerator in both estimates is the sum of squared
residuals at ? and, although the residual vector, yobs ? X? , is an
n-dimensional vector, the residual at ? satisfies p linearly
independent constraints, X**{T} (yobs ? X? ) = 0. That is, the residual
at ? is the projection of the observed response vector, yobs , into
an (n ? p)-dimensional linear subspace of the n-dimensional response
space. The estimate ?R takes into account the fact that ?**2 is
estimated from residuals that have only n ? p degrees of freedom.

Another argument often put forward for REML estimation is that ?_R is
an unbiased estimate of ?**2 , in the sense that the expected value of
the estimator is equal to the value of the parameter. However,
determining the expected value of an estimator involves integrating
with respect to the density of the estimator and we have seen that
densities of estimators of variances will be skewed, often highly
skewed. It is not clear why we should be interested in the expected
value of a highly skewed estimator. If we were to transform to a
more symmetric scale, such as the estimator of the standard deviation
or the estimator of the logarithm of the standard deviation, the
REML estimator would no longer be unbiased. Furthermore, this
property of unbiasedness of variance estimators does not generalize
from the linear regression model to linear mixed models. This is all
to say that the distinction between REML and ML estimates of
variances and variance components is probably less important that
many people believe.


best,

landon
[0]:

www.researchgate.net/publictopics.PublicPostFileLoader.html?id=53326f19d5a3f206348b45af&key=6a85e53326f199010f

Best, Phillip Alday _______________________________________________
R-sig-mixed-models at r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

John Maindonald

Sun, Jul 5, 2015 5:31 PM #

Quoting Douglas:

So mathematical statistics is a ?game?. That is surely a rather damning comment!
It does however raise important points. My perception is that the situation has
improved greatly from the ?that is all there is to it? typical stance in the published
literature of the 1960s and 1970s. There?s greater pressure to back up theoretical
development with computations with (often, somewhat) real data.

The situation is though uneven. In some parts of the literature though (the literature
on smoothing seems to me particularly rife with this problem) serious issues with the
unreality of iid or at least id (independence) assumptions, for time series and/or
spatial data, are just ignored! That is just one example. [For interpolation, maybe
the iid assumption often makes reasonable sense for spatial data.]

More important than whether the estimator has likelihood in its name, or whether it
is misleading to call it some kind of likelihood estimator, is whether it serves the
intended purpose. Use the median for sure where it makes sense, which incidentally
is neither unbiased nor ML. I do not think that one would get away with quoting
maximum likelihood estimators (eg, for wages and wage differentials in various
sectors) in a set of national account figures. Here, one might be tempted to make
politically fraught comments about the malign effects of highly skew distributions!
Nuf sed.

In many contexts, it is thought important to have numbers that add up. That, after
all is what analysis of variance breakdowns of the mean are all about. Medians do
not work well in that context; they do not give a table that adds up. But of course,
just because one is presenting a table where effects add up to give a grand mean,
distributions that are close to symmetric are important, for conceptual as well as for
theoretical (normality of sampling distribution) reasons. This all becomes a whole
lot more fraught for GLM models.

The main benefit of REML may be that it matches what comes out computationally
to the theory. Does this do serious damage to the numbers that come out, relative
to the way that they will be used? Doug, do your strictures apply also to the t-statistic,
which is a REML type statistic? (One uses an unbiased estimate of the variance in
the denominator.) Or is the issue that it is just a means to an end?

On moment estimators, comments made by Tukey seem to me relevant:
"Do not assume ?that we always know what in fact we never know ? the exact probability structure . . .
No data set is large enough to tell us how it should be analysed.?
[Tukey: More honest foundations for data analysis. Journal of Statistical Planning and Inference,
vol. 57, no. 1, pp. 21-28, 1997]

Nor, I want to add, do we commonly have all the needed background information.

Moment estimators can be a way to get an estimator that applies to a wide class of distributions.
I and many others think this more than sufficient justification for the dispersion estimate that is
widely used in quasi-likelihood computations, notable to GLM models with quasi-Poisson or
quasi-binomial distributions. A key question is of course whether the dispersion might vary
with the mean. And yes, this does make a whole lot more sense if one is working on a scale
where the sampling distribution(s) is(are) symmetric. So perhaps what is wrong with standard
quasi models is that they inflate the variance on a scale where distributions are nothing like
symmetric! What is totally wrong is any failure to adjust for an inflated variance in cases (in
most areas, e.g., ecology, the great majority) where the variance clearly is inflated relative to
the Poisson or binomial. Note that this applies also to glmer models, notably where Poisson
or binomial errors are specified. One can specify an observation level random effect. I suspect
that results are often compromised because of failure to attend to this issue.

In summary, there are some very important issues here, but I do not see that substituting one
mathematical simplification for another is an answer. In the end, we want our models to be
useful, useful I would hope for more than purposes of getting promotion!

John Maindonald email: john.maindonald at anu.edu.au

On 6/07/2015, at 01:21, Douglas Bates <bates at stat.wisc.edu> wrote:

My apologies for making such a statement then not following up.  As has
been mentioned, this is a holiday weekend in the U.S.

The section that Landon quoted does get at the point of my comment.

The usual justification for REML is that REML estimators of variance
components are less biased than are the maximum likelihood estimators
(mle).  On the surface this seems to be a convincing argument, for who
would want to use a "biased" estimator?

But why should we be concerned with the estimator of the variance?  Why not
the estimator of the standard deviation, or the logarithm of the standard
deviation?  The distribution of variance estimators are highly skewed in
most cases.  Consider the simplest case of estimating the variance from an
i.i.d. sample from a Gaussian distribution.  The distribution of the
estimator is a Chi-squared distribution, which is highly skewed.  The
distribution of the estimator of ? is less skewed.  The distribution of the
estimator of log(?) is more-or-less symmetric.

The important point here is that "bias" relates to the expected value of
the estimator.  The argument for REML is based on the expected value of a
quantity with a highly skewed distribution, but we know that this is a poor
measure of location for such a distribution.  That's why it is more
informative to consider median salaries instead of average salaries.  The
fact that the average wealth of members of LeBron James's high school
basketball team is very high doesn't make them all rich.

Mle's have an invariance property in that the mle of ? is the square root
of the mle of ??; the mle of log(??) is the logarithm of the mle of ??,
etc.  Unbiased estimators aren't invariant under transformation.  The
square root of an unbiased estimator of ?? is not an unbiased estimator of
?.

If an unbiased estimator were so important then we should probably consider
the estimate of log(??), not ?? itself.  The reason for our being fixated
on ?? is more computational than practical.  When using hand calculations
it is easiest to estimate ?? then derive an estimate of ? from that.  These
considerations are less convincing when using computers.

In summary, the case for REML is less convincing than it seems at first
glance.  It is a consequence of a certain type of mathematical exposition,
where your assumptions are never questioned.  You only care about going
from "if" to "then".  In mathematical statistics you say, "assuming that
the model is correct, these are the consequences" and that is all there is
to it.  The way that the game is actually played is that, when you get to
the end of the proof and discover that you need some conditions to make it
work, you go back to the beginning and add those conditions.  It helps if
you call this case the "regular" case or the "normal" case or some other
word with favorable connotations.

So if you want to characterize the "best" estimator you do it by peeling
off properties related to the first moment, the second moment, etc. For the
first moment you say that the expected value of the estimator must be equal
to the parameter being estimated and you call that the "unbiased" case.
Technically this is just a mathematical property but the connotation of the
word gives it much more heft than the mathematical property.  In
mathematical statistics it is irrelevant to question why it is this
particular estimator or this particular scale that is of interest - the
only objective is to prove the theorem and publish the result.

(The folklore in our department is that George Box's famous statement about
"all models are wrong" originated in a thesis defense where the candidate
began by stating that "Assuming that the model is correct" and George
interrupted to say "But all models are wrong".  It wasn't a good day for
the candidate.  I'm sorry to say that I don't know if this story is
accurate as I never took the opportunity to ask him.)

On Sat, Jul 4, 2015 at 11:36 PM landon hurley <ljrhurley at gmail.com> wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 07/05/2015 12:14 AM, Phillip Alday wrote:

On Sat, 2015-07-04 at 21:21 +0200, Karl Ove Hufthammer wrote:

Den 04. juli 2015 18:18, Douglas Bates skreiv:

Having said all this I will admit that the original sin, the
"REML" criterion, was committed by statisticians.  In retrospect
I wish that we had not incorporated that criterion into the nlme
and lme4 packages but, at the time we wrote them, our work would
have been dismissed as wrong if our answers did not agree with
SAS PROC MIXED, etc.  So we opted for bug-for-bug compatibility
with existing software.

Hm. I find this statement surprising. I was under the impression
REML is *preferred* to ML in many situations (e.g. in simple
random intercept models with few observations for each random
intercept), and that *ML estimation* may result in severe bias. Do
you consider maximising the REML criterion as a bug?

This was my question as well. My understanding was that REML, like
Bessel's correction for the sample variance, was motivated by bias in
the maximum-likelihood estimator for small numbers of observations.
The corrected estimator is in both cases no longer the MLE, so that
the ML part is bit of a misnomer, but if you take "residualized"
expansion of RE instead of "restricted", then REML seems more like a
function of ML and not a "type" of ML.

IIRC, the default in MixedModels.jl is now ML -- have you changed
your opinion about the utility of REML? Is there some type of weird
paradoxical situation with REML like with Bessel's correction -- the
variance estimates are no longer biased, but the s.d. estimates
are?


Or is the original sin the use of the name REML when REML is no
longer *the* maximum likelihood?

I had assumed that he would have responded by now, but it is a holiday
in the US. The position Bates is taking is explained (I think) in his
2010 report
lme4: Mixed effects modelling with R in Section 5.5 `The REML
Criterion', roughly page 123-124 in the pdf [0]. It's a short read, but
the most relevant bit I think is:

The argument for preferring ?_R to ?_L as an estimate of ?**2 is
that the numerator in both estimates is the sum of squared
residuals at ? and, although the residual vector, yobs ? X? , is an
n-dimensional vector, the residual at ? satisfies p linearly
independent constraints, X**{T} (yobs ? X? ) = 0. That is, the residual
at ? is the projection of the observed response vector, yobs , into
an (n ? p)-dimensional linear subspace of the n-dimensional response
space. The estimate ?R takes into account the fact that ?**2 is
estimated from residuals that have only n ? p degrees of freedom.

Another argument often put forward for REML estimation is that ?_R is
an unbiased estimate of ?**2 , in the sense that the expected value of
the estimator is equal to the value of the parameter. However,
determining the expected value of an estimator involves integrating
with respect to the density of the estimator and we have seen that
densities of estimators of variances will be skewed, often highly
skewed. It is not clear why we should be interested in the expected
value of a highly skewed estimator. If we were to transform to a
more symmetric scale, such as the estimator of the standard deviation
or the estimator of the logarithm of the standard deviation, the
REML estimator would no longer be unbiased. Furthermore, this
property of unbiasedness of variance estimators does not generalize
from the linear regression model to linear mixed models. This is all
to say that the distinction between REML and ML estimates of
variances and variance components is probably less important that
many people believe.


best,

landon
[0]:

www.researchgate.net/publictopics.PublicPostFileLoader.html?id=53326f19d5a3f206348b45af&key=6a85e53326f199010f

Best, Phillip Alday _______________________________________________
R-sig-mixed-models at r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Ken Beath

Sun, Jul 5, 2015 6:24 PM #

There has been a reasonable amount of work on comparing different methods
for random-effects meta-analysis. Using REML does produce better estimates
of the effect size, in terms of coverage than maximum likelihood. Using the
profile likelihood produces better results again, so maybe that is what
should be used. On the other hand it is likely that the distribution of the
random effects isn't normal, so it probably isn't important.

If it is important there are now more general ways of fixing the bias and
coverage, parametric bootstrap should work nicely, so it doesn't seem
useful to use a technique that only has limited application. Maximum
likelihood is applied to mixed effects logistic where it has much the same
problems, and everyone just seems to ignore them.

On 6 July 2015 at 10:31, John Maindonald <john.maindonald at anu.edu.au> wrote:

Quoting Douglas:

. . .  In mathematical statistics you say, "assuming that
the model is correct, these are the consequences" and that is all there

is

to it.  The way that the game is actually played is that, when you get to
the end of the proof and discover that you need some conditions to make

it

work, you go back to the beginning and add those conditions.

So mathematical statistics is a ?game?.  That is surely a rather damning
comment!
It does however raise important points.  My perception is that the
situation has
improved greatly from the ?that is all there is to it? typical stance in
the published
literature of the 1960s and 1970s.  There?s greater pressure to back up
theoretical
development with computations with (often, somewhat) real data.

The situation is though uneven.   In some parts of the literature though
(the literature
on smoothing seems to me particularly rife with this problem) serious
issues with the
unreality of iid or at least id (independence) assumptions, for time
series and/or
spatial data, are just ignored!  That is just one example.  [For
interpolation, maybe
the iid assumption often makes reasonable sense for spatial data.]

More important than whether the estimator has likelihood in its name, or
whether it
is misleading to call it some kind of likelihood estimator, is whether it
serves the
intended purpose.  Use the median for sure where it makes sense, which
incidentally
is neither unbiased nor ML.  I do not think that one would get away with
quoting
maximum likelihood estimators (eg, for wages and wage differentials in
various
sectors) in a set of national account figures.  Here, one might be tempted
to make
politically fraught comments about the malign effects of highly skew
distributions!
Nuf sed.

In many contexts, it is thought important to have numbers that add up.
That, after
all is what analysis of variance breakdowns of the mean are all about.
Medians do
not work well in that context; they do not give a table that adds up.  But
of course,
just because one is presenting a table where effects add up to give a
grand mean,
distributions that are close to symmetric are important, for conceptual as
well as for
theoretical (normality of sampling distribution) reasons.  This all
becomes a whole
lot more fraught for GLM models.

The main benefit of REML may be that it matches what comes out
computationally
to the theory.  Does this do serious damage to the numbers that come out,
relative
to the way that they will be used?  Doug, do your strictures apply also to
the t-statistic,
which is a REML type statistic?  (One uses an unbiased estimate of the
variance in
the denominator.)  Or is the issue that it is just a means to an end?

On moment estimators, comments made by Tukey seem to me relevant:
"Do not assume ?that we always know what in fact we never know ? the exact
probability structure . . .
No data set is large enough to tell us how it should be analysed.?
[Tukey: More honest foundations for data analysis. Journal of Statistical
Planning and Inference,
vol. 57, no. 1, pp. 21-28, 1997]

Nor, I want to add, do we commonly have all the needed background
information.

Moment estimators can be a way to get an estimator that applies to a wide
class of distributions.
I and many others think this more than sufficient justification for the
dispersion estimate that is
widely used in quasi-likelihood computations, notable to GLM models with
quasi-Poisson or
quasi-binomial distributions.  A key question is of course whether the
dispersion might vary
with the mean.  And yes, this does make a whole lot more sense if one is
working on a scale
where the sampling distribution(s) is(are) symmetric.  So perhaps what is
wrong with standard
quasi models is that they inflate the variance on a scale where
distributions are nothing like
symmetric! What is totally wrong is any failure to adjust for an inflated
variance in cases (in
most areas, e.g., ecology, the great majority) where the variance clearly
is inflated relative to
the Poisson or binomial.  Note that this applies also to glmer models,
notably where Poisson
or binomial errors are specified.  One can specify an observation level
random effect. I suspect
that results are often compromised because of failure to attend to this
issue.

In summary, there are some very important issues here, but I do not see
that substituting one
 mathematical simplification for another is an answer.  In the end, we
want our models to be
useful, useful I would hope for more than purposes of getting promotion!

John Maindonald             email: john.maindonald at anu.edu.au

On 6/07/2015, at 01:21, Douglas Bates <bates at stat.wisc.edu> wrote:

My apologies for making such a statement then not following up.  As has
been mentioned, this is a holiday weekend in the U.S.

The section that Landon quoted does get at the point of my comment.

The usual justification for REML is that REML estimators of variance
components are less biased than are the maximum likelihood estimators
(mle).  On the surface this seems to be a convincing argument, for who
would want to use a "biased" estimator?

But why should we be concerned with the estimator of the variance?  Why

not

the estimator of the standard deviation, or the logarithm of the standard
deviation?  The distribution of variance estimators are highly skewed in
most cases.  Consider the simplest case of estimating the variance from

an

i.i.d. sample from a Gaussian distribution.  The distribution of the
estimator is a Chi-squared distribution, which is highly skewed.  The
distribution of the estimator of ? is less skewed.  The distribution of

the

estimator of log(?) is more-or-less symmetric.

The important point here is that "bias" relates to the expected value of
the estimator.  The argument for REML is based on the expected value of a
quantity with a highly skewed distribution, but we know that this is a

poor

measure of location for such a distribution.  That's why it is more
informative to consider median salaries instead of average salaries.  The
fact that the average wealth of members of LeBron James's high school
basketball team is very high doesn't make them all rich.

Mle's have an invariance property in that the mle of ? is the square root
of the mle of ??; the mle of log(??) is the logarithm of the mle of ??,
etc.  Unbiased estimators aren't invariant under transformation.  The
square root of an unbiased estimator of ?? is not an unbiased estimator

of

?.

If an unbiased estimator were so important then we should probably

consider

the estimate of log(??), not ?? itself.  The reason for our being fixated
on ?? is more computational than practical.  When using hand calculations
it is easiest to estimate ?? then derive an estimate of ? from that.

These

considerations are less convincing when using computers.

In summary, the case for REML is less convincing than it seems at first
glance.  It is a consequence of a certain type of mathematical

exposition,

where your assumptions are never questioned.  You only care about going
from "if" to "then".  In mathematical statistics you say, "assuming that
the model is correct, these are the consequences" and that is all there

is

to it.  The way that the game is actually played is that, when you get to
the end of the proof and discover that you need some conditions to make

it

work, you go back to the beginning and add those conditions.  It helps if
you call this case the "regular" case or the "normal" case or some other
word with favorable connotations.

So if you want to characterize the "best" estimator you do it by peeling
off properties related to the first moment, the second moment, etc. For

the

first moment you say that the expected value of the estimator must be

equal

to the parameter being estimated and you call that the "unbiased" case.
Technically this is just a mathematical property but the connotation of

the

word gives it much more heft than the mathematical property.  In
mathematical statistics it is irrelevant to question why it is this
particular estimator or this particular scale that is of interest - the
only objective is to prove the theorem and publish the result.

(The folklore in our department is that George Box's famous statement

about

"all models are wrong" originated in a thesis defense where the candidate
began by stating that "Assuming that the model is correct" and George
interrupted to say "But all models are wrong".  It wasn't a good day for
the candidate.  I'm sorry to say that I don't know if this story is
accurate as I never took the opportunity to ask him.)

On Sat, Jul 4, 2015 at 11:36 PM landon hurley <ljrhurley at gmail.com>

wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 07/05/2015 12:14 AM, Phillip Alday wrote:

On Sat, 2015-07-04 at 21:21 +0200, Karl Ove Hufthammer wrote:

Den 04. juli 2015 18:18, Douglas Bates skreiv:

Having said all this I will admit that the original sin, the
"REML" criterion, was committed by statisticians.  In retrospect
I wish that we had not incorporated that criterion into the nlme
and lme4 packages but, at the time we wrote them, our work would
have been dismissed as wrong if our answers did not agree with
SAS PROC MIXED, etc.  So we opted for bug-for-bug compatibility
with existing software.

Hm. I find this statement surprising. I was under the impression
REML is *preferred* to ML in many situations (e.g. in simple
random intercept models with few observations for each random
intercept), and that *ML estimation* may result in severe bias. Do
you consider maximising the REML criterion as a bug?

This was my question as well. My understanding was that REML, like
Bessel's correction for the sample variance, was motivated by bias in
the maximum-likelihood estimator for small numbers of observations.
The corrected estimator is in both cases no longer the MLE, so that
the ML part is bit of a misnomer, but if you take "residualized"
expansion of RE instead of "restricted", then REML seems more like a
function of ML and not a "type" of ML.

IIRC, the default in MixedModels.jl is now ML -- have you changed
your opinion about the utility of REML? Is there some type of weird
paradoxical situation with REML like with Bessel's correction -- the
variance estimates are no longer biased, but the s.d. estimates
are?


Or is the original sin the use of the name REML when REML is no
longer *the* maximum likelihood?

I had assumed that he would have responded by now, but it is a holiday
in the US. The position Bates is taking is explained (I think) in his
2010 report
lme4: Mixed effects modelling with R in Section 5.5 `The REML
Criterion', roughly page 123-124 in the pdf [0]. It's a short read, but
the most relevant bit I think is:

The argument for preferring ?_R to ?_L as an estimate of ?**2 is
that the numerator in both estimates is the sum of squared
residuals at ? and, although the residual vector, yobs ? X? , is an
n-dimensional vector, the residual at ? satisfies p linearly
independent constraints, X**{T} (yobs ? X? ) = 0. That is, the residual
at ? is the projection of the observed response vector, yobs , into
an (n ? p)-dimensional linear subspace of the n-dimensional response
space. The estimate ?R takes into account the fact that ?**2 is
estimated from residuals that have only n ? p degrees of freedom.

Another argument often put forward for REML estimation is that ?_R is
an unbiased estimate of ?**2 , in the sense that the expected value of
the estimator is equal to the value of the parameter. However,
determining the expected value of an estimator involves integrating
with respect to the density of the estimator and we have seen that
densities of estimators of variances will be skewed, often highly
skewed. It is not clear why we should be interested in the expected
value of a highly skewed estimator. If we were to transform to a
more symmetric scale, such as the estimator of the standard deviation
or the estimator of the logarithm of the standard deviation, the
REML estimator would no longer be unbiased. Furthermore, this
property of unbiasedness of variance estimators does not generalize
from the linear regression model to linear mixed models. This is all
to say that the distinction between REML and ML estimates of
variances and variance components is probably less important that
many people believe.


best,

landon
[0]:

www.researchgate.net/publictopics.PublicPostFileLoader.html?id=53326f19d5a3f206348b45af&key=6a85e53326f199010f

Best, Phillip Alday _______________________________________________
R-sig-mixed-models at r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

*Ken Beath*
Lecturer
Statistics Department
MACQUARIE UNIVERSITY NSW 2109, Australia

Phone: +61 (0)2 9850 8516

Level 2, AHH
http://stat.mq.edu.au/our_staff/staff_-_alphabetical/staff/beath,_ken/

CRICOS Provider No 00002J
This message is intended for the addressee named and may...{{dropped:9}}