Please note that I have no problem with this.
Many thanks,
Carlos Fam?lia
On 3 Oct 2016, at 14:19, Fox, John <jfox at mcmaster.ca
<mailto:jfox at mcmaster.ca> > wrote:
Dear Carlos,
If I understand properly what's troubling you and the nature of the
suggestion, I wouldn't do it -- that is, randomly discard data to get
one observation per case.
Paul Johnson explained clearly why the pattern you noticed in the
residuals vs. fitted values plot occurred, as a consequence of
shrinkage. One way of thinking about this is that using a mixed model is
*more* important when you have few observations per case, where
shrinkage will be greater, than when you have many observations per
case, where the "estimates" of the random effects will nearly coincide
with the case means (or in a more complex model, within-case
regressions).
Best,
John
-----------------------------
John Fox, Professor
McMaster University
Hamilton, Ontario
Canada L8S 4M4
Web: socserv.mcmaster.ca/jfox <http://socserv.mcmaster.ca/jfox>
-----Original Message-----
From: R-sig-mixed-models [mailto:r-sig-mixed-models-
bounces at r-project.org]
On Behalf Of Carlos Familia
Sent: October 3, 2016 8:20 AM
To: Thierry Onkelinx <thierry.onkelinx at inbo.be
<mailto:thierry.onkelinx at inbo.be> >
Cc: r-sig-mixed-models at r-project.org <mailto:r-sig-mixed-
models at r-project.org>
Subject: Re: [R-sig-ME] Model diagnostics show slope in
residuals plot and
slope on the observed vs fitted plot is different than y = x
Do you think this approach is sound and easily justifiable in
a paper?
Many thanks,
Carlos
On 3 Oct 2016, at 13:10, Thierry Onkelinx
<thierry.onkelinx at inbo.be <mailto:thierry.onkelinx at inbo.be> > wrote:
In case you have more than one measurement per ID,
select one of them
at random. Something like
library(dplyr)
df %>%
group_by(id) %>%
sample_n(1)
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research
Institute for Nature
and Forest team Biometrie & Kwaliteitszorg / team
Biometrics & Quality
Assurance Kliniekstraat 25
1070 Anderlecht
Belgium
To call in the statistician after the experiment is done
may be no
more than asking him to perform a post-mortem
examination: he may be
able to say what the experiment died of. ~ Sir Ronald
Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner The
combination of
some data and an aching desire for an answer does not
ensure that a
reasonable answer can be extracted from a given body of
data. ~ John
Tukey
2016-10-03 14:06 GMT+02:00 Carlos Familia
<carlosfamilia at gmail.com <mailto:carlosfamilia at gmail.com>
<mailto:carlosfamilia at gmail.com>>:
Dear Thierry,
When you say sampling the data to have only one
observation per ID, you
mean reducing the dataset to cases where the Y variable was
only measured in
1 condition?
I have though about going lm with this, but it just
didn?t feel wright...
Many thanks,
Carlos
On 3 Oct 2016, at 13:02, Thierry Onkelinx
<thierry.onkelinx at inbo.be <mailto:thierry.onkelinx at inbo.be>
<mailto:thierry.onkelinx at inbo.be>> wrote:
Dear Carlos,
I concur with Paul. After play a bit with the data
you send me privately, I see
a few things which cause problems:
1) the number of measurements per ID is low. 1/3
has one measurement in
each level of C, 1/3 in two out of three levels of C and 1/3
in only one level of C.
2) the variance of ID is larger than the residual
variance
3) the effect of W is small compared to the
variance of ID
If possible try to add extra covariates. If not I'd
fall back on a simple lm.
Either with ignoring the repeated measurements or by sampling
the data so
you have only one observation per ID.
Best regards,
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research
Institute for
Nature and Forest team Biometrie & Kwaliteitszorg /
team Biometrics &
Quality Assurance Kliniekstraat 25
1070 Anderlecht
Belgium
To call in the statistician after the experiment is
done may be no
more than asking him to perform a post-mortem
examination: he may be
able to say what the experiment died of. ~ Sir
Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination
of some data and an aching desire for an answer
does not ensure that
a reasonable answer can be extracted from a given
body of data. ~
John Tukey
2016-10-03 11:17 GMT+02:00 Paul Johnson
<paul.johnson at glasgow.ac.uk <mailto:paul.johnson at glasgow.ac.uk>
<mailto:paul.johnson at glasgow.ac.uk>>:
Hi Carlos,
Your plot didn?t come through, as Thierry noted.
However it?s expected that,
unlike a standard linear regression model, an LMM with a
nested structure
such as yours will give a positive linear relationship
between the residuals and
the fitted values (might also be true for other structures?),
provided the
residual and random effect variances are > 0. Somebody will
hopefully chip in
with a formal explanation, but it?s basically a similar
phenomenon to
regression to the mean.
Imagine a group of students taking each taking an
aptitude test three times.
There are two random factors: the difference in underlying
aptitude between
the students, modelled by the student ID random effect; and
random variation
between time points within each student (e.g. how good a
particular student is
feeling on a particular day). I?m ignoring variation between
tests ? let?s
unrealistically assume they?s all the same and students
completely forget about
them between tests.
The students with the best mean scores will be a
mixture of excellent
students having three so-so (some good, some bad) days, and
moderately good
students having the good luck to have three good days, and
the very best
scores will come from students who were both excellent and
lucky (although
this category will be small). An important point is that
there is no way of using
the data to separate the moderate-student-lucky-days high
scores from the
excellent-student-average-days scores. If we simply took the
mean of the
scores, we would be overestimating the performance of the
students on
average (we?d have good estimates of the excellent students
and
overestimates of the moderate ones), so the best estimate is
achieved by
shrinking the scores towards the mean.
This is what happens when the model is fitted. Each
student is given a
residual (random effect) at the student level (how good the
student is relative
to the value predicted by the fixed effects) and three
residuals at the
observation (between-test-within-student) level. For students
with good
scores, this will be a compromise between the inseparable
excellent-student-
average-days scenario and the moderate-student-lucky-days
scenario. As a
result, students with high student-level residuals (the
student random effects)
will also tend to have high inter-test residuals. The same is
also true in negative
for poor students and students having three bad days. So the
student random
effects (which are part of the fitted values) and the
residuals will be positively
correlated.
You can check this using by simulating new data
from the fitted model re-
fitting the model, and comparing the residuals-x-fitted plot
(which will be
"perfect?) to the one from your data. Here?s a function that
does this for lme4
fits:
devtools::install_github("pcdjohnson/GLMMmisc")
library(GLMMmisc)
sim.residplot(fit) # repeat this a few times to
account for sampling
error
If all is well, you should see a similar slope
between the real and the
simulated plots, in fact the general pattern of the residuals
should be similar.
(The new package DHARMa ?
https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-
packa
ge-for-residual-diagnostics-of-glmms/
<https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-
pack
age-for-residual-diagnostics-of-glmms/> ? takes a
similar approach to
assessing residuals, but in a less quick-and-dirty,
more formally
justified way.)
All the best,
Paul
On 2 Oct 2016, at 16:57, Carlos Familia
<carlosfamilia at gmail.com <mailto:carlosfamilia at gmail.com>
<mailto:carlosfamilia at gmail.com>> wrote:
Hello,
I have in hands a quite large and unbalanced
dataset, for which a Y
continuous dependent variable was measured in 3 different
conditions (C) for
about 3000 subjects (ID) (although, not all subjects have Y
values for the 3
conditions). Additionally, there is continuous measure W
which was measured
for all subjects.
I am interested in testing the following:
- Is the effect of W significant overall
- Is the effect of W significant at each
level of C
- Is the effect of C significant
In order to try to answer this, I have
specified the following model with
lmer:
lmer( Y ~ W * C + (1 | ID), data = df)
Which seems to proper reflect the structure
of the data (I might be wrong
here, any suggestions would be welcome).
However when running the diagnostic plots I
noticed a slope in the
residuals plot and a slope different than y = x for the
observed vs fitted plot (as
shown bellow). Which made me question the validity of the
model for
inference.
Could I still use this model for inference?
Should I specify a different
formula? Should I turn to lme and try to include different
variances for each
level of conditions (C)? Any ideas?
I would be really appreciated if anyone could
help me with this.
Thanks in advance,
Carlos Fam?lia