Model diagnostics show slope in residuals plot and slope on the observed vs fitted plot is different than y = x

Dear Carlos,
-----Original Message-----
From: Carlos Familia [mailto:carlosfamilia at gmail.com]
Sent: Monday, October 3, 2016 2:03 PM
To: Fox, John <jfox at mcmaster.ca>
Cc: r-sig-mixed-models at r-project.org; Thierry Onkelinx
<thierry.onkelinx at inbo.be>
Subject: Re: [R-sig-ME] Model diagnostics show slope in residuals plot
and slope on the observed vs fitted plot is different than y = x

Dear John,

Do you suggest then that I should get a different linear regression
model per condition, and instead of analysing the overall effect of W
across conditions, analyse it for each condition separately?
No, I'm not recommending that, or anything in particular beyond the observation that randomly discarding data in order to fit a fixed effect model to the remaining independent observations is probably not a good idea. I also don't see what's to be gained by fitting separate models to the conditions; if you expect interactions with condition, why not model them? I don't think that it would be responsible for me to give you statistical advice by email knowing next to nothing about your research.
Do you think this would be fine for a referee?
I have no idea. I probably wouldn't feel that I could predict an unknown referee's response to your work even if I were in a position to recommend what you should do.

I'm sorry that I can't be more helpful.

Best,
 John
Please note that I have no problem with this.

Many thanks,
Carlos Fam?lia

	On 3 Oct 2016, at 14:19, Fox, John <jfox at mcmaster.ca
<mailto:jfox at mcmaster.ca> > wrote:

	Dear Carlos,

	If I understand properly what's troubling you and the nature of the
suggestion, I wouldn't do it -- that is, randomly discard data to get
one observation per case.

	Paul Johnson explained clearly why the pattern you noticed in the
residuals vs. fitted values plot occurred, as a consequence of
shrinkage. One way of thinking about this is that using a mixed model is
*more* important when you have few observations per case, where
shrinkage will be greater, than when you have many observations per
case, where the "estimates" of the random effects will nearly coincide
with the case means (or in a more complex model, within-case
regressions).

	Best,
	John

	-----------------------------
	John Fox, Professor
	McMaster University
	Hamilton, Ontario
	Canada L8S 4M4
	Web: socserv.mcmaster.ca/jfox <http://socserv.mcmaster.ca/jfox>

		-----Original Message-----
		From: R-sig-mixed-models [mailto:r-sig-mixed-models-
bounces at r-project.org]
		On Behalf Of Carlos Familia
		Sent: October 3, 2016 8:20 AM
		To: Thierry Onkelinx <thierry.onkelinx at inbo.be
<mailto:thierry.onkelinx at inbo.be> >
		Cc: r-sig-mixed-models at r-project.org <mailto:r-sig-mixed-
models at r-project.org>
		Subject: Re: [R-sig-ME] Model diagnostics show slope in
residuals plot and
		slope on the observed vs fitted plot is different than y = x

		Do you think this approach is sound and easily justifiable in
a paper?

		Many thanks,
		Carlos

			On 3 Oct 2016, at 13:10, Thierry Onkelinx
<thierry.onkelinx at inbo.be <mailto:thierry.onkelinx at inbo.be> > wrote:

			In case you have more than one measurement per ID,
select one of them
			at random. Something like

			library(dplyr)
			df %>%
			group_by(id) %>%
			sample_n(1)

			ir. Thierry Onkelinx
			Instituut voor natuur- en bosonderzoek / Research
Institute for Nature
			and Forest team Biometrie & Kwaliteitszorg / team
Biometrics & Quality
			Assurance Kliniekstraat 25
			1070 Anderlecht
			Belgium

			To call in the statistician after the experiment is done
may be no
			more than asking him to perform a post-mortem
examination: he may be
			able to say what the experiment died of. ~ Sir Ronald
Aylmer Fisher
			The plural of anecdote is not data. ~ Roger Brinner The
combination of
			some data and an aching desire for an answer does not
ensure that a
			reasonable answer can be extracted from a given body of
data. ~ John
			Tukey

			2016-10-03 14:06 GMT+02:00 Carlos Familia
<carlosfamilia at gmail.com <mailto:carlosfamilia at gmail.com>

		<mailto:carlosfamilia at gmail.com>>:

			Dear Thierry,

			When you say sampling the data to have only one
observation per ID, you

		mean reducing the dataset to cases where the Y variable was
only measured in
		1 condition?

			I have though about going lm  with this, but it just
didn?t feel wright...

			Many thanks,
			Carlos

				On 3 Oct 2016, at 13:02, Thierry Onkelinx
<thierry.onkelinx at inbo.be <mailto:thierry.onkelinx at inbo.be>

		<mailto:thierry.onkelinx at inbo.be>> wrote:

				Dear Carlos,

				I concur with Paul. After play a bit with the data
you send me privately, I see

		a few things which cause problems:

				1) the number of measurements per ID is low. 1/3
has one measurement in

		each level of C, 1/3 in two out of three levels of C and 1/3
in only one level of C.

				2) the variance of ID is larger than the residual
variance
				3) the effect of W is small compared to the
variance of ID

				If possible try to add extra covariates. If not I'd
fall back on a simple lm.

		Either with ignoring the repeated measurements or by sampling
the data so
		you have only one observation per ID.

				Best regards,

				ir. Thierry Onkelinx
				Instituut voor natuur- en bosonderzoek / Research
Institute for
				Nature and Forest team Biometrie & Kwaliteitszorg /
team Biometrics &
				Quality Assurance Kliniekstraat 25
				1070 Anderlecht
				Belgium

				To call in the statistician after the experiment is
done may be no
				more than asking him to perform a post-mortem
examination: he may be
				able to say what the experiment died of. ~ Sir
Ronald Aylmer Fisher
				The plural of anecdote is not data. ~ Roger Brinner
The combination
				of some data and an aching desire for an answer
does not ensure that
				a reasonable answer can be extracted from a given
body of data. ~
				John Tukey

				2016-10-03 11:17 GMT+02:00 Paul Johnson
<paul.johnson at glasgow.ac.uk <mailto:paul.johnson at glasgow.ac.uk>

		<mailto:paul.johnson at glasgow.ac.uk>>:

				Hi Carlos,

				Your plot didn?t come through, as Thierry noted.
However it?s expected that,

		unlike a standard linear regression model, an LMM with a
nested structure
		such as yours will give a positive linear relationship
between the residuals and
		the fitted values (might also be true for other structures?),
provided the
		residual and random effect variances are > 0. Somebody will
hopefully chip in
		with a formal explanation, but it?s basically a similar
phenomenon to
		regression to the mean.

				Imagine a group of students taking each taking an
aptitude test three times.

		There are two random factors: the difference in underlying
aptitude between
		the students, modelled by the student ID random effect; and
random variation
		between time points within each student (e.g. how good a
particular student is
		feeling on a particular day). I?m ignoring variation between
tests ? let?s
		unrealistically assume they?s all the same and students
completely forget about
		them between tests.

				The students with the best mean scores will be a
mixture of excellent

		students having three so-so (some good, some bad) days, and
moderately good
		students having the good luck to have three good days, and
the very best
		scores will come from students who were both excellent and
lucky (although
		this category will be small). An important point is that
there is no way of using
		the data to separate the moderate-student-lucky-days high
scores from the
		excellent-student-average-days scores. If we simply took the
mean of the
		scores, we would be overestimating the performance of the
students on
		average (we?d have good estimates of the excellent students
and
		overestimates of the moderate ones), so the best estimate is
achieved by
		shrinking the scores towards the mean.

				This is what happens when the model is fitted. Each
student is given a

		residual (random effect) at the student level (how good the
student is relative
		to the value predicted by the fixed effects) and three
residuals at the
		observation (between-test-within-student) level. For students
with good
		scores, this will be a compromise between the inseparable
excellent-student-
		average-days scenario and the moderate-student-lucky-days
scenario. As a
		result, students with high student-level residuals (the
student random effects)
		will also tend to have high inter-test residuals. The same is
also true in negative
		for poor students and students having three bad days. So the
student random
		effects (which are part of the fitted values) and the
residuals will be positively
		correlated.

				You can check this using by simulating new data
from the fitted model re-

		fitting the model, and comparing the residuals-x-fitted plot
(which will be
		"perfect?) to the one from your data. Here?s a function that
does this for lme4
		fits:

				devtools::install_github("pcdjohnson/GLMMmisc")
				library(GLMMmisc)
				sim.residplot(fit) # repeat this a few times to
account for sampling
				error

				If all is well, you should see a similar slope
between the real and the

		simulated plots, in fact the general pattern of the residuals
should be similar.

				(The new package DHARMa ?

	https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-
packa
				ge-for-residual-diagnostics-of-glmms/

	<https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-
pack
				age-for-residual-diagnostics-of-glmms/> ? takes a
similar approach to
				assessing residuals, but in a less quick-and-dirty,
more formally
				justified way.)

				All the best,
				Paul

					On 2 Oct 2016, at 16:57, Carlos Familia
<carlosfamilia at gmail.com <mailto:carlosfamilia at gmail.com>

		<mailto:carlosfamilia at gmail.com>> wrote:

					Hello,

					I have in hands a quite large and unbalanced
dataset, for which a Y

		continuous dependent variable was measured in 3 different
conditions (C) for
		about 3000 subjects (ID) (although, not all subjects have Y
values for the 3
		conditions). Additionally, there is continuous measure W
which was measured
		for all subjects.

					I am interested in testing the following:

					- Is the effect of W significant overall
					- Is the effect of W significant at each
level of C
					- Is the effect of C significant

					In order to try to answer this, I have
specified the following model with

		lmer:

					lmer( Y ~ W * C + (1 | ID), data = df)

					Which seems to proper reflect the structure
of the data (I might be wrong

		here, any suggestions would be welcome).

					However when running the diagnostic plots I
noticed a slope in the

		residuals plot and a slope different than y = x for the
observed vs fitted plot (as
		shown bellow). Which made me question the validity of the
model for
		inference.

					Could I still use this model for inference?
Should I specify a different

		formula? Should I turn to lme and try to include different
variances for each
		level of conditions (C)? Any ideas?

					I would be really appreciated if anyone could
help me with this.

					Thanks in advance,
					Carlos Fam?lia

	_______________________________________________
					R-sig-mixed-models at r-project.org <mailto:R-
sig-mixed-models at r-project.org>
					<mailto:R-sig-mixed-models at r-project.org>
mailing list
					https://stat.ethz.ch/mailman/listinfo/r-sig-
mixed-models
					<https://stat.ethz.ch/mailman/listinfo/r-sig-
mixed-models>

				_______________________________________________
				R-sig-mixed-models at r-project.org <mailto:R-sig-
mixed-models at r-project.org>
				<mailto:R-sig-mixed-models at r-project.org> mailing
list
				https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-
models
				<https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-
models>

		[[alternative HTML version deleted]]

		_______________________________________________
		R-sig-mixed-models at r-project.org <mailto:R-sig-mixed-
models at r-project.org>  mailing list
		https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Model diagnostics show slope in residuals plot and slope on the observed vs fitted plot is different than y = x

Thread (18 messages)