model for clustered longitudinal binary data

Adrien Combaz <Adrien.Combaz at ...> writes:
> Dear list members,
> [snip]
> I measure a longitudinal binary outcome (correctness of detection,
> 0: incorrect, 1: correct) with respect to 5 different experimental
> conditions (1 baseline and 4 treatments). The outcome is always
> measured at the same 10 time points. Each of the 9 subjects
> participated in all 5 conditions.  Additionally, for each subject
> and condition, the experiment was replicated 36 times. I therefore
> end up with 9*5*36=1620 binary longitudinal series (= trials of 10
> points each).
> My aim is to assess the influence of the experimental condition on
> my binary outcome. I need to build a model that takes into
> account the correlation along time within a given trial and the
> correlation among trials for a given subject.

Correlation among trials for a given subject should be straightforward;
correlation along time within a given trial may be difficult (see below).

> I am considering a 3-level logistic model where 10 consecutive
> binary measurements (level 1) are obtained on replicates (level 2)
> which are clustered into subjects (level 3). My only level 1
> covariate would be the time of measurement (ordinal factor, T = 1,
> ..., 10) and as level 2 covariate, I consider the experimental
> condition. I don't consider any level 3 covariate per se, but still
> want the model to account for between-subject variability.

This all seems reasonable.  If you really want time to be treated
as ordinal, you'll want to look at the clmm function from the 'ordinal'
package.  In most R modeling packages you don't need to state
explicitly which levels the covariates are measured at (but keeping
track of it is of course useful for thinking about issues of
identifiability, etc.)

A simple model would be something like

 response ~ time + expcond + (1|rep/sub)

As a more complete model you could consider

 response ~ time + expcond + (time|rep/sub) + (expcond|sub)
Thanks Ben for your reply,
> Correlation among trials for a given subject should be straightforward,
> correlation along time for a given trial may be difficult (see below).

Yes, this is my main issue.

> This all seems reasonable.  If you really want time to be treated as ordinal,
> you'll want to look at the clmm function from the 'ordinal'
> package.  In most R modeling packages you don't need to state explicitly
> which levels the covariates are measured at (but keeping track of it is of
> course useful for thinking about issues of identifiability, etc.)

I am not sure I understand how I can use the clmm function. I am not
familiar with it, but from what I could read, it is used to fit
cumulative link models for an ordinal response variable, whereas in my
case time is not the response variable but a factor (and my response
variable is binary).

I preferred to treat time as a discrete factor rather than a continuous
variable for 2 reasons:
1) it represents a number of cycles, which is discrete and ordered by
   nature
2) on average, the correctness (on the logit scale) increases with time,
   but the relationship is nonlinear. If I used time as a continuous
   variable, I would have to choose an adequate transformation to obtain
   a linear relationship, which can be very subjective. Since my main
   objective is to study the influence of the experimental condition, I
   didn't really want to go there.

> A simple model would be something like
>
>  response ~ time + expcond + (1|rep/sub)

I tried something like that with the lmer function; the only difference
is that my random effect was (1|sub/rep). I thought that this was the
proper syntax for replicates nested within subjects, giving a random
intercept for each subject and for each replicate within subject. Am I
missing something?

> As a more complete model you could consider
>
>  response ~ time + expcond + (time|rep/sub) + (expcond|sub)

With such a model, where expcond is also used to define the random-effect
structure, can I use the anova function to compare it to the following
"null model":

 response ~ time + (time|rep/sub) + (expcond|sub)

and make a statement on the significance of the effect of the
experimental condition?

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Adrien Combaz <Adrien.Combaz at ...> writes:

[snip]
>> Correlation among trials for a given subject should be straightforward,
>> correlation along time for a given trial may be difficult (see below).
> Yes, this is my main issue.

I forgot to say that unless you are explicitly interested
in the estimated correlation structure, you could hope to get
around this by fitting the model without correlation and then
showing that the temporal autocorrelation in the residuals is
negligible ....

> I am not sure I understand how I can use the clmm function. I am
> not familiar with it, but from what I could read, it is used to fit
> cumulative link models for an ordinal response variable, while in my
> case time is not the response variable but a factor (and my response
> variable is binary).

You're right, my bad.  The only difference between ordered and
unordered factors in the standard R approach to model-fitting is
that by default, treatment contrasts are used for unordered and
orthogonal polynomial contrasts are used for ordered factors.  Another
perhaps underused option is to specify successive-differences
contrasts, using the contr.sdif() function in the MASS package.
None of these will make a difference in the overall complexity or
fit of the model, just in the interpretation of the parameters.
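
The point about contrasts can be checked with a small sketch. The data
below are simulated, the variable names (t_trt, t_poly, t_sdif) are made
up for illustration, and a plain glm without random effects is used only
to keep the example short; it is not the model discussed in this thread.

```r
# Simulated data (hypothetical): 10 time points, success probability
# rising with time.
set.seed(1)
d <- data.frame(time = rep(1:10, times = 200))
d$response <- rbinom(nrow(d), size = 1, prob = plogis(-1 + 0.3 * d$time))

# The same factor under three codings.
d$t_trt  <- factor(d$time)                   # unordered: treatment contrasts by default
d$t_poly <- factor(d$time, ordered = TRUE)   # ordered: polynomial contrasts by default
d$t_sdif <- factor(d$time)
contrasts(d$t_sdif) <- MASS::contr.sdif(10)  # successive differences between levels

fit_trt  <- glm(response ~ t_trt,  family = binomial, data = d)
fit_poly <- glm(response ~ t_poly, family = binomial, data = d)
fit_sdif <- glm(response ~ t_sdif, family = binomial, data = d)

# Same number of parameters and same deviance; only the meaning of the
# individual coefficients changes.
print(c(treatment = deviance(fit_trt),
        poly      = deviance(fit_poly),
        sdif      = deviance(fit_sdif)))
print(names(coef(fit_poly))[1:4])
```

With contr.sdif() each time coefficient is the change in log-odds from
one time point to the next, which may be the easier reading when time
counts successive cycles.
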

> I tried something like that with the lmer function, only difference
> is that I had as random effect (1|sub/rep). I thought that it was
> the proper syntax for replicates nested within subjects, giving a
> random intercept for each subject and for each replicate within
> subject. Am I missing something?

No, my bad again.  It should be sub/rep.
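
A small sketch of what sub/rep groups by, under an assumption about the
coding that the thread does not confirm: the replicate labels 1..36 are
reused in every condition. The data frame layout and the name trial are
hypothetical.

```r
# Hypothetical layout for the design described in the thread.
d <- expand.grid(time = factor(1:10), rep = factor(1:36),
                 expcond = factor(1:5), sub = factor(1:9))
print(nrow(d))   # 9 * 5 * 36 * 10 binary observations

# (1 | sub/rep) is shorthand for (1 | sub) + (1 | sub:rep).
# If the labels 1..36 are reused across conditions, sub:rep pools the
# five conditions into a single group per subject and label.
n_sub_rep <- nlevels(interaction(d$sub, d$rep, drop = TRUE))

# A label that is unique per (subject, condition, replicate) gives one
# group per 10-point series instead.
d$trial <- interaction(d$sub, d$expcond, d$rep, drop = TRUE)
n_trial <- nlevels(d$trial)

print(c(sub_rep = n_sub_rep, trial = n_trial))
print(table(table(d$trial)))   # number of observations per trial
```

If the replicate column is already unique per series, (1|sub/rep) and
(1|sub/trial) describe the same grouping and nothing needs to change.
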

>> As a more complete model you could consider
>>
>>  response ~ time + expcond + (time|rep/sub) + (expcond|sub)
> With such a model where expcond is also used to define the random
> effect structure, can I use the anova function to compare it to the
> following "null model": response ~ time + (time|rep/sub) +
> (expcond|sub) and make a statement on the significance of the effect
> of the experiment condition?

Yes.
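
A minimal sketch of that comparison, with several assumptions: the data
are simulated, the design is cut to 6 replicates so that it runs
quickly, the trial identifier is hypothetical, and only random
intercepts are used rather than the (time|...) and (expcond|sub) terms
above. The two fits differ only in the fixed effect of expcond, so
anova() gives a likelihood ratio test for it.

```r
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(2013)
  # Reduced hypothetical design: 9 subjects x 5 conditions x 6 replicates.
  d <- expand.grid(time = factor(1:10), rep = 1:6,
                   expcond = factor(1:5), sub = factor(1:9))
  d$trial <- interaction(d$sub, d$expcond, d$rep, drop = TRUE)

  # Simulated subject- and trial-level random intercepts (made-up values).
  b_sub   <- rnorm(nlevels(d$sub),   sd = 0.8)
  b_trial <- rnorm(nlevels(d$trial), sd = 0.5)
  eta <- -1 + 0.25 * as.integer(d$time) + 0.4 * (d$expcond != "1") +
    b_sub[as.integer(d$sub)] + b_trial[as.integer(d$trial)]
  d$response <- rbinom(nrow(d), size = 1, prob = plogis(eta))

  full <- lme4::glmer(response ~ time + expcond + (1 | sub/trial),
                      family = binomial, data = d)
  null <- lme4::glmer(response ~ time + (1 | sub/trial),
                      family = binomial, data = d)

  # Same random-effects structure in both fits; 4 extra fixed-effect
  # parameters in the full model (5 conditions minus the baseline).
  lrt <- anova(null, full)
  print(lrt)
}
```
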
-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-
models-bounces at r-project.org] On Behalf Of Ben Bolker
Sent: Wednesday, October 09, 2013 11:46 PM
To: r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] model for clustered longitudinal binary data

> I forgot to say that unless you are explicitly interested in the estimated
> correlation structure, you could hope to get around this by fitting the model
> without correlation and then showing that the temporal autocorrelation in
> the residuals is negligible ....

That would indeed be nice.
However, I was advised to avoid looking at residuals when fitting
logistic mixed models to binary data, and I'm actually not sure what
they represent. With a normal (linear) mixed model, I can recover my
observed data by adding up the fitted values and the residuals, but this
is not the case with logistic regression.
Therefore I'm wondering what they really represent and whether looking
at their autocorrelation will give me the information I expect.
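
A sketch of the residual types in question and of a lag-1 check within
trials. It uses simulated, temporally independent data and a plain glm
(no random effects) so that it runs with base R only; the variable names
are made up. By default residuals() on a binomial glm returns deviance
residuals, which do not add up to the data with the fitted values;
asking for type = "response" does.

```r
set.seed(7)
n_trial <- 1620
d <- expand.grid(time = 1:10, trial = seq_len(n_trial))  # time varies fastest
# Independent over time by construction (hypothetical effect sizes).
d$response <- rbinom(nrow(d), size = 1, prob = plogis(-1 + 0.3 * d$time))

fit   <- glm(response ~ factor(time), family = binomial, data = d)
p_hat <- fitted(fit)   # fitted probabilities

# Response residuals: observed minus fitted probability.
r_resp <- residuals(fit, type = "response")
print(all.equal(unname(r_resp + p_hat), as.numeric(d$response)))

# Pearson residuals: response residuals divided by the binomial
# standard deviation sqrt(p * (1 - p)).
r_pear <- residuals(fit, type = "pearson")
print(all.equal(unname(r_pear),
                unname((d$response - p_hat) / sqrt(p_hat * (1 - p_hat)))))

# One row per trial, one column per time point; correlate each residual
# with the next one in the same trial.
R    <- matrix(r_pear, ncol = 10, byrow = TRUE)
lag1 <- cor(as.vector(R[, -10]), as.vector(R[, -1]))
print(lag1)
```

For a model fitted with glmer, residuals() also takes a type argument,
and the fitted values then include the estimated random effects, so a
lag-1 correlation near zero would mean little serial dependence is left
after the random effects are accounted for.
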


> As a more complete model you could consider
>
>  response ~ time + expcond + (time|rep/sub) + (expcond|sub)

Although this model seems nice, I'm reaching the maximum number of
iterations without convergence, so I'll probably have to go for
something a bit simpler.
