Accounting for overdispersion in a mixed-effect model with a proportion response variable and categorical explanatory variables.

2 messages · Richard Friedman, Ben Bolker

Thu, Apr 21, 2011 12:26 PM #

Dear R-help-list,

	I have a problem in which the explanatory variables are categorical,  
the response variable is  a proportion, and experiment contains  
technical replicates (pseudoreplicates) as well as biological  
replicated. I am new to both generalized linear models and mixed- 
effects models and would greatly appreciate the advice of experienced  
analysts in this matter.

I analyzed the data in 4 ways and want to know which is the best way.  
The 4 ways are:

1.	A generalized linear model with binomial error in which the  
positive and negative counts for each biological replicate is summed  
over technical replicates.
2.	Same as 1 with a quasibinomial error model.
3.	A generalized linear mixed-effects model with binomial error in  
which technical replication is treated as a random effect.
4.	A generalized linear mixed-effects model with binomial error in  
which technical replication is treated as a random effect and  
overdispersion is taken into account by individual level variability.

Here are the relevant data for each model:

For everything:

 > sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lme4_0.999375-39   Matrix_0.999375-50 lattice_0.19-23

loaded via a namespace (and not attached):
[1] grid_2.13.0   nlme_3.1-100  stats4_2.13.0 tools_2.13.0



treatment Apositivesum Bnegativesum
1     A	208             439
2     A	215             395
3     A	235             411
4     A	304             450
5     A	215             395
6     B   224             353
7     B   279             405
8     B   265             418
9     B   278             392
10    B   249             383
11     C        196             385
12     C      NA              NA
13     C        266             397
14     C       216             460
15     C        264             419
16    D         283             401
17    D              NA              NA
18    D             270             410
19    D             248             316
20    D             302             386

1.	A generalized linear model with binomial error in which the  
positive and negative counts for each biological replicate is summed  
over technical replicates.

 > y<-cbind(Apositivesum , Bnegativesum)
 > model<-glm(y ~ treatment, binomial)
 > summary(model)

Call:
glm(formula = y ~ treatment, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3134  -0.5712  -0.3288   0.8616   2.4352

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)         -0.574195   0.036443 -15.756  < 2e-16 ***
treatmentB  0.164364   0.051116   3.216  0.00130 **
treatmentC  0.007025   0.054696   0.128  0.89780
treatmentD  0.258135   0.053811   4.797 1.61e-06 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 60.467  on 17  degrees of freedom
Residual deviance: 28.711  on 14  degrees of freedom
  (2 observations deleted due to missingness)
AIC: 160.35

Number of Fisher Scoring iterations: 3

Since Residual deviance >> degrees of freedom I tried


2.	Same as 1 with a quasibinomial error model.

 > model<-glm(y ~ treatment, quasibinomial)
 > summary(model)

Call:
glm(formula = y ~ treatment, family = quasibinomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3134  -0.5712  -0.3288   0.8616   2.4352

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -0.574195   0.052173 -11.006 2.82e-08 ***
treatmentB  0.164364   0.073180   2.246  0.04136 *
treatmentC  0.007025   0.078306   0.090  0.92978
treatmentD  0.258135   0.077038   3.351  0.00476 **
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

(Dispersion parameter for quasibinomial family taken to be 2.049598)

    Null deviance: 60.467  on 17  degrees of freedom
Residual deviance: 28.711  on 14  degrees of freedom
  (2 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 3

 > anova(model,test=F)
Error in match.arg(test) : 'arg' must be NULL or a character vector
 > anova(model,test="F")
Analysis of Deviance Table

Model: quasibinomial, link: logit

Response: y

Terms added sequentially (first to last)


            Df Deviance Resid. Df Resid. Dev      F  Pr(>F)
NULL                           17     60.467
treatment  3   31.756        14     28.711 5.1646 0.01303 *
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
 >


I then tried to take the effect of pseudoreplication into account with  
3:

3.	A generalized linear mixed-effects model with binomial error in  
which technical replication is treated as a random effect.


Here is the input file:

treatment	mouse	observation	positive	negative
A	1	1	73	149
A	1	2	50	129
A	1	3	85	161
A	2	1	73	139
A	2	2	89	144
A	2	3	53	112
A	3	1	69	97
A	3	2	82	128
A	3	3	84	186
A	4	1	111	145
A	4	2	78	146
A	4	3	115	159
A	5	1	74	133
A	5	2	82	153
A	5	3	59	109
B	1	1	90	146
B	1	2	58	108
B	1	3	76	99
B	2	1	105	136
B	2	2	99	139
B	2	3	75	130
B	3	1	95	160
B	3	2	95	135
B	3	3	75	123
B	4	1	95	129
B	4	2	101	130
B	4	3	82	133
B	5	1	63	109
B	5	2	86	132
B	5	3	100	142
C	1	1	72	128
C	1	2	57	137
C	1	3	67	120
C	2	1	86	110
C	2	2	79	121
C	2	3	101	166
C	3	1	82	231
C	3	2	60	125
C	3	3	74	104
C	4	1	90	155
C	4	2	84	141
C	4	3	90	123
D	1	1	91	107
D	1	2	101	183
D	1	3	91	111
D	2	1	79	146
D	2	2	97	155
D	2	3	94	109
D	3	1	69	88
D	3	2	84	107
D	3	3	95	121
D	4	1	92	127
D	4	2	112	140
D	4	3	98	119

y<-cbind(positive,negative)
treatment<-factor(treatment)
mouse<-factor(mouse)
model<-lmer(y~treatment+(1|treatment/mouse),family=binomial)
 > data<-read.table(mixedinp.txt",header=T,sep="\t")
 > dim(data)
[1] 54  5
 > dim(data)
[1] 54  5
 > y<-cbind(positive,negative)


 > treatment<-factor(treatment)
 > mouse<-factor(mouse)
 > model<-lmer(y~treatment+(1|treatment/mouse),family=binomial)
 > summary(model)
Generalized linear mixed model fit by the Laplace approximation
Formula: y ~ treatment + (1 | treatment/mouse)
   AIC   BIC logLik deviance
91.85 103.8 -39.93    79.85
Random effects:
Groups          Name        Variance  Std.Dev.
mouse:treatment (Intercept) 0.0039316 0.062702
treatment       (Intercept) 0.0000000 0.000000
Number of obs: 54, groups: mouse:treatment, 18; treatment, 4

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -0.577592   0.046034 -12.547  < 2e-16 ***
treatmentB  0.166954   0.064749   2.578 0.009923 **
treatmentC  0.008545   0.069065   0.124 0.901537
treatmentD  0.262350   0.068364   3.838 0.000124 ***
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

Correlation of Fixed Effects:
            (Intr) tBPBS_ tCPBS_
trt -0.711
trt -0.667  0.474
trt -0.673  0.479  0.449


Since the non-mixed effects model showed evidence of overdispersion,  
and since there is no quasibinomial option in lme4, I tried to account  
for overdispesion by including individual level variability with..



4.	A generalized linear mixed-effects model with binomial error in  
which technical replication is treated as a random effect and  
overdispersion is taken into account by individual level variability.

 > data$obs<-1:nrow(data)
 > names(data)
[1] "treatment"   "mouse"       "observation" "positive"    "negative"
[6] "obs"
 > detach(data)
 > attach(data)
The following object(s) are masked _by_ '.GlobalEnv':

    mouse, treatment
 > model2<-lmer(y~treatment+(1|treatment/mouse)+(1| 
obs) ,family=binomial)
Number of levels of a grouping factor for the random effects
is *equal* to n, the number of observations
 > summary(model2)
Generalized linear mixed model fit by the Laplace approximation
Formula: y ~ treatment + (1 | treatment/mouse) + (1 | obs)
   AIC   BIC logLik deviance
89.75 103.7 -37.88    75.75
Random effects:
Groups          Name        Variance   Std.Dev.
obs             (Intercept) 1.0529e-02 1.0261e-01
mouse:treatment (Intercept) 1.3158e-12 1.1471e-06
treatment       (Intercept) 0.0000e+00 0.0000e+00
Number of obs: 54, groups: obs, 54; mouse:treatment, 18; treatment, 4

Fixed effects:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -0.57842    0.04522 -12.792  < 2e-16 ***
treatmentB  0.16590    0.06356   2.610  0.00904 **
treatmentC  0.01642    0.06784   0.242  0.80878
treatmentD  0.26677    0.06710   3.976 7.01e-05 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

Correlation of Fixed Effects:
            (Intr) tBPBS_ tCPBS_
trtB -0.711
trtC -0.666  0.474
trtD -0.674  0.479  0.449
 >

I then compared the 2 models.

 > anova(model,model2,test="F")
Data:
Models:
model: y ~ treatment + (1 | treatment/mouse)
model2: y ~ treatment + (1 | treatment/mouse) + (1 | obs)
       Df    AIC    BIC  logLik  Chisq Chi Df Pr(>Chisq)
model   6 91.853 103.79 -39.926
model2  7 89.750 103.67 -37.875 4.1023      1    0.04282 *
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1



My questions:

1.	Am I correct that, of the 4 models,  I should use the mixed effect  
model with individual variability?
Although both models make the same effects to be significant, I would  
like to know which one I should report and use as input to a  
subsequent multiple comparisons analysis with multiicomp.
2.	If I am incorrect what model should I use and why?
3.	I would appreciate any  further suggestions for analyzing this data.

Thanks and best wishes,
Rich
------------------------------------------------------------
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

I am a Bayesian. When I see a multiple-choice question on a test and I  
don't
know the answer I say "eeney-meaney-miney-moe".

Rose Friedman, Age 14

Ben Bolker

Thu, Apr 21, 2011 5:49 PM #

Richard Friedman <friedman <at> cancercenter.columbia.edu> writes:

Off the top of my head, I would say that #4 is best.

X <- read.table("mouse.dat",header=TRUE,
colClasses=rep(c("factor","numeric"),c(3,2)))
Xagg <- aggregate(cbind(positive,negative)~treatment+mouse,data=X,FUN=sum)
## gets your aggregated data

model <- glm(cbind(positive,negative)~treatment,family=binomial,data=Xagg)

  If all the assumptions of the model were actually met this might be OK (since
if all observations on all mice were really independent, the sum of binomials
would be binomial) but the residual deviance suggests overdispersion (dev/df
approx. 2)

(note that your 2 NA observations got dropped during my aggregation
step)

yes (I'm working through this before I see your comments, but
you seem to be on the right track)

Seems reasonable.

[snip]

X$obs<-1:nrow(X)

library(lme4)
## it doesn't make sense to include treatment in the random effect
##  since it is already present as a fixed effect in the model
## "mouse within treatment" (= mouse:treatment) is what you want
model2<-lmer(cbind(positive,negative)~treatment+(1|mouse:treatment)+
             (1|obs) ,data=X, family=binomial)
## note that mouse:treatment variance comes out to zero

model2B<-lmer(cbind(positive,negative)~treatment+(1|mouse:treatment),
             data=X, family=binomial)

anova(model2,model2B)
## and yet model with observation-level RE appears significantly better
## (even though this is a conservative test because the null hypothesis
## is on the boundary)


I then compared the 2 models.

##   note that test="F" is ignored

I think so.  It may however comfort you to see that the coefficients
and their standard errors are practically the same under each model.

  For the gold standard you may want to do parametric bootstrapping
(see ?refit in the most recent (r-forge) version of lme4)

  I would suggest that future questions along these lines go to the
r-sig-mixed-models mailing list ...

  Ben Bolker