
comparing proportions

4 messages · Robert A LaBudde, array chip, Mike Marchywka

#
1. If you use a random effects model, you should make Subject the 
random factor, i.e., a random-intercepts model with (1|Subject). Group 
is a fixed effect: you have only 2 groups. Even if you had more than 
2 groups, treating Group as random would return a standard deviation, 
not the P-value you wanted. Finally, I doubt you believe the groups 
used are meaningless and that only the population of groups is of 
interest. Instead you consider them special, so Group is a fixed effect.

2. The number of observations for each Subject is the number of 
trials, which you previously indicated were 7 to 10 in the cases listed.

3. If you have no interest in the Subject effect, you can use a fixed 
Subject factor instead with glm() instead of glmer() or other mixed 
model function. This is a good idea so long as the number of subjects 
is, say, less than 10. Otherwise a mixed model would be a better idea.

I suggest you fit all three models to learn about what you're doing: 
1) glmer() or equivalent, with cbind(successes, failures) ~ Group 
+ (1|Subject); 2) glm() with cbind(successes, failures) ~ Subject + Group; 
and 3) lm(p ~ Subject + Group), where p is the proportion of successes 
for a particular subject and group.

Then compare the results. They will probably all three give the same 
conclusion for the hypothesis question about Group. I would guess the 
glmer() P-value will be largest, then the glm() and finally the lm(), 
but the last two may reverse. The lm() model may actually perform 
fairly well, as the Edgeworth series converges rapidly to normal for 
binomial distributions with p within 0.15 to 0.85 and 10+ replicates, 
as I stated before.
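That convergence claim is easy to check by simulation; a rough sketch (the p and n values here are arbitrary picks inside the stated range, not taken from the data):

```r
set.seed(1)
p <- 0.3; n <- 10                        # p within 0.15-0.85, 10 replicates
phat <- rbinom(1e5, n, p) / n            # simulated sample proportions
# mean and sd agree closely with the normal approximation N(p, p(1-p)/n)
round(c(mean = mean(phat), sd = sd(phat),
        normal_sd = sqrt(p * (1 - p) / n)), 4)
```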

I'd be interested in seeing the results of these 3 fits myself just 
for curiosity.
At 01:21 PM 2/10/2011, array chip wrote:
================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"
#
Robert, thank you!

I tried all 3 models you suggested. Since each subject only has one line of data 
in my dataset, would including Subject as a factor in glm() or lm() lead to 0 df 
for residuals?

Attached is my dataset and here is my version of the 3 models:

test<-read.table("test.txt",sep='\t',header=T)
library(lme4)

## model 1: generalized mixed model
fit1<-glmer(cbind(success,failure)~group+(1|subject),data=test,family=binomial)

## model 2: generalized model
fit2<-glm(cbind(success,failure)~group,data=test,family=binomial)

## model 3: linear model
fit3<-lm(prop.success~group,weights=success+failure,data=test)
Generalized linear mixed model fit by the Laplace approximation 
Formula: cbind(success, failure) ~ group + (1 | subject) 
   Data: test 
   AIC   BIC logLik deviance
 54.75 57.89 -24.38    48.75
Random effects:
 Groups  Name        Variance Std.Dev.
 subject (Intercept) 1.8256   1.3511
Number of obs: 21, groups: subject, 21
Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.6785     0.4950  -1.371    0.170
groupgroup2  -0.7030     0.6974  -1.008    0.313

Call:
glm(formula = cbind(success, failure) ~ group, family = binomial, 
    data = test)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8204  -2.0789  -0.5407   1.0403   4.0539  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.4400     0.2080  -2.115   0.0344 *
groupgroup2  -0.6587     0.3108  -2.119   0.0341 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78.723  on 20  degrees of freedom
Residual deviance: 74.152  on 19  degrees of freedom

Call:
lm(formula = prop.success ~ group, data = test, weights = success + 
    failure)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1080 -0.7071 -0.2261  0.4754  1.9157 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.39175    0.08809   4.447 0.000276 ***
groupgroup2 -0.14175    0.12364  -1.146 0.265838    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.8676 on 19 degrees of freedom
Multiple R-squared: 0.0647,     Adjusted R-squared: 0.01548 
F-statistic: 1.314 on 1 and 19 DF,  p-value: 0.2658 
As you can see, all 3 models gave quite different results: model 2 (the 
generalized linear model) gave a significant p-value, while model 1 (the 
generalized mixed model) and model 3 (the ordinary linear model) did not. Also, 
as you can see from the data, prop.success has some values outside the range 
0.15 to 0.85, so maybe the ordinary linear model is not appropriate?

Thank you,

John
#
I had a longer draft before, but I'll just ask: did you ever look at your data?

cr <- rainbow(3)
# color/size points by group; x = trials per subject, y = proportion of successes
plot(test$success + test$failure, test$prop.success,
     col = cr[as.numeric(test$group)], cex = as.numeric(test$group))


You'd have to suspect that p is a function of the number of trials per subject,
and that differences between groups could be an artifact of how their trial counts
were distributed. Speculating further, one group looks like a candidate for a linear
fit, while the other suggests a non-monotonic curve, consistent with "learning" over more
trials and maybe fatigue in both groups.
This wouldn't show up in your lm() model if I read it right, and indeed the weights may not
help at all. But you could try using the number of trials as an independent variable
and look for something of the form p ~ a*n + b*n^2, perhaps in a nonlinear model.
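That idea can also be tried on the binomial scale with glm() rather than a separate nonlinear fit. The sketch below simulates counts in place of the attached test.txt (the numbers and the assumed "learning then fatigue" curve are made up; only the model formula is the point):

```r
set.seed(1)
n_subj  <- 21
trials  <- sample(5:30, n_subj, replace = TRUE)          # trials per subject
group   <- factor(rep(c("group1", "group2"), c(11, 10)))
p_true  <- 0.2 + 0.02 * trials - 4e-04 * trials^2        # assumed learning/fatigue
success <- rbinom(n_subj, trials, p_true)
failure <- trials - success

# logistic model with linear and quadratic trial-count terms alongside group
fit <- glm(cbind(success, failure) ~ group + trials + I(trials^2),
           family = binomial)
summary(fit)
```

A significant quadratic term here would support the non-monotonic pattern seen in the plot.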

As long as you are just doing post hoc analysis with quite limited data, and even if you had a specific plan
figured out ahead of time, it is always a good idea to plot your data and try many
different tests and relationships among your variables. Presumably all this
stuff is judged by how well it helps you understand the system of interest.



----------------------------------------
Date: Thu, 10 Feb 2011 14:17:29 -0800
From: arrayprofile at yahoo.com
To: ral at lcfltd.com
CC: R-help at stat.math.ethz.ch; gunter.berton at gene.com
Subject: Re: [R] comparing proportions


#
You need to change models 2 and 3 to use ~ group 
+ subject. You left subject out as a fixed factor.
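Note, though, that with a single row per subject (as John anticipated), a fixed Subject factor saturates these models: group becomes confounded with subject, lm() is left with 0 residual df, and glm() reports aliased coefficients. A quick sketch with simulated counts (variable names follow the thread; the data are made up, not the attached test.txt) shows the effect:

```r
set.seed(1)
subject <- factor(1:21)                          # one row per subject
group   <- factor(rep(c("group1", "group2"), c(11, 10)))
trials  <- sample(7:10, 21, replace = TRUE)
success <- rbinom(21, trials, 0.35)
failure <- trials - success

# subject as a fixed factor: one parameter per row, so the fit is saturated
fit2 <- glm(cbind(success, failure) ~ group + subject, family = binomial)
fit3 <- lm(I(success / trials) ~ group + subject, weights = trials)

fit2$df.residual   # 0: no residual df left
fit3$df.residual   # 0: no residual variation to test group against
```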
At 05:17 PM 2/10/2011, array chip wrote: