
modelling proportions, with aggregated data, and the new/old lme4

11 messages · Malcolm Fairbrother, Rolf Turner, Joerg Luedicke +3 more

#
Dear list,

I have a question for which a more theoretical or more technical answer may be helpful--I'm not sure. Any assistance would be appreciated.

I am trying to model Census data, where individual responses to a question with two answer options have been aggregated up to the community level, and communities are nested in subdistricts. So the outcome of interest is a proportion (ranging from 0 to 1), and I know the absolute numbers of "successes" and "failures". For each community, the data are available in a slightly disaggregated form for different categories of people (I'll use the separate numbers of successes and failures for women and men as an example). Thus the data look like:

head(dat)
  successes failures id sex subdist
1       560      726  1   F       4
2       844      510  1   M       4
3       340      438  2   F       4
4       616      273  2   M       4
5         7        0  3   F       4
6         3        1  3   M       4

In community #1, which is in subdistrict #4, there are 560 women who are "successes" and 726 who are "failures" on this social indicator, etc. (The data are available via a link below.)

How should I model these data? I was originally thinking to use a binomial distribution, and a call defining the outcome as "cbind(successes, failures)", with each observation/row nested in (a) "id" and (b) "subdist".

(The idea of nesting the separate Census categories like this, and using a multilevel approach, comes from: http://www.hsph.harvard.edu/faculty/sv-subramanian/files/epa2001_33_3_399_417.pdf. There was a recent discussion of the use of "cbind" like this, but it seems workable in this case as a way to reduce tens of millions of observations to a few tens of thousands, with no loss of information: https://stat.ethz.ch/pipermail/r-sig-mixed-models/2012q1/007552.html.)

However, I gather that the binomial distribution is only appropriate where one is sampling a number of Bernoulli trials (i.e., people in this case) with replacement, making them independent. As I understand it, even if the sampling is not actually done with replacement, if N >> n (say at least 10x larger) then the binomial approach can be justified, as the trials are independent enough to be treated as such. However, in my case, I have no sample at all--I have data on the population--so the binomial approach appears to be a non-starter. Is that right?

I further gather that the less-well-known hypergeometric distribution is really most appropriate in cases where sampling is done without replacement, though I believe that neither lme4 nor MCMCglmm allows for this distribution, and again I have population rather than sample data.

I have a vague idea that another possibility would be to use the logit of each row (e.g., log(560/726) for the first row), and then simply model the logit with Normal errors. But what would I do then with rows that have proportions of 0 or 1? And, setting that issue aside, is the logit in principle an appropriate way to go?
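
(For reference, one workaround I've seen for the 0/1 problem is the empirical logit, which adds 0.5 to both counts before taking logs; this is a sketch of my own, not something I've settled on, and the commented lmer call is hypothetical:)

```r
# Empirical-logit sketch (my own illustration, not code from the analysis):
# adding 0.5 to both counts keeps proportions of exactly 0 or 1 finite,
# so the result can be modelled with Normal errors.
emp.logit <- function(successes, failures) {
  log((successes + 0.5) / (failures + 0.5))
}
emp.logit(560, 726)   # first row of dat
emp.logit(7, 0)       # an all-successes row no longer produces Inf
# hypothetical model call:
# M.el <- lmer(emp.logit(successes, failures) ~ sex + (1 | subdist), dat)
```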

Can anybody suggest a way forward, and/or explain where my thinking above has gone wrong? For a final twist, the code below shows the binomial approach, where the old lme4 (now lme4.0) quickly returns seemingly sensible results, but the new lme4 (formerly lme4Eigen) returns an interesting error.

Many thanks,
Malcolm


load(url("http://dl.dropbox.com/u/46385700/dat.RData"))
closeAllConnections()

M1 <- glm(cbind(successes, failures) ~ sex, dat, family=binomial)
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))), plogis(cumsum(coef(M1))))

library(lme4.0) # lme4.0_0.9999-1

system.time(M2 <- lmer(cbind(successes, failures) ~ sex + (1 | subdist), dat, family=binomial))
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))), plogis(cumsum(fixef(M2))))

system.time(M3 <- lmer(cbind(successes, failures) ~ sex + (1 | id) + (1 | subdist), dat, family=binomial))
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))), plogis(cumsum(fixef(M3))))

detach(package:lme4.0)
library(lme4) # lme4_0.99990234375-0

# however, with the new lme4 (former lme4Eigen) I get an error:

system.time(M4 <- lmer(cbind(successes, failures) ~ sex + (1 | subdist), dat, family=binomial))
Error in fn(nM$xeval()) : 
  step factor reduced below 0.001 without reducing pwrss
Timing stopped at: 16.159 0.639 16.787
#
Malcolm Fairbrother <m.fairbrother at ...> writes:
[snip]
you mean 510, right?
Hmmm. For better or worse, I've seen the binomial distribution
used in (I think) a reasonable way in cases where strict independence
is not reasonable (e.g. in ecological predation trials where the
prey are all in a single tank with the predator).  Technically you're
assuming both independence *and* homogeneity ... I would probably do
it this way, perhaps testing for overdispersion and/or adding
an individual-level random effect.
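
A minimal sketch of those two suggestions (the toy data frame stands in for the real `dat`, using the rows shown earlier in the thread; the deviance/df ratio is a rough rule of thumb, not a formal test):

```r
# toy stand-in for the thread's `dat`, built from the head(dat) shown above
dat2 <- data.frame(successes = c(560, 844, 340, 616, 7, 3),
                   failures  = c(726, 510, 438, 273, 0, 1),
                   sex       = c("F", "M", "F", "M", "F", "M"),
                   subdist   = 4)

# rough overdispersion check: residual deviance relative to residual
# degrees of freedom (values well above 1 suggest extra-binomial variation)
M.disp <- glm(cbind(successes, failures) ~ sex, dat2, family = binomial)
deviance(M.disp) / df.residual(M.disp)

# observation-level random effect: one grouping level per row, so the
# random intercept can absorb overdispersion
dat2$obs <- factor(seq_len(nrow(dat2)))
# M.olre <- lmer(cbind(successes, failures) ~ sex + (1 | subdist) + (1 | obs),
#                dat2, family = binomial)
```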
I think this would be a lot harder ...
I don't think it's ridiculous, but yes, you have to deal with the
0/1 cases.  This fussing-with-zeros-and-ones stuff applies even if you
were to use a Beta regression (which you can do via glmmADMB, although
it'll probably be a lot slower than lme4), which would in some sense
be a more principled way to deal with proportion data.  I like the
idea of keeping the denominators in there and using a binomial model.
I don't think you're missing anything obvious, though.
For a quick answer, try optimizer="bobyqa", and/or varying tolPwrss
settings, but please be very cautious/compare your answers.  We're still
working on adjusting optimizer choice/settings to make GLMMs in the
new version fast, robust, and accurate ...
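
(In released lme4, version >= 1.0, these settings are passed through `glmerControl`; a hedged sketch, where the tolPwrss value is purely illustrative:)

```r
library(lme4)
# optimizer and tolPwrss are set via glmerControl in lme4 >= 1.0;
# the tolPwrss value here is illustrative, not a recommendation
ctrl <- glmerControl(optimizer = "bobyqa", tolPwrss = 1e-7)
# M4 <- glmer(cbind(successes, failures) ~ sex + (1 | subdist),
#             dat, family = binomial, control = ctrl)
```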

  Ben
#
On 19/03/12 07:10, Ben Bolker wrote:
<SNIP>
<SNIP>

Sure looks like 560 to me.  Time for a trek to the optometrist, Ben?

     cheers,

         Rolf
#
I would certainly check out a Poisson model with the number of
successes as outcome and successes+failures as an offset.

Joerg
On Sun, Mar 18, 2012 at 11:51 AM, Rolf Turner <r.turner at auckland.ac.nz> wrote:
#
Joerg Luedicke <joerg.luedicke at ...> writes:
That seems odd to me; the Poisson+offset model
should be appropriate when p<<1 (i.e. for very small p,
the Poisson variance mu=n*p is approximately the same as
the binomial variance n*p*(1-p)); in this case p is not small.

  Of course, I could be wrong.
[snip]
Looked at the wrong line (failures in men rather than successes in women).
#
Joerg,

What is the point of the Poisson model in that form (with
log(total) as offset)? It is equivalent to a binomial model: if you observe
N events from a Poisson distribution, of which X are successes (or
marked in some way) with probability p, then X given N will be Binomial(N, p).

Best regards,
Fredrik Nilsson

PS. Here's a practical "proof"; the real one is sketched below.
invlogit <- function(x) exp(x) / (1 + exp(x))

# simulate binomial data with varying denominators and probabilities
n  <- 10
nn <- ceiling(exp(rnorm(n) + 3))    # denominators (totals)
pp <- seq(0.25, 0.75, length = n)   # success probabilities
ca <- rbinom(n, nn, prob = pp)      # successes
co <- nn - ca                       # failures
fa <- gl(n, 1)                      # one factor level per observation

# binomial model
test.glm <- glm(cbind(ca, co) ~ fa - 1, family = binomial)
# conditional Poisson with log(total) offset
ptest.glm <- glm(ca ~ fa - 1 + offset(log(nn)), family = poisson)
summary(ptest.glm)
exp(coef(ptest.glm))
invlogit(coef(test.glm))

all.equal(exp(coef(ptest.glm)), invlogit(coef(test.glm)))

# proof.
If X ~ Poisson(a) and Y ~ Poisson(b) with X and Y independent, then
X+Y ~ Poisson(a+b) (e.g. by generating functions). Hence

P(X=x | X+Y=n) = P(X=x, Y=n-x) / P(X+Y=n)
              = P(X=x) P(Y=n-x) / P(X+Y=n)
              = [a^x exp(-a)/x!] [b^(n-x) exp(-b)/(n-x)!] / [(a+b)^n exp(-(a+b))/n!]
              = n!/(x! (n-x)!) (a/(a+b))^x (b/(a+b))^(n-x)
              = n!/(x! (n-x)!) (a/(a+b))^x (1 - a/(a+b))^(n-x),

i.e. Binomial(n, p) with p = a/(a+b).


2012/3/19 Ben Bolker <bbolker at gmail.com>:
#
Hi Ben and Fredrik,

Thanks for the comments and the proof! Next time I will be less hasty
in responding...

Cheers,

Joerg
On Mon, Mar 19, 2012 at 3:09 AM, Fredrik Nilsson <laf.nilsson at gmail.com> wrote:
#
Joerg Luedicke <joerg.luedicke at ...> writes:
Don't forget that I was wrong too (I forgot about the conditioning). 
 I know that this (i.e. using the Poisson model with an offset to
model binomial data) is a common practice, maybe it was done in the past for
computational reasons?

library("fortunes")
fortune("great-great grandchildren")

  cheers
    Ben
1 day later
#
Not sure what I'm missing here, but I'm not finding the offset Poisson and binomial to be equal with my dataset.

R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
 [1] stats4    splines   tcltk     stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] lme4_0.999375-42   bbmle_1.0.4.1      numDeriv_2010.11-1 R2admb_0.7.5       Hmisc_3.9-1        survival_2.36-10   NCStats_0.2-7      sciplot_1.0-9
 [9] mgcv_1.7-13        Matrix_1.0-3       lattice_0.20-0     MASS_7.3-16        AED_1.0            circular_0.4-3     boot_1.3-4         plotrix_3.3-3

loaded via a namespace (and not attached):
 [1] car_2.0-12        cluster_1.14.1    gamm4_0.1-5       gdata_2.8.2       glmmADMB_0.7.2.5  gplots_2.10.1     grid_2.14.1       gtools_2.6.2      multcomp_1.2-9
[10] nlme_3.1-103      TeachingDemos_2.7 tools_2.14.1

         a          b          c        a:b        a:c        b:c
0.03225038 0.15195288 1.40174126 3.48192066 1.01892662 0.97212475
         a          b          c        a:b        a:c        b:c
0.03158208 0.13227007 0.58381656 0.77643718 0.50446378 0.49263445
[1] "Mean relative difference: 0.6428341"

Could it be related to the number of zeros?

Adam Smith
Dept. Natural Resources Science
University of Rhode Island
#
Sorry, the binomial model should have read:

b.glm <- glm(cbind(success,total-success) ~ (a+b+c)^2 - 1, family="binomial", data=test)

But the outcome is little changed...

Adam