Dear list,

I have a question for which a more theoretical or more technical answer may be helpful--I'm not sure. Any assistance would be appreciated.

I am trying to model Census data, where individual responses to a question with two answer options have been aggregated up to the community level, and communities are nested in subdistricts. So the outcome of interest is a proportion (ranging from 0 to 1), and I know the absolute number of "successes" and "failures". For each community, the data are available in a slightly disaggregated form for different categories of people (I'll use the separate numbers of successes and failures for women and men as an example). Thus the data look like:

head(dat)
  successes failures id sex subdist
1       560      726  1   F       4
2       844      510  1   M       4
3       340      438  2   F       4
4       616      273  2   M       4
5         7        0  3   F       4
6         3        1  3   M       4

In community #1, which is in subdistrict #4, there are 560 women who are "successes" and 726 who are "failures" on this social indicator, etc. (The data are available via a link below.)

How should I model these data? I was originally thinking to use a binomial distribution, and a call defining the outcome as "cbind(successes, failures)", with each observation/row nested in (a) "id" and (b) "subdist". (The idea of nesting the separate Census categories like this, and using a multilevel approach, comes from: http://www.hsph.harvard.edu/faculty/sv-subramanian/files/epa2001_33_3_399_417.pdf. There was a recent discussion of the use of "cbind" like this, but it seems workable in this case as a way to reduce tens of millions of observations to a few tens of thousands, with no loss of information: https://stat.ethz.ch/pipermail/r-sig-mixed-models/2012q1/007552.html.)

However, I gather that the binomial distribution is only appropriate where one is sampling a number of Bernoulli trials (i.e., people in this case) with replacement, making them independent.
As I understand it, even if the sampling is not actually done with replacement, if N >> n (say at least 10x larger) then the binomial approach can be justified, as the trials are independent enough to be treated as such. However, in my case, I have no sample at all--I have data on the population--so the binomial approach appears to be a non-starter. Is that right?

I further gather that the less-well-known hypergeometric distribution is really most appropriate in cases where sampling is done without replacement, though I believe that neither lme4 nor MCMCglmm allows for this distribution, and again I have population rather than sample data.

I have a vague idea that another possibility would be to use the logit of each row (e.g., log(560/726) for the first row), and then simply model the logit with Normal errors. But what would I do then with rows that have proportions of 0 or 1? And, setting that issue aside, is the logit in principle an appropriate way to go?

Can anybody suggest a way forward, and/or explain where my thinking above has gone wrong? For a final twist, the code below shows the binomial approach, where the old lme4 (now lme4.0) quickly returns seemingly sensible results, but the new lme4 (formerly lme4Eigen) returns an interesting error.
Many thanks,
Malcolm

load(url("http://dl.dropbox.com/u/46385700/dat.RData"))
closeAllConnections()

M1 <- glm(cbind(successes, failures) ~ sex, dat, family=binomial)
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))),
      plogis(cumsum(coef(M1))))

library(lme4.0) # lme4.0_0.9999-1
system.time(M2 <- lmer(cbind(successes, failures) ~ sex + (1 | subdist), dat, family=binomial))
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))),
      plogis(cumsum(fixef(M2))))
system.time(M3 <- lmer(cbind(successes, failures) ~ sex + (1 | id) + (1 | subdist), dat, family=binomial))
cbind(as.numeric(by(dat, dat$sex, function(x) sum(x$successes)/(sum(x$successes)+sum(x$failures)))),
      plogis(cumsum(fixef(M3))))
detach(package:lme4.0)

library(lme4) # lme4_0.99990234375-0
# however, with the new lme4 (former lme4Eigen) I get an error:
system.time(M4 <- lmer(cbind(successes, failures) ~ sex + (1 | subdist), dat, family=binomial))
# Error in fn(nM$xeval()) : step factor reduced below 0.001 without reducing pwrss
# Timing stopped at: 16.159 0.639 16.787
modelling proportions, with aggregated data, and the new/old lme4
11 messages · Malcolm Fairbrother, Rolf Turner, Joerg Luedicke +3 more
Malcolm Fairbrother <m.fairbrother at ...> writes:
[snip]
I am trying to model Census data, where individual responses to a question with two answer options have been aggregated up to the community level, and communities are nested in subdistricts. So the outcome of interest is a proportion (ranging from 0 to 1), and I know the absolute number of "successes" and "failures". For each community, the data are available in a slightly disaggregated form for different categories of people (I'll use the separate numbers of successes and failures for women and men as an example). Thus the data look like:
head(dat)
  successes failures id sex subdist
1       560      726  1   F       4
2       844      510  1   M       4
3       340      438  2   F       4
4       616      273  2   M       4
5         7        0  3   F       4
6         3        1  3   M       4

In community #1, which is in subdistrict #4, there are 560 women
you mean 510, right?
who are "successes" and 726 who are "failures" on this social indicator, etc. (The data are available via a link below.)
How should I model these data? I was originally thinking to use a binomial distribution, and a call defining the outcome as "cbind(successes, failures)", with each observation/row nested in (a) "id" and (b) "subdist".
(The idea of nesting the separate Census categories like this, and using a multilevel approach, comes from: http://www.hsph.harvard.edu/faculty/sv-subramanian/files/epa2001_33_3_399_417.pdf. There was a recent discussion of the use of "cbind" like this, but it seems workable in this case as a way to reduce tens of millions of observations to a few tens of thousands, with no loss of information: https://stat.ethz.ch/pipermail/r-sig-mixed-models/2012q1/007552.html.)
However, I gather that the binomial distribution is only appropriate where one is sampling a number of Bernoulli trials (i.e., people in this case) with replacement, making them independent. As I understand it, even if the sampling is not actually done with replacement, if N >> n (say at least 10x larger) then the binomial approach can be justified, as the trials are independent enough to be treated as such. However, in my case, I have no sample at all--I have data on the population--so the binomial approach appears to be a non-starter. Is that right?
Hmmm. For better or worse, I've seen the binomial distribution used in (I think) a reasonable way in cases where strict independence is not reasonable (e.g. in ecological predation trials where the prey are all in a single tank with the predator). Technically you're assuming both independence *and* homogeneity ... I would probably do it this way, perhaps testing for overdispersion and/or adding an individual-level random effect.
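[Editorial aside, not part of the original thread: the overdispersion check Ben mentions is commonly done via the Pearson dispersion statistic, the Pearson chi-square divided by the residual degrees of freedom, which should be near 1 for genuinely binomial data. A base-R sketch on simulated data (all names and values here are illustrative):]

```r
# Sketch: detecting overdispersion in a binomial GLM via the Pearson statistic.
# Simulated data with row-level heterogeneity, so true variation is extra-binomial.
set.seed(7)
n  <- 200
nn <- rep(100, n)                      # trials per row
p  <- plogis(rnorm(n, 0, 1))           # heterogeneous success probabilities
s  <- rbinom(n, nn, p)
fit <- glm(cbind(s, nn - s) ~ 1, family = binomial)
# Pearson chi-square / residual df; ~1 for binomial data, >>1 when overdispersed
disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
disp
```

Here `disp` comes out far above 1, which is the signal that would motivate an observation-level random effect (or a quasi-binomial/beta-binomial model).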
I further gather that the less-well-known hypergeometric distribution is really most appropriate in cases where sampling is done without replacement, though I believe that neither lme4 nor MCMCglmm allows for this distribution, and again I have population rather than sample data.
I think this would be a lot harder ...
I have a vague idea that another possibility would be to use the logit of each row (e.g., log(560/726) for the first row), and then simply model the logit with Normal errors. But what would I do then with rows that have proportions of 0 or 1? And, setting that issue aside, is the logit in principle an appropriate way to go?
I don't think it's ridiculous, but yes, you have to deal with the 0/1 cases. This fussing-with-zeros-and-ones stuff applies even if you were to use a Beta regression (which you can do via glmmADMB, although it'll probably be a lot slower than lme4), which would in some sense be a more principled way to deal with proportion data. I like the idea of keeping the denominators in there and using a binomial model. I don't think you're missing anything obvious, though.
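[Editorial aside, not proposed in the thread: a standard workaround for the 0/1 rows under the logit-plus-Normal-errors approach is the "empirical logit", which adds 0.5 to both counts so the transform stays finite. A minimal sketch, with a hypothetical helper name:]

```r
# Empirical logit: log((s + 0.5)/(f + 0.5)) is finite even when s or f is zero
emp_logit <- function(successes, failures) log((successes + 0.5) / (failures + 0.5))
emp_logit(560, 726)   # nearly identical to log(560/726)
emp_logit(7, 0)       # finite, unlike log(7/0) = Inf
```

This only patches the transformation; it does not restore the information about denominators that the binomial model keeps.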
Can anybody suggest a way forward, and/or explain where my thinking above has gone wrong? For a final twist, the code below shows the binomial approach, where the old lme4 (now lme4.0) quickly returns seemingly sensible results, but the new lme4 (formerly lme4Eigen) returns an interesting error.
For a quick answer, try optimizer="bobyqa", and/or varying tolPwrss settings, but please be very cautious/compare your answers. We're still working on adjusting optimizer choice/settings to make GLMMs in the new version fast, robust, and accurate ... Ben
On 19/03/12 07:10, Ben Bolker wrote:
<SNIP>
head(dat)
  successes failures id sex subdist
1       560      726  1   F       4
2       844      510  1   M       4
3       340      438  2   F       4
4       616      273  2   M       4
5         7        0  3   F       4
6         3        1  3   M       4

In community #1, which is in subdistrict #4, there are 560 women
you mean 510, right?
<SNIP>
Sure looks like 560 to me. Time for a trek to the optometrist, Ben?
cheers,
Rolf
I would certainly check out a Poisson model with the number of successes as outcome and successes+failures as an offset. Joerg
On Sun, Mar 18, 2012 at 11:51 AM, Rolf Turner <r.turner at auckland.ac.nz> wrote:
On 19/03/12 07:10, Ben Bolker wrote: <SNIP>
head(dat)
  successes failures id sex subdist
1       560      726  1   F       4
2       844      510  1   M       4
3       340      438  2   F       4
4       616      273  2   M       4
5         7        0  3   F       4
6         3        1  3   M       4

In community #1, which is in subdistrict #4, there are 560 women
you mean 510, right?
<SNIP> Sure looks like 560 to me. Time for a trek to the optometrist, Ben?

cheers,
Rolf
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Joerg Luedicke <joerg.luedicke at ...> writes:
I would certainly check out a Poisson model with the number of successes as outcome and successes+failures as an offset.
That seems odd to me; the Poisson+offset model should be appropriate when p<<1 (i.e. for very small p, the Poisson variance mu=n*p is approximately the same as the binomial variance n*p*(1-p)); in this case p is not small. Of course, I could be wrong.
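[Editorial aside, illustrating Ben's point numerically (values chosen for illustration): with n = 1000 trials, the Poisson and binomial variances nearly coincide at p = 0.01 but differ by a factor of two at p = 0.5.]

```r
# Poisson variance n*p vs binomial variance n*p*(1-p), at small and moderate p
n <- 1000
for (p in c(0.01, 0.5)) {
  cat(sprintf("p = %.2f: Poisson var = %.1f, binomial var = %.1f\n",
              p, n * p, n * p * (1 - p)))
}
```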
On Sun, Mar 18, 2012 at 11:51 AM, Rolf Turner <r.turner at ...> wrote:
On 19/03/12 07:10, Ben Bolker wrote: <SNIP>
head(dat)
  successes failures id sex subdist
1       560      726  1   F       4
2       844      510  1   M       4
[snip]
In community #1, which is in subdistrict #4, there are 560 women
you mean 510, right?
<SNIP> Sure looks like 560 to me. Time for a trek to the optometrist, Ben?
Looked at the wrong line (failures in men rather than successes in women).
Joerg,

What is the point in the Poisson model in that way (assuming log(total) as offset)? This would be a binomial model; if you observe N events from a Poisson distribution of which X are successes (or marked in some way) with probability p, then X will be binomial(p, N).

Best regards,
Fredrik Nilsson

PS. Here's a practical "proof"; the real one is sketched below.

invlogit <- function(x) exp(x)/(1+exp(x))
n <- 10
nn <- ceiling(exp(rnorm(n)+3))
pp <- seq(0.25, 0.75, length=n)
ca <- rbinom(rep(1,n), nn, prob=pp)
co <- nn - ca
fa <- gl(n, 1)
# binomial model
test.glm <- glm(cbind(ca, co) ~ fa - 1, family=binomial)
# conditional poisson
ptest.glm <- glm(ca ~ fa - 1 + offset(log(nn)), family=poisson)
summary(ptest.glm)
exp(coef(ptest.glm))
invlogit(coef(test.glm))
all.equal(exp(coef(ptest.glm)), invlogit(coef(test.glm)))

# Proof. X~Poisson(a) and Y~Poisson(b), X & Y independent -> X+Y~Poisson(a+b)
# (e.g. by generating functions). Then
# Prob(X=x | X+Y=n) = Prob(X=x, Y=n-x)/Prob(X+Y=n)
#                   = Prob(X=x)*Prob(Y=n-x)/Prob(X+Y=n)
#                   = [a^x exp(-a)/x!] [b^(n-x) exp(-b)/(n-x)!] / [(a+b)^n exp(-(a+b))/n!]
#                   = n!/(x!(n-x)!) (a/(a+b))^x (b/(a+b))^(n-x)
#                   = n!/(x!(n-x)!) (a/(a+b))^x (1 - a/(a+b))^(n-x),
# i.e. binomial(x, n, p) with p = a/(a+b).

2012/3/19 Ben Bolker <bbolker at gmail.com>:
Joerg Luedicke <joerg.luedicke at ...> writes:
I would certainly check out a Poisson model with the number of successes as outcome and successes+failures as an offset.
That seems odd to me; the Poisson+offset model should be appropriate when p<<1 (i.e. for very small p, the Poisson variance mu=n*p is approximately the same as the binomial variance n*p*(1-p)); in this case p is not small. Of course, I could be wrong.
<SNIP>
Hi Ben and Fredrik, Thanks for the comments and the proof! Next time I will be less hasty in responding... Cheers, Joerg
On Mon, Mar 19, 2012 at 3:09 AM, Fredrik Nilsson <laf.nilsson at gmail.com> wrote:
<SNIP>
Joerg Luedicke <joerg.luedicke at ...> writes:
Hi Ben and Fredrik, Thanks for the comments and the proof! Next time I will be less hasty in responding... Cheers, Joerg
Don't forget that I was wrong too (I forgot about the conditioning).
I know that this (i.e. using the Poisson model with an offset to model binomial data) is a common practice; maybe it was done in the past for computational reasons?
library("fortunes")
fortune("great-great grandchildren")
cheers
Ben
1 day later
Not sure what I'm missing here, but I'm not finding the offset Poisson and binomial to be equal with my dataset.
sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
 [1] stats4    splines   tcltk     stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] lme4_0.999375-42   bbmle_1.0.4.1      numDeriv_2010.11-1 R2admb_0.7.5       Hmisc_3.9-1        survival_2.36-10   NCStats_0.2-7      sciplot_1.0-9
 [9] mgcv_1.7-13        Matrix_1.0-3       lattice_0.20-0     MASS_7.3-16        AED_1.0            circular_0.4-3     boot_1.3-4         plotrix_3.3-3

loaded via a namespace (and not attached):
 [1] car_2.0-12        cluster_1.14.1    gamm4_0.1-5       gdata_2.8.2       glmmADMB_0.7.2.5  gplots_2.10.1     grid_2.14.1       gtools_2.6.2      multcomp_1.2-9
[10] nlme_3.1-103      TeachingDemos_2.7 tools_2.14.1
#
# Compare offset Poisson with binomial
invlogit<-function(x) exp(x)/(1+exp(x))
test <- read.csv("http://dl.dropbox.com/u/23278690/test.csv", header=T)
b.glm <- glm(cbind(success,total) ~ (a+b+c)^2 - 1, family="binomial", data=test)
p.glm <- glm(success ~ (a+b+c)^2 - 1 + offset(log(total)), family="poisson", data=test)
#
exp(coef(p.glm))
         a          b          c        a:b        a:c        b:c
0.03225038 0.15195288 1.40174126 3.48192066 1.01892662 0.97212475
invlogit(coef(b.glm))
         a          b          c        a:b        a:c        b:c
0.03158208 0.13227007 0.58381656 0.77643718 0.50446378 0.49263445
all.equal(exp(coef(p.glm)), invlogit(coef(b.glm)))
[1] "Mean relative difference: 0.6428341"

Could it be related to the number of zeros?

Adam Smith
Dept. Natural Resources Science
University of Rhode Island
Sorry, the binomial model should have read:

b.glm <- glm(cbind(success, total-success) ~ (a+b+c)^2 - 1, family="binomial", data=test)

But the outcome is little changed...

Adam
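[Editorial aside, not part of the thread: one likely reason for the remaining discrepancy is that the exact coefficient identity in Fredrik's demonstration relies on the model being saturated (one parameter per covariate cell), so both fits reproduce the observed proportions exactly; with a non-saturated predictor, the log and logit links generally yield different fits. A sketch on simulated data (hypothetical covariate x, not Adam's test.csv):]

```r
# Sketch (simulated data): the Poisson/binomial coefficient identity holds for a
# saturated one-factor model, but not in general.
set.seed(42)
n  <- 40
nn <- 50 + rpois(n, 50)          # trials per row
x  <- runif(n)                   # continuous covariate -> model is NOT saturated
s  <- rbinom(n, nn, plogis(-1 + 2 * x))

b.glm <- glm(cbind(s, nn - s) ~ x, family = binomial)
p.glm <- glm(s ~ x + offset(log(nn)), family = poisson)

# logit-linear and log-linear fits no longer coincide:
agree <- isTRUE(all.equal(unname(plogis(coef(b.glm))), unname(exp(coef(p.glm)))))
agree   # FALSE

# but with one parameter per cell (saturated), both reproduce s/nn exactly:
fa <- gl(n, 1)
b2 <- glm(cbind(s, nn - s) ~ fa - 1, family = binomial)
p2 <- glm(s ~ fa - 1 + offset(log(nn)), family = poisson)
agree2 <- isTRUE(all.equal(unname(plogis(coef(b2))), unname(exp(coef(p2)))))
agree2  # TRUE
```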