Hi all,

I have run a GLM analysis where the dependent variable is gender (family = binomial) and the predictors are percentages. I get a warning saying "fitted probabilities numerically 0 or 1 occurred", which indicates that separation or quasi-separation is occurring. This makes sense, given that one of the predictors has a very influential effect that depends on a specific threshold: in my analysis this variable predicts males about 80% of the time when its value is less than or equal to zero, and females about 80% of the time when its value is greater than zero.

I have looked at other posts about this, but I haven't understood how I should act when the separation (or quasi-separation) is not a statistical artifact but something real. As suggested in http://r.789695.n4.nabble.com/OT-quasi-separation-in-a-logistic-GLM-td875726.html#a3850331 (the last post is mine), I tried the brglm procedure, which uses penalized maximum likelihood, but it made no difference.

What would you do if you were in my shoes?

Thanks in advance for any help.

Simone

--
View this message in context: http://r.789695.n4.nabble.com/binomial-GLM-quasi-separation-tp3901687p3901687.html
Sent from the R help mailing list archive at Nabble.com.
binomial GLM quasi separation
7 messages · Simone Santoro, Ben Bolker, Uwe Ligges +1 more
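[For readers landing on this thread: the warning in question is easy to reproduce with toy data. The following sketch is illustrative only, not the poster's data.]

```r
## Toy data with complete separation: y is 0 whenever x <= 0 and 1 otherwise.
## glm() converges to huge coefficients and warns
## "fitted probabilities numerically 0 or 1 occurred".
set.seed(1)
x <- c(runif(20, -1, 0), runif(20, 0, 1))
y <- rep(c(0, 1), each = 20)
m <- glm(y ~ x, family = binomial)
summary(m)  # note the enormous estimate and standard error for x
```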
lincoln <miseno77 <at> hotmail.com> writes:
<snip />
(the last post is mine) I tried to use brglm procedure that uses a penalized maximum likelihood but it made no difference.
I'm not sure what's going on here, and I don't know why brglm() shouldn't work ... from a squint at your Nabble post (I can't really see the figure very well), I agree that the hcp profile is funky, but I wouldn't immediately conclude that the profile is bad -- in particular, it seems that the x-axis range is -45 to -15, rather than something like (-600,-300) as I would expect from the estimated parameter (ca. -400) and standard error (ca. 60). I would start by setting which=3 (to confine your attention to the hcp parameter) and messing around with the gridsize, stepsize, stdn parameters in profileModel to see what's going on. If that doesn't work you might have to post data, or a subset of data, in order to get any more help ...
On 13.10.2011 21:46, Ben Bolker wrote:
<snip />
Or if just the separating hyperplane is to be found (and no tests have to be considered), I'd use an lda rather than logistic regression in such a case. Uwe Ligges
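[For reference, the LDA Uwe suggests would look something like the sketch below, assuming the MASS package and the model terms used elsewhere in this thread. Note that lda() assumes (multivariate) normally distributed predictors within each class.]

```r
library(MASS)  # for lda()

## Same formula as the glm() fit in this thread; 'dati' is the poster's data frame
fit.lda <- lda(sex ~ twp + hwp + hcp + hnp, data = dati)

fit.lda$scaling               # coefficients defining the separating hyperplane
head(predict(fit.lda)$class)  # predicted classes
```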
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
As you suggested, I had a further look at the profile by changing the default value of stepsize (I tried to modify the others, but there was apparently no change). Here are the scripts I used:
library(profileModel)
dati<-read.table("simone.txt",header=T,sep="\t",as.is=T)
glm.sat<-glm(sex~twp+hwp+hcp+hnp,binomial,data=dati)
Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred
pp<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4)
Preliminary iteration . Done Profiling for parameter hcp ... Done
pp1<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4,stepsize=20)
Preliminary iteration . Done Profiling for parameter hcp ... Done
pp2<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4,stepsize=100)
Preliminary iteration . Done Profiling for parameter hcp ... Done
plot(pp)
mtext("Default stepsize",adj=0,cex=2,line=1)
dev.new()
plot(pp1)
mtext("Stepsize=20",adj=0,cex=2,line=1)
dev.new()
plot(pp2)
mtext("Stepsize=100",adj=0,cex=2,line=1)
These are the resulting plots:
http://r.789695.n4.nabble.com/file/n3904261/plot1.png
http://r.789695.n4.nabble.com/file/n3904261/plot2.png
http://r.789695.n4.nabble.com/file/n3904261/plot3.png

I have tried to understand what is going on, but I don't know how to interpret this. I have been trying to solve this for quite a long time, without success. Here I attach a subset of the data I am working with ( http://r.789695.n4.nabble.com/file/n3904261/simone.txt ); it comprises the variables specified in the above glm model, including the "funky" variable called "hcp".

Thank you for taking the time to help me with this.
On Fri, 2011-10-14 at 02:32 -0700, lincoln wrote:
As you suggested, I had a further look at the profile by changing the default value of stepsize (I tried to modify the others, but there was apparently no change).
Have you read ?glm, specifically this bit:
Details:
....
For the background to warning messages about 'fitted probabilities
numerically 0 or 1 occurred' for binomial GLMs, see Venables &
Ripley (2002, pp. 197-8).
There, V&R say (me paraphrasing) that if there are some large fitted
\beta_i the curvature of the log-likelihood at the fitted \beta can be
much less than at \beta_i = 0. The Wald approximation underestimates the
change in the LL on setting \beta_i = 0. As the absolute value of the
fitted \beta_i becomes large (tends to infinity) the t statistic tends
to 0. This is the Hauck Donner effect.
Whilst I am (so very) far from being an expert here - this does seem to
fit the results you showed.
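[The Hauck-Donner effect described above can be seen in a small simulation; this is a sketch on made-up data, not the poster's. As the true effect grows and the groups become nearly separated, the estimate and its standard error blow up together, so the Wald z statistic can shrink even while the likelihood-ratio test remains decisive:]

```r
## Wald z vs. LRT as a binary outcome approaches separation on x
set.seed(42)
x <- rnorm(200)
for (b in c(1, 5, 20)) {   # increasing true slope -> near-separation
  y  <- rbinom(200, 1, plogis(b * x))
  m1 <- glm(y ~ x, family = binomial)   # may warn about fitted probs 0/1
  m0 <- glm(y ~ 1, family = binomial)
  wald.z <- coef(summary(m1))["x", "z value"]
  lrt.p  <- anova(m0, m1, test = "Chisq")$`Pr(>Chi)`[2]
  cat(sprintf("b = %2d: beta.hat = %9.2f, Wald z = %6.2f, LRT p = %g\n",
              b, coef(m1)["x"], wald.z, lrt.p))
}
```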
Furthermore, did you follow the steps Ioannis Kosmidis took me through
with my data in that email thread? I have with your data and everything
seems to follow the explanation/situation given by Ioannis. Namely, if
you increase the maximum number of iterations and tighten the convergence
tolerance in the glm() call, you get the same fit as with a standard
glm() call:
## normal glm()
summary(mod)
Call:
glm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9703 -0.1760 0.3181 0.6061 3.5235
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4362 0.2687 5.345 9.02e-08 ***
twp 5.5673 1.3602 4.093 4.26e-05 ***
hwp -4.2793 2.3781 -1.799 0.0719 .
hcp -450.1918 56.6823 -7.942 1.98e-15 ***
hnp -4.5302 3.2825 -1.380 0.1676
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 581.51 on 425 degrees of freedom
Residual deviance: 294.00 on 421 degrees of freedom
(41 observations deleted due to missingness)
AIC: 304
Number of Fisher Scoring iterations: 8
## now with control = glm.control(epsilon = 1e-16, maxit = 1000)
## in the call
summary(mod3)
Call:
glm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
data = dat, control = glm.control(epsilon = 1e-16, maxit = 1000))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9703 -0.1760 0.3181 0.6061 3.5235
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4362 0.2687 5.345 9.02e-08 ***
twp 5.5673 1.3602 4.093 4.26e-05 ***
hwp -4.2793 2.3781 -1.799 0.0719 .
hcp -450.1918 56.6823 -7.942 1.98e-15 ***
hnp -4.5302 3.2825 -1.380 0.1676
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 581.51 on 425 degrees of freedom
Residual deviance: 294.00 on 421 degrees of freedom
(41 observations deleted due to missingness)
AIC: 304
Number of Fisher Scoring iterations: 9
From the thread you linked to, if this were separation we would expect the estimates to diverge as the iterations increase; here they do not. Profiling the model shows that the LL starts to increase again at low values, but does so slowly. The LL is very flat around the estimates and is far from 0, which seems to correspond with the description of the Hauck-Donner effect given by Venables and Ripley in their book. In your case, however, the statistic is still sufficiently large for it to be identified as significant via the Wald test. If we fit the model via brglm() we get essentially the same "result" as fitted by glm():
summary(mod2)
Call:
brglm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4262 0.2662 5.357 8.47e-08 ***
twp 5.3696 1.3323 4.030 5.57e-05 ***
hwp -4.2813 2.3504 -1.821 0.0685 .
hcp -435.9212 55.0566 -7.918 2.42e-15 ***
hnp -4.6295 3.2459 -1.426 0.1538
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 560.48 on 425 degrees of freedom
Residual deviance: 294.08 on 421 degrees of freedom
Penalized deviance: 302.8212
(41 observations deleted due to missingness)
AIC: 304.08
Ioannis points out that if there were separation, the brglm results
would differ markedly from the glm() ones, and they don't.
As Ioannis mentioned in that thread, you can get the warning about the
fitted probabilities without a separation problem - In my case it was
because there were some very small fitted probabilities, which R was
just warning about.
HTH
G
<snip />
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~% Dr. Gavin Simpson [t] +44 (0)20 7679 0522 ECRC, UCL Geography, [f] +44 (0)20 7679 0565 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/ UK. WC1E 6BT. [w] http://www.freshwaters.org.uk %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
1 day later
#Uwe: I realized that in the post I first linked to ( http://r.789695.n4.nabble.com/OT-quasi-separation-in-a-logistic-GLM-td875726.html#a3850331 ) I said something misleading: my independent variables are in fact not log-normally distributed, since there are a lot of zeros, which are the most frequent values. I have not been able to normalize them, and I don't even know whether it is possible. Because of the assumption of normally distributed predictors, I believe I can't use an LDA.

#Gavin: I have read your thread carefully, but I am not sure I understand what you are suggesting (my gaps in statistics!). You say that it should be due to the /Hauck-Donner/ effect and that it is not a separation or quasi-separation issue. Even so, I am still unsure why I found such a high asymptotic standard error. Anyway, how should I treat this result? Should I find another way to analyze this process, or can I consider it correct? If I understand correctly, the warning message and the very high estimates should be due to /Hauck-Donner/. Regarding that reference to Venables and Ripley (2002) on this issue, I found this ( http://kups.ku.edu/maillist/classes/ps707/2005/msg00023.html ), where it is said that "The practical advice, then, is to run the model with all of the variables, and then run again with the questionable one removed, and conduct a likelihood ratio test", and I suppose that the p-value for hcp should then be the LRT p-value, is that right?

Thanks for taking the time to help me with this.

Simone
1 day later
On Sat, 2011-10-15 at 09:11 -0700, lincoln wrote:
#Uwe:
<snip />
#Gavin: I have read your thread carefully, but I am not sure I understand what you are suggesting (my gaps in statistics!). You say that it should be due to the /Hauck-Donner/ effect and that it is not a separation or quasi-separation issue. Even so, I am still unsure why I found such a high asymptotic standard error.
I don't believe this is a separation issue - the sorts of things we'd expect to see if this were due to separation do not show up. Given the large estimate for the coefficient for the term it is not that surprising that the associated uncertainty is also high:
set.seed(2)
var(runif(100))
[1] 0.08911998
set.seed(2)
var(runif(100) * 10000)
[1] 8911998

All I did there was increase the "units" of the data in the second example, and the variance is huge, but only because the data were expressed in units 10000 times bigger than in the first example. In the same way, the coefficient estimate is large, so its standard error is also large; the question one needs to ask is: is the estimate of the coefficient for hcp bounded away from zero, given the uncertainty in the estimate? If you were to produce a profile confidence interval, it too would be large. So you have a large estimate, which is somewhat uncertain. Given that the slope of the log likelihood is low at the estimate and quite different from the slope at \beta == 0, it is not unreasonable to assume that the Hauck-Donner effect might be present...
Anyway, how should I consider this result? Should I find another way to analyze this process or I could consider this as correct?
...however, in the case of the snippet of data you showed, it doesn't affect the result - on the basis of the Wald test you would still accept that hcp is significant/important. The Hauck Donner effect might be leading to a lower value of the test statistic, but it hasn't affected the outcome of the test. To check, fit the model with and without hcp and then use the anova() function to compare the two models. This will do a likelihood ratio test.
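[In code, the comparison described above would look something like this sketch, using the variable names from this thread; update() refits the model with everything else unchanged. If hcp has missing values, subset the data first so both fits use exactly the same rows.]

```r
## Full model, and the same model with hcp removed
glm.full <- glm(sex ~ twp + hwp + hcp + hnp, family = binomial, data = dati)
glm.red  <- update(glm.full, . ~ . - hcp)

## Likelihood ratio test for hcp
anova(glm.red, glm.full, test = "Chisq")
```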
If I am understanding this enough, this warning message
Possibly, but it could just be that the fitted probabilities really are 0 or 1.
and the very high estimates should be due to /Hauck-Donner/. Regarding that reference to Venables and Ripley (2002) on this issue, I have found this ( http://kups.ku.edu/maillist/classes/ps707/2005/msg00023.html ) where it is said that "The practical advice, then, is to run the model with all of the variables, and then run again with the questionable one removed, and conduct a likelihood ratio test", and I suppose that the p-value for hcp should be the LRT p-value, is that right?
Yes. Well it is the result of applying a likelihood ratio test. I don't think there is such a thing as *the* p-value for a term in a model, just different ways of computing *a* p-value. In this case, what does it matter? If the Wald test is *under*estimating z but the term *is* still significant, the LRT should only confirm this and give an even lower p-value than the already very low one.
Thanks for taking your time to help me in this.
Would it hurt you to reply via an email? Regardless of what Nabble thinks, R-help is a mailing list and your *posts* keep on removing all the context - I have to keep on hunting for the thread in the archives just to keep track of what you have told us. G
Simone