Model validation for Presence / Absence, (binomial) GLMs
The current CRAN package of lme4 (0.999999-2), running under R 3, says that VarCorr returns an (S4) object of class VarCorr.
The function does not seem to do that; it still returns a list, as in earlier versions of lme4.
Have I missed something, or has the help file got a bit ahead of the code?
Steve Ellison
LGC
Highland Statistics Ltd <highstat at ...> writes:
This is something I always battle with, given the plethora of great model-fitting methods available for other models. I always use a variant of Hugh's suggestion and look at the % of correct predictions between models as a quick model-fitting statistic. And for overdispersion, I believe one way is to fit individual-level random effects and see if this is a substantively better model. There is more on this in the wiki: http://glmm.wikidot.com/faq
Yes, but this is unidentifiable for Bernoulli responses (as also explained there).
The statement on 'unidentifiable for Bernoulli responses'... well... apparently this is not that trivial. See http://www.highstat.com/BGGLM.htm, follow the link to the Discussion Board, go to "Chapter 1: Introduction to generalized linear models", and see the topic "Can binary logistic models be overdispersed?"
Alain
That's an interesting document. I think the bottom line is: if the Bernoulli data can be grouped, i.e. if there are in general multiple observations with the same set of covariates, then overdispersion can be identified, because the data are really equivalent to a binomial response within the groups. For example, the trivial example

grp resp
A   1
A   0
A   1
B   0
B   0
B   1

is equivalent to:

grp successes total
A   2         3
B   1         3

This sort of aggregation can be done with plyr::ddply, or in lots of different ways; it's generally faster and more efficient to fit the data in the grouped form than in the binary form. It might have been more convenient for readers to be able to go directly to the link http://www.highstat.com/BGS/GLMGLMM/pdfs/HILBE-Can_binary_logistic_models_be_overdispersed2Jul2013.pdf rather than having to click five times through the Highland Statistics material to the PDF of interest.
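For readers who don't use plyr, the aggregation Ben describes can be sketched in base R; the data frame and column names below simply mirror the toy example above.

```r
## Collapse Bernoulli rows into a grouped binomial response (a base-R
## alternative to plyr::ddply), using the toy data from the example above.
dat <- data.frame(grp  = c("A", "A", "A", "B", "B", "B"),
                  resp = c(1, 0, 1, 0, 0, 1))
agg <- aggregate(resp ~ grp, data = dat,
                 FUN = function(x) c(successes = sum(x), total = length(x)))
## aggregate() returns the two summaries as a matrix column; flatten it
grouped <- data.frame(grp = agg$grp, agg$resp)
grouped
##   grp successes total
## 1   A         2     3
## 2   B         1     3
```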
S Ellison <S.Ellison at ...> writes:
The current CRAN package of lme4 (0.999999-2), running under R 3, says that VarCorr returns an (S4) object of class VarCorr. The function does not seem to do that; it still returns a list, as in earlier versions of lme4. Have I missed something, or has the help file got a bit ahead of the code?
This does look like a documentation typo. It will be fixed in lme4.0 on R-forge; we probably won't release another version of lme4 to CRAN before the 1.0 release. Thanks, Ben Bolker
Ben Bolker <bbolker at ...> writes:
[...]
That's an interesting document: I think the bottom line is: if the Bernoulli data can be grouped, i.e. if there are in general multiple observations with the same set of covariates, then overdispersion can be identified, because the data are really equivalent to a binomial response within the groups. For example, the trivial example

grp resp
A   1
A   0
A   1
B   0
B   0
B   1

is equivalent to:

grp successes total
A   2         3
B   1         3
Agreed that this is very interesting, but still a bit mysterious, as everything looks the same on the surface. The likelihoods only differ by the log of the binomial coefficients, as can easily be verified on Ben's example above and as expected from the likelihood equations:
Binary <- read.table(
    textConnection("grp resp
A 1
A 0
A 1
B 0
B 0
B 1"), header = TRUE)
Grouped <- read.table(
    textConnection("grp successes total
A 2 3
B 1 3"), header = TRUE)
## grouped minus binary log-likelihood ...
-logLik(glm(resp ~ grp, binomial, Binary)) +
    logLik(glm(cbind(successes, total - successes) ~ grp, binomial, Grouped))
## ... equals the sum of the log binomial coefficients
with(Grouped, sum(log(choose(total, successes))))
However, looking at the outputs of the glm, the residual degrees of freedom differ: 4 for the binary responses and 0 for the binomial response. Should degrees of freedom really be computed differently in the two cases, since it is easy to transform the two cases back and forth? And, if so, what does that mean?
Ken
Kenneth Knoblauch
Inserm U846, Stem-cell and Brain Research Institute
Department of Integrative Neurosciences
18 avenue du Doyen Lépine, 69500 Bron, France
tel: +33 (0)4 72 91 34 77  fax: +33 (0)4 72 91 34 61  portable: +33 (0)6 84 10 64 10
http://www.sbri.fr/members/kenneth-knoblauch.html
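Ken's two numbers are easy to check directly; a minimal sketch, constructing the same toy data inline rather than via read.table:

```r
## Verify the residual degrees of freedom (6 - 2 = 4 for the binary fit,
## 2 - 2 = 0 for the grouped fit) and the log-likelihood identity.
binary  <- data.frame(grp  = rep(c("A", "B"), each = 3),
                      resp = c(1, 0, 1, 0, 0, 1))
grouped <- data.frame(grp = c("A", "B"), successes = c(2, 1), total = c(3, 3))

fit_bin <- glm(resp ~ grp, family = binomial, data = binary)
fit_grp <- glm(cbind(successes, total - successes) ~ grp,
               family = binomial, data = grouped)

df.residual(fit_bin)  # 4
df.residual(fit_grp)  # 0

## grouped and binary log-likelihoods differ by the log binomial coefficients
diff_ll <- as.numeric(logLik(fit_grp)) - as.numeric(logLik(fit_bin))
all.equal(diff_ll, with(grouped, sum(log(choose(total, successes)))))  # TRUE
```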
On 05/07/2013 12:48 PM, Ken Knoblauch wrote:
[...]
However, looking at the outputs of the glm, the degrees of freedom differ, being 4 on the binary responses and 0 for the binomial response. Should degrees of freedom really be computed differently in the two cases, since it is easy to transform the two cases back and forth? And, if so, what does that mean?
The issue of degrees of freedom seems similar in assessing goodness of fit for logistic regression, which can be done with grouped data (or by defining bins as in Hosmer's test, which is based on the same idea of pooling data). I would have to look back at the references below to see how these goodness-of-fit tests are affected by overdispersion.

Hosmer et al. The importance of assessing the fit of logistic regression models: a case study. American Journal of Public Health. December 1991;81(12).
Hosmer DW, Lemeshow S. A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics. 1980;A10:1043-1069.
Lemeshow S, Hosmer DW. A review of goodness-of-fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982;115:92-106.
Hosmer DW, Lemeshow S, Klar J. Goodness-of-fit testing for multiple logistic regression analysis when the estimated probabilities are small. Biometrical Journal. 1988;30(7):1-14.

Gabriel
---------------------------------------------------------------------
Gabriel Baud-Bovy
UHSR University
via Olgettina, 58
20132 Milan, Italy
tel.: (+39) 02 2643 4839 (office)
      (+39) 02 2643 3429 (laboratory)
      (+39) 02 2643 4891 (secretary)
fax:  (+39) 02 2643 4892
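The binning idea behind Hosmer's test that Gabriel mentions can be sketched in a few lines. This is an illustrative toy implementation only (the function name hoslem_sketch is invented here; a tested implementation such as ResourceSelection::hoslem.test should be preferred for real work):

```r
## Toy sketch of the Hosmer-Lemeshow statistic: bin fitted probabilities
## into roughly g equal-size groups, then compare observed and expected
## successes with a chi-square statistic on g - 2 degrees of freedom.
hoslem_sketch <- function(y, phat, g = 10) {
  breaks <- unique(quantile(phat, probs = seq(0, 1, length.out = g + 1)))
  bin  <- cut(phat, breaks = breaks, include.lowest = TRUE)
  obs  <- tapply(y,    bin, sum)    # observed successes per bin
  expd <- tapply(phat, bin, sum)    # expected successes per bin
  n    <- tapply(y,    bin, length)
  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n)))
  df   <- nlevels(bin) - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

## Illustrative use on simulated Bernoulli data
set.seed(1)
x   <- rnorm(200)
y   <- rbinom(200, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)
hoslem_sketch(y, fitted(fit))
```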
This does look like a documentation typo. Will be fixed in lme4.0 on R-forge; we probably won't release another version of lme4 to CRAN before the 1.0 release.
No problem; the present format is reasonably easily deciphered, especially if you have an older version's help file about.
An aside, though: I found it because I was trying to get numbers out of VarCorr for further use (specifically, I needed the SDs rather than the variances), and made sure I was getting the right data by comparing with the summary output.
Which leads me to a suggestion. All the numbers (at least for my model) are packaged much more tidily in the formatted table of random effects in summary() for lmer objects. That table, though, is apparently in character format. Wouldn't it be more useful to create it as a numerical data frame and use the default methods for data-frame output to display it, or to format at print time? That way the summary object would hold the numbers, and we could get the random-effect variances or SDs easily from the summary object.
Steve E
On 13-07-05 09:20 AM, S Ellison wrote:
[...] Wouldn't it be more useful to create that as a numerical data frame and use the default methods for data frame output to display it, or to format at print time? That way the summary object would hold the numbers, and we could get the random-effect variances or SDs easily from the summary object.
Maybe. I'd say we (or I, at least) are still thinking about the best way to return var-corr information in the most useful/general way possible. I had some ideas for this (see vcconv.R in the source code if you're interested), but I don't think they're going to make it in for release 1.0. The main thing about exporting the var-corr info (the reason it's done as character()) is that for more complex structures there are often empty correlation cells. Probably the quickest solution to this would be to write an as.data.frame.VarCorr.merMod (yuck!) method based on converting lme4::formatVC() output to numeric. If you want to post this as an issue on GitHub, that would be fine.
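In the meantime, a user-level workaround along the lines Steve asks for can be sketched. The helper name vc_frame is invented here, and this assumes a fitted merMod object; it bypasses the formatted character table entirely by reading the covariance matrices that VarCorr() returns:

```r
## Hypothetical helper: pull numeric variances and SDs out of VarCorr()
## into a plain data frame, one row per random-effect term.
## (The residual variance is not in this list; for lmer fits it is
## stored separately as attr(VarCorr(fit), "sc").)
vc_frame <- function(fit) {
  vc <- lme4::VarCorr(fit)          # list of covariance matrices, by group
  do.call(rbind, lapply(names(vc), function(g) {
    m <- as.matrix(vc[[g]])
    data.frame(grp = g, term = rownames(m),
               variance = diag(m), sd = sqrt(diag(m)),
               row.names = NULL)
  }))
}
## e.g., after fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy),
## vc_frame(fm) gives numeric 'variance' and 'sd' columns directly.
```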