Specifying outcome variable in binomial glmm: single responses vs cbind? - R-SIG-mixed-models

Mon, Jul 4, 2016 11:11 AM

Hi Ben,
This thread is relevant in this regard:
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2015q4/024241.html
At least on my machine, I found a substantial difference in the parameter
estimates. The second form seemed more reliable than the first, as you'll
see from the thread.
Do you get the same result?
Best wishes,
Malcolm



Date: Sat, 2 Jul 2016 13:06:30 -0400

From: Ben Bolker <bbolker at gmail.com>
To: r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] Specifying outcome variable in binomial glmm:
        single responses vs cbind?



On 16-07-01 07:37 PM, a y wrote:

What is the difference between fitting a binomial glmm (without random item
effects) in the following two ways?

1.
Data formatted in the following way:

(data_long)
ID    correct    condition    itemID
1     1          A            i1
1     0          A            i2
1     1          A            i3
1     1          A            i4
2     0          B            i1
2     1          B            i2
2     1          B            i3
2     0          B            i4

Fitting a model without item random effects:

glmer(correct ~ condition + (1|ID), family = binomial, data = data_long)


2.
Data formatted this way (summing over the correct responses):

(data_short)
ID    sum_correct    condition    itemID
1     3              A            NA
2     2              B            NA

Fitting the following model, assuming there were only 4 items (I've seen
dozens of examples like this):
glmer(cbind(sum_correct, 4 - sum_correct) ~ condition + (1|ID),
      family = binomial, data = data_short)

---
I figured these models should be identical, but in my experience they are
very much not. What am I missing? When is the second (more) appropriate?

Thanks for any help,
Andrew

  I believe they should give different likelihoods but identical
parameter estimates, identical *differences* among likelihoods (i.e. among
competing models fitted to the same data), etc.  That is, the
log-likelihoods of the aggregated and disaggregated fits should differ
only by an additive constant. I would be very interested to see a
counter-example to that statement!  In general, the second form should be
quicker to fit, provide residuals that are easier to interpret, etc.

Ben Bolker
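
The additive-constant claim can be checked numerically in the simpler case with no random effects. The following is only a sketch in base R using plain glm(), so it says nothing about how glmer() behaves; it uses the toy data from the question, and the object names (fit_long, fit_short, n_items, const) are made up for illustration.

```r
# Toy data copied from the data_long table in the question.
data_long <- data.frame(
  ID        = factor(rep(1:2, each = 4)),
  correct   = c(1, 0, 1, 1, 0, 1, 1, 0),
  condition = factor(rep(c("A", "B"), each = 4)),
  itemID    = rep(paste0("i", 1:4), times = 2)
)

# Aggregate to one row per subject (the data_short layout).
data_short <- aggregate(correct ~ ID + condition, data = data_long, FUN = sum)
names(data_short)[names(data_short) == "correct"] <- "sum_correct"
n_items <- 4  # assumption: every subject saw exactly 4 items
data_short$sum_incorrect <- n_items - data_short$sum_correct

# Fixed-effects-only fits (NOT mixed models): binary vs. aggregated response.
fit_long  <- glm(correct ~ condition, family = binomial, data = data_long)
fit_short <- glm(cbind(sum_correct, sum_incorrect) ~ condition,
                 family = binomial, data = data_short)

# Compare the coefficient estimates of the two fits.
all.equal(coef(fit_long), coef(fit_short), tolerance = 1e-6)

# The aggregated log-likelihood carries the log binomial coefficients,
# sum(log(choose(n, k))), which do not depend on the parameters.
const <- sum(lchoose(n_items, data_short$sum_correct))
ll_diff <- as.numeric(logLik(fit_short)) - as.numeric(logLik(fit_long))
all.equal(ll_diff, const, tolerance = 1e-6)
```

Because the constant does not involve the parameters, it cancels when comparing competing models fitted to the same data in the same format.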

Mon, Jul 4, 2016 1:10 PM

Really interesting (and somewhat disconcerting).

  Running it with glmmTMB (which uses Laplace!) gives different results
from glmer with nAGQ=1 -- suggesting some issue not just with Laplace,
but with lme4's implementation thereof?? (I don't think the problem is
an optimization failure ...)
   It makes *some* sense that Gauss-Hermite quadrature would be useful
for this case (since binary data are far from satisfying a Normality
assumption), but that doesn't necessarily hold up to scrutiny, since what
needs to be approximately Normal is not the likelihood per point but
the likelihood per conditional mode [which should be the same, up to a
constant, for the aggregated and disaggregated data ...]

  Doug Bates, if you're reading, would you be willing to try this out
with MixedModels.jl ... ?

  Ben Bolker
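
For anyone who wants to try the comparison themselves, here is a rough sketch of how one might fit both data formats with glmer() under the Laplace approximation (nAGQ = 1) and under adaptive Gauss-Hermite quadrature (nAGQ = 10). Everything here is an assumption made for illustration: the data are simulated rather than taken from the earlier thread, the parameter values (0.5, 0.8, sd = 1), the sample size and the helper fit_pair() are arbitrary, and no particular result is claimed.

```r
library(lme4)

set.seed(1)
n_subj  <- 100  # arbitrary number of subjects
n_items <- 4    # items per subject, as in the question

# One row per subject: between-subject condition and a random intercept u.
subj <- data.frame(
  ID        = factor(seq_len(n_subj)),
  condition = factor(rep(c("A", "B"), length.out = n_subj)),
  u         = rnorm(n_subj, sd = 1)
)

# Long format: one binary response per subject-by-item combination.
data_long <- subj[rep(seq_len(n_subj), each = n_items), ]
eta <- 0.5 + 0.8 * (data_long$condition == "B") + data_long$u
data_long$correct <- rbinom(nrow(data_long), size = 1, prob = plogis(eta))

# Short format: successes and failures per subject.
data_short <- aggregate(correct ~ ID + condition, data = data_long, FUN = sum)
names(data_short)[names(data_short) == "correct"] <- "sum_correct"
data_short$sum_incorrect <- n_items - data_short$sum_correct

# Hypothetical helper: fit both formats with the same nAGQ and
# return the fixed-effect estimates side by side.
fit_pair <- function(nAGQ) {
  long  <- glmer(correct ~ condition + (1 | ID),
                 family = binomial, data = data_long, nAGQ = nAGQ)
  short <- glmer(cbind(sum_correct, sum_incorrect) ~ condition + (1 | ID),
                 family = binomial, data = data_short, nAGQ = nAGQ)
  rbind(long = fixef(long), short = fixef(short))
}

est_laplace <- fit_pair(1)   # Laplace approximation
est_agq     <- fit_pair(10)  # adaptive Gauss-Hermite, 10 quadrature points

est_laplace
est_agq
```

If the two rows of est_laplace disagree noticeably while the rows of est_agq agree, that would point toward the approximation rather than the data format.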
