
A mixed effects model for variances of binomial probability as a response

6 messages · nimrod.rubinstein, Ramon Diaz-Uriarte

#
Dear Nimrod,

Just my 2 cents (even less than 2):
nimrod.rubinstein <nimrod.rubinstein at gmail.com> wrote:
But, as you say in point 1. below (and in another reply), if this is really
binomial, then you already know that if probabilities in G1 are different
from probabilities in G2 your variances will differ.

So I am not sure I even understand this hypothesis/question.
That last issue seems very problematic to me. If I understand correctly, we
could think of a perverse situation where, say, you can never observe low
success probabilities because they only occur when the number of trials is
near 0.


Can you do a simple plot of the probability of success (or its logit) vs.
the number of trials, to rule out perverse scenarios? Is the probability of
success always larger than 0? And is the number of trials always
sufficiently different from 0?


And you say "varies by subject", so the number of trials is always the same
for a subject? And the potential relationship prob <-> number of trials is
due to among-subject variation? So you could approach the problem via a
random effect for subject? If you compare G1 and G2 for number of trials:
are there differences?
I do not see how this would answer your questions, given the above
concerns/issues and other possible approaches (just compare prob. between
G1 and G2, accounting for environment).


Best,


R.

#
Hi Ramon (and others reading this),

Thanks for the response.

I'm attaching my real data - it has these columns:
subject - subject name (S1, S2,...)
group - group name (G1, G2) - a subject can only belong to either of the
two groups
environment - environment name (E1, E2, E3).
the next 27 columns (named n.1-n.27) are the number of trials in each
replicate
the following 27 columns (named k.1-k.27) are the number of successes out
of the number of trials in each corresponding replicate (so k.1 is the
number of success out of n.1 trials, and so on).

Note that there are NA values in corresponding n and k cells in the table.
Also, not all subjects were sampled from all 3 environments due to
technical problems, so the data are not fully balanced.
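For what it's worth, a wide table like that can be melted to one row per replicate in base R (the `n.1`/`k.1` column names follow the description above; the data frame here is a toy stand-in for the real file):

```r
# Toy stand-in for the wide table described above: paired n.*/k.* columns,
# with NA marking replicates that were not measured.
df.wide <- data.frame(subject = c("S1", "S2"), group = "G1", environment = "E1",
                      n.1 = c(10, 20), n.2 = c(15, NA),
                      k.1 = c(3, 12),  k.2 = c(5, NA))
n.cols <- grep("^n\\.", names(df.wide), value = TRUE)
k.cols <- grep("^k\\.", names(df.wide), value = TRUE)
df.long <- reshape(df.wide, direction = "long",
                   varying = list(n.cols, k.cols), v.names = c("n", "k"),
                   timevar = "replicate")
df.long <- df.long[!is.na(df.long$n), ]   # drop unmeasured replicates
```

The NA pairs then simply disappear, so the imbalance is handled by having fewer rows for some subjects rather than by missing cells.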

I know this table is a bit different from my earlier description of the
data. That is only because I wanted to keep it simple.

Let me try and clarify, and hopefully all of the below will answer your
questions.

What may affect the variances of the success probabilities (proportions)?
1. The proportion itself (natural to binomially distributed data). See
attached Fig1 that plots sample variances of success probabilities vs.
their sample mean.
2. The number of trials in each replicate. As I said, the number of trials
is inherent to the subject and the environment it was measured in; the
experimenter has no control over it. As the attached Fig2 shows, subjects
with a high average number of trials have lower sample variances of their
success probabilities. The opposite, however, is not clear: the sample
variances of subjects with a low average number of trials vary greatly. So
the number of trials obviously has an effect on the variance of the
proportion, but there is still a sizable fraction of variance (of the
proportion variances) that is not explained by it.
3. The group - that's the effect I'm interested in. Fig1 and Fig2 show that
the G1 dots have lower sample variances of success probabilities than G2
dots.
4. The environment. These environments were chosen at random and therefore
I think they should be random effects (REs). (In the attached figures the
environment is not indicated.)
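The per-subject summaries behind Fig1 and Fig2 can be computed along these lines (a sketch using a toy long-format data frame with one row per replicate; the real data would be used instead):

```r
# Toy long-format data: one row per replicate, columns subject, n, k.
df.long <- data.frame(subject = rep(c("S1", "S2"), each = 3),
                      n = c(100, 120, 90, 10, 12, 8),
                      k = c(30, 40, 28, 2, 7, 1))
df.long$p <- df.long$k / df.long$n      # success proportion per replicate

mean.p <- tapply(df.long$p, df.long$subject, mean)   # x-axis of Fig1
var.p  <- tapply(df.long$p, df.long$subject, var)    # y-axis of Fig1 and Fig2
mean.n <- tapply(df.long$n, df.long$subject, mean)   # x-axis of Fig2

op <- par(mfrow = c(1, 2))
plot(mean.p, var.p, xlab = "mean proportion", ylab = "variance of proportion")
plot(mean.n, var.p, log = "x",
     xlab = "mean number of trials", ylab = "variance of proportion")
par(op)
```

In this toy example the low-trial subject (S2) comes out with the larger sample variance of its proportions, which is the pattern Fig2 describes.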

I've also attached a figure of the success probabilities vs. the number of
trials (Fig3).

I think the figures show pretty clearly that the number of trials, the
success probability, and the group all affect the variance of the success
probability, and therefore the model I'm looking for should estimate all of
these effects.


Thanks a lot,
Nimrod
On Thu, Mar 27, 2014 at 4:39 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com> wrote:

-------------- next part --------------
Non-text attachments were scrubbed from this message:
Fig1.png (image/png, 6661 bytes): <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20140327/2f26ba93/attachment-0003.png>
Fig2.png (image/png, 7407 bytes): <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20140327/2f26ba93/attachment-0004.png>
Fig3.png (image/png, 16521 bytes): <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20140327/2f26ba93/attachment-0005.png>
#
nimrod.rubinstein <nimrod.rubinstein at gmail.com> wrote:
But what are those variances? How are you computing them? (See below, though)
Yes, I think the scenario I was concerned about is not here. Good.
It is way too late here, but maybe we are using the same name (variance)
for two different things: the variance of an estimated p, which is just a
simple function of the proportion p itself (and the number of trials), and
the variance in the collection of p's, where each p is not necessarily
estimating the same probability. I think you care about the second problem,
something like saying that individuals in G1 and G2 have different
dispersion in their p's. So you have a distribution for the p's of G1 and a
distribution for the p's of G2, and you want to know whether those
distributions have the same variance, while also possibly controlling for
other factors. I think the book by Gelman and Hill has some examples of
this kind (I do not have it here now, though).
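The distinction can be made concrete with a small simulation (all numbers illustrative): each subject has its own true p, so the spread of the observed p-hats mixes binomial sampling noise with genuine between-subject dispersion.

```r
# Two meanings of "variance":
# (a) the binomial sampling variance of one estimate, p * (1 - p) / n;
# (b) the dispersion of the subjects' true p's.
set.seed(1)
n.subj   <- 500
n.trials <- 1000
true.p <- plogis(rnorm(n.subj, mean = 0, sd = 0.5))   # between-subject spread (b)
k      <- rbinom(n.subj, size = n.trials, prob = true.p)
p.hat  <- k / n.trials

0.25 / n.trials   # (a) at p = 0.5 this is exactly 2.5e-4, small for large n
var(p.hat)        # dominated by the spread of true.p, far larger than (a)
```

With trial counts this large, almost all of `var(p.hat)` is the between-subject dispersion (b), which is the quantity being compared between G1 and G2.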



Best,


R.

#
I'm reattaching my data as an RData object. Here's how it looks:
subject group environment     n     k
1      S1    G1          E1  1419  1169
2      S1    G1          E2  5992   734
3      S1    G1          E3  5133   519
4      S2    G1          E1 23187 16038
5      S2    G1          E2 93255 46592
6      S2    G1          E3 35382 24486

This is probably much easier to work with now.

After defining the response as:
y = cbind(df$k, df$n - df$k)
(following "The R Book" by Michael Crawley, Ch. 16)

I tried fitting this model:
model.null = glmer(y ~ log(n) + (1 | subject) + (1 | environment),
                   data = df, family = binomial(link = "logit"))

but I'm getting these warning messages:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0937283 (tol = 0.001)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

Are there any parameters that can be tuned in order to achieve convergence?

Also, what should I rescale? Taking log(n) instead of n as a fixed effect
removed this warning:
1: In checkScaleX(X, ctrl = control) :
  Some predictor variables are on very different scales: consider rescaling

Any idea what else I should rescale?
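Two common lme4 remedies worth trying here (a sketch, with `df` and `y` as defined earlier in this message): standardize the continuous predictor rather than just logging it, and switch the optimizer while allowing more function evaluations.

```r
library(lme4)

# 1. Center and scale the logged trial counts; the "very large eigenvalue"
#    warning usually points at predictor scaling.
df$log.n.s <- as.numeric(scale(log(df$n)))

# 2. Try another optimizer (bobyqa) and raise the evaluation limit.
model.null <- glmer(y ~ log.n.s + (1 | subject) + (1 | environment),
                    data = df, family = binomial(link = "logit"),
                    control = glmerControl(optimizer = "bobyqa",
                                           optCtrl = list(maxfun = 2e5)))
```

Neither change alters the model being fit; scaling only reparameterizes the fixed effect, and the optimizer switch only changes how the likelihood is maximized.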

Also, if my alternative model is:
alt.null = glmer(y ~ group + log(n) + (1 | subject) + (1 | environment),
                 data = df, family = binomial(link = "logit"))

Is there a way I can test whether the over-dispersion in G1 is
significantly smaller than in G2?
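The mean-level group effect can be tested with a likelihood-ratio test of the two nested models above. For the dispersion question itself, one commonly sketched option (my suggestion, not something established in this thread) is an observation-level random effect whose variance is allowed to differ between the groups:

```r
library(lme4)

# Mean-level group effect: likelihood-ratio test of the nested models above.
anova(model.null, alt.null)

# Group-specific over-dispersion: observation-level random effects (OLRE)
# with a separate variance for G1 and for G2, via lme4::dummy().
df$obs <- factor(seq_len(nrow(df)))
model.common <- glmer(y ~ group + log(n) + (1 | subject) + (1 | environment) +
                        (1 | obs),                       # one shared OLRE variance
                      data = df, family = binomial)
model.disp   <- glmer(y ~ group + log(n) + (1 | subject) + (1 | environment) +
                        (0 + dummy(group, "G1") | obs) + # G1-specific variance
                        (0 + dummy(group, "G2") | obs),  # G2-specific variance
                      data = df, family = binomial)
anova(model.common, model.disp)   # LRT for unequal dispersion between groups
```

The comparison of `model.common` against `model.disp` asks exactly whether one over-dispersion variance suffices for both groups; the test is two-sided, so the direction (G1 smaller) would be read off the estimated OLRE variances.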


Thanks a lot


On Fri, Mar 28, 2014 at 10:56 AM, nimrod.rubinstein <
nimrod.rubinstein at gmail.com> wrote: