A mixed effects model for variances of binomial probability as a response

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20140326/e5666b1d/attachment.pl>
Dear Nimrod,

Just my 2 cents (even less than 2):
Hi all,

I have the following data:

50 subjects that belong to 2 groups (25 in each group). For each subject I
have replicated measurements (in the range of 15-30 replicates per subject)
of the number of trails as well as the number successes out of these
trails. Each single trial in each single replicate for each subject is
assumed to be a Bernoulli experiment, therefore the number of successes in
each replicate is assumed to be a binomially distributed random variable.
To complicate things even more, for each subject from each group I have not
1 but 3 series of these replicated measurements, from 3 different
environments (the 3 environments are the same for all subjects).

A few example lines of how my data looks like:
subject group environment     number.trails  number.successes

    S1    G1          E1 10100,13209,17621  8500,11622,14684

    S1    G1          E2 13322,12225,10024  10351,10222,7951

    S1    G1          E3  9812,13500,13214  7245,10622,10684

    .

    .

    .

    S50    G2          E1 60547,75345,68412 12014,13214,15244

    S50    G2          E2 50897,62541,61247 10141,13251,30645

    S50    G2          E3 59214,84264,74156 17245,16425,11123

Where the comma separated number of successes correspond to the comma
separated number of trials in each replicate.

I want to test whether the variances of the success probabilities of
subjects from group G1 are significantly smaller than the variances of the
success probabilities of subjects from group G2.

But, as you say in point 1. below (and in another reply), if this is really
binomial, then you already know that if probabilities in G1 are different
from probabilities in G2 your variances will differ.

So I am not sure I even understand this hypothesis/question.
In other words, for each subject in each replicate a success probability
can be estimated from the observed data. Then the variance of those success
probabilities over all replicates from a specific environment can be
estimated. I would like to know whether these variances are smaller in
subjects from G1 vs. subjects from G2.

The issues are:

   1. The variance of a success probability depends on the success
   probability itself (natural to binomially distributed data). I need to
   account for this effect.
   2. The number of trails (although at least in the hundreds for all
   subjects) varies several orders of magnitude among subjects. That by itself
   affects the accuracy of estimating the success probability, but in
   addition, I cannot rule out an inherent dependency between the number of
   observed trials and the success probability (there is no control over the
   number of trials in the experiment - it is observed). In other words,
   having a lower number of trails by itself may affect the success
   probability.

That last issue seems very problematic to me. If I understand correctly, we
could think of a perverse situation where, say, you can never observe low
success probabilities if those are related to number of trials around 0.

Can you do a simple plot of probability of success (or its logit) vs number
of trials, to rule out perverse scenarios? Is probability of success always
larger than 0? And number of trials always sufficiently different from 0?

And you say "varies by subject", so the number of trials is always the same
for a subject? And the potential relationship prob <-> number of trials is
due to among-subject variation? So you could approach the problem via a
random effect for subject? If you compare G1 and G2 for number of trials:
are there differences?
I though a simple way to test my hypothesis is to compute the success
probability in each replicate as number.successes/number.trails, and then
the sample mean and sample variance of these success probabilities over all
replicates, as well as the mean number of trials over all replicates (added
to my data as sample.mean, sample.var and mean.trials fields). Then I can
use the lmer function in the lme4 package this way:

null.model = lmer(sample.var ~ sample.mean + mean.trials + (1|environment),
my.data, REML = F)
alt.model = lmer(sample.var ~ sample.mean + mean.trials + group +
(1|environment), my.data, REML = F)

So fixed effects are: sample.mean, mean.trials and group, and random
effects is environment since the three environments in my data are simply a
random sample from many other possible environments.
I do not see how this would answer your questions, given the above
concerns/issues and other possible approaches (just compare prob. between
G1 and G2, accounting for environment).

Best,

R.
To test whether group has a significant effect I'll do:

anova(null.model, alt.model).

My concerns are:

   1. Should I add subject as a random effect? My intuition is not to since
   subjects are unique to groups, which is the effect I'm after. However, I do
   want to account for the fact that sample variances of the same subject from
   the three different environments are correlated.
   2. The response - sample variances, cannot be assumed to follow a normal
   distribution (they only take positive values to begin with and as far as I
   know (n-1)*sample.variance/population.variance ~ chi.square with df = n-1)
   and therefore the assumptions of a linear mixed effects model do not
   strictly hold. Is there a generalized linear model that would be
   appropriate for this?

Thanks a lot,
nimrod

	[[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Aut?noma de Madrid 
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz
Hi Ramon (and others reading this),

Thanks for the response.

I'm attaching my real data - it has these columns:
subject - subject name (S1, S2,...)
group - group name (G1, G2) - a subject can only belong to either of the
two groups
environment - environment name (E1, E2, E3).
the next 27 columns (named n.1-n.27) are the number of trails in each
replicate
the following 27 columns (named k.1-k.27) are the number of successes out
of the number of trials in each corresponding replicate (so k.1 is the
number of success out of n.1 trials, and so on).

Note that there are NA values in corresponding n and k cells in the table.
Also, not all subjects were sampled from all 3 environments due to
technical problems, so the data are not fully balanced.

I know this table is a bit different from my earlier description of the
data. It is so only and it is because I wanted to make it simple.

Let me try and clarify, and hopefully all of the below will answer your
questions.

What may affect the variances of the success probabilities (proportions)?
1. The proportion itself (natural to binomially distributed data). See
attached Fig1 that plots sample variances of success probabilities vs.
their sample mean.
2. The number of trails in each replicate. As I said, the number of trails
is inherent to the subject and the environment it was measured in, the
experimenter has no control over it. As the attached Fig2 shows, subjects
with a high average number of trials have lower sample variances of their
success probabilities. The opposite however is not clear. The sample
variances of subjects with a low average number of trials vary greatly. So
obviously, the number of trails has an effect on the variance of the
proportion but there's still a sizable fraction of variance (of the
proportion variances) that are not explained by that.
3. The group - that's the effect I'm interested in. Fig1 and Fig2 show that
the G1 dots have lower sample variances of success probabilities than G2
dots.
4. The environment. These environments were chosen at random and therefore
I think they should be REs. (In all attached figures the environments isn't
indicated)

I've also attached a figure of the success probabilities vs, the number of
trials (Fig3).

I think the figures show pretty clearly that both number of trails, success
probability, and group affect the variance of success probability, and
therefore the model I'm looking for should estimate all of these.

Thanks a lot,
Nimrod

Dear Nimrod,

Just my 2 cents (even less than 2):

On Thu, 01-01-1970, at 01:00, nimrod.rubinstein <
nimrod.rubinstein at gmail.com> wrote:
Hi all,

I have the following data:

50 subjects that belong to 2 groups (25 in each group). For each subject
I
have replicated measurements (in the range of 15-30 replicates per
subject)
of the number of trails as well as the number successes out of these
trails. Each single trial in each single replicate for each subject is
assumed to be a Bernoulli experiment, therefore the number of successes
in
each replicate is assumed to be a binomially distributed random variable.
To complicate things even more, for each subject from each group I have
not
1 but 3 series of these replicated measurements, from 3 different
environments (the 3 environments are the same for all subjects).

A few example lines of how my data looks like:
subject group environment     number.trails  number.successes

    S1    G1          E1 10100,13209,17621  8500,11622,14684

    S1    G1          E2 13322,12225,10024  10351,10222,7951

    S1    G1          E3  9812,13500,13214  7245,10622,10684

    .

    .

    .

    S50    G2          E1 60547,75345,68412 12014,13214,15244

    S50    G2          E2 50897,62541,61247 10141,13251,30645

    S50    G2          E3 59214,84264,74156 17245,16425,11123

Where the comma separated number of successes correspond to the comma
separated number of trials in each replicate.

I want to test whether the variances of the success probabilities of
subjects from group G1 are significantly smaller than the variances of
the
success probabilities of subjects from group G2.

But, as you say in point 1. below (and in another reply), if this is really
binomial, then you already know that if probabilities in G1 are different
from probabilities in G2 your variances will differ.

So I am not sure I even understand this hypothesis/question.

In other words, for each subject in each replicate a success probability
can be estimated from the observed data. Then the variance of those
success
probabilities over all replicates from a specific environment can be
estimated. I would like to know whether these variances are smaller in
subjects from G1 vs. subjects from G2.

The issues are:

   1. The variance of a success probability depends on the success
   probability itself (natural to binomially distributed data). I need to
   account for this effect.
   2. The number of trails (although at least in the hundreds for all
   subjects) varies several orders of magnitude among subjects. That by
itself
   affects the accuracy of estimating the success probability, but in
   addition, I cannot rule out an inherent dependency between the number
of
   observed trials and the success probability (there is no control over
the
   number of trials in the experiment - it is observed). In other words,
   having a lower number of trails by itself may affect the success
   probability.

That last issue seems very problematic to me. If I understand correctly, we
could think of a perverse situation where, say, you can never observe low
success probabilities if those are related to number of trials around 0.

Can you do a simple plot of probability of success (or its logit) vs number
of trials, to rule out perverse scenarios? Is probability of success always
larger than 0? And number of trials always sufficiently different from 0?

And you say "varies by subject", so the number of trials is always the same
for a subject? And the potential relationship prob <-> number of trials is
due to among-subject variation? So you could approach the problem via a
random effect for subject? If you compare G1 and G2 for number of trials:
are there differences?

I though a simple way to test my hypothesis is to compute the success
probability in each replicate as number.successes/number.trails, and then
the sample mean and sample variance of these success probabilities over
all
replicates, as well as the mean number of trials over all replicates
(added
to my data as sample.mean, sample.var and mean.trials fields). Then I can
use the lmer function in the lme4 package this way:

null.model = lmer(sample.var ~ sample.mean + mean.trials +
(1|environment),
my.data, REML = F)
alt.model = lmer(sample.var ~ sample.mean + mean.trials + group +
(1|environment), my.data, REML = F)

So fixed effects are: sample.mean, mean.trials and group, and random
effects is environment since the three environments in my data are
simply a
random sample from many other possible environments.

I do not see how this would answer your questions, given the above
concerns/issues and other possible approaches (just compare prob. between
G1 and G2, accounting for environment).

Best,

R.

To test whether group has a significant effect I'll do:

anova(null.model, alt.model).

My concerns are:

   1. Should I add subject as a random effect? My intuition is not to
since
   subjects are unique to groups, which is the effect I'm after.
However, I do
   want to account for the fact that sample variances of the same
subject from
   the three different environments are correlated.
   2. The response - sample variances, cannot be assumed to follow a
normal
   distribution (they only take positive values to begin with and as far
as I
   know (n-1)*sample.variance/population.variance ~ chi.square with df =
n-1)
   and therefore the assumptions of a linear mixed effects model do not
   strictly hold. Is there a generalized linear model that would be
   appropriate for this?

Thanks a lot,
nimrod

      [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Aut?noma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fig1.png
Type: image/png
Size: 6661 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20140327/2f26ba93/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fig2.png
Type: image/png
Size: 7407 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20140327/2f26ba93/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fig3.png
Type: image/png
Size: 16521 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20140327/2f26ba93/attachment-0005.png>
Hi Ramon (and others reading this),

Thanks for the response.

I'm attaching my real data - it has these columns:
subject - subject name (S1, S2,...)
group - group name (G1, G2) - a subject can only belong to either of the
two groups
environment - environment name (E1, E2, E3).
the next 27 columns (named n.1-n.27) are the number of trails in each
replicate
the following 27 columns (named k.1-k.27) are the number of successes out
of the number of trials in each corresponding replicate (so k.1 is the
number of success out of n.1 trials, and so on).

Note that there are NA values in corresponding n and k cells in the table.
Also, not all subjects were sampled from all 3 environments due to
technical problems, so the data are not fully balanced.

I know this table is a bit different from my earlier description of the
data. It is so only and it is because I wanted to make it simple.

Let me try and clarify, and hopefully all of the below will answer your
questions.

What may affect the variances of the success probabilities (proportions)?
1. The proportion itself (natural to binomially distributed data). See
attached Fig1 that plots sample variances of success probabilities vs.
their sample mean.
But what are those variances? How are you computing them? (See below, though)
2. The number of trails in each replicate. As I said, the number of trails
is inherent to the subject and the environment it was measured in, the
experimenter has no control over it. As the attached Fig2 shows, subjects
with a high average number of trials have lower sample variances of their
success probabilities. The opposite however is not clear. The sample
variances of subjects with a low average number of trials vary greatly. So
obviously, the number of trails has an effect on the variance of the
proportion but there's still a sizable fraction of variance (of the
proportion variances) that are not explained by that.
3. The group - that's the effect I'm interested in. Fig1 and Fig2 show that
the G1 dots have lower sample variances of success probabilities than G2
dots.
4. The environment. These environments were chosen at random and therefore
I think they should be REs. (In all attached figures the environments isn't
indicated)

I've also attached a figure of the success probabilities vs, the number of
trials (Fig3).
Yes, I think the scenario I was concerned about is not here. Good.
I think the figures show pretty clearly that both number of trails, success
probability, and group affect the variance of success probability, and
therefore the model I'm looking for should estimate all of these.

It is way too late here, but maybe we are using the same name (variance)
for two different things here: the variance of p, which is just a simple
function of the proportion, of the p itself, and the variance in the
collection of p's, where each p is not necessarily estimating the same
probability. And I think you care about the second problem, something like
saying that individuals in G1 and G2 have different dispersion in their
p's.  So you have a distribution for the p's of G1 and a distribution for
the p's of G2, and you want to know whether those distributions have the
same variance while also possibly controlling for other factors. I think
the book by Gelman and Hill has some examples of this kind (I do not have
it here now, though).

Best,

R.
Thanks a lot,
Nimrod

On Thu, Mar 27, 2014 at 4:39 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com>wrote:

Dear Nimrod,

Just my 2 cents (even less than 2):

On Thu, 01-01-1970, at 01:00, nimrod.rubinstein <
nimrod.rubinstein at gmail.com> wrote:
Hi all,

I have the following data:

50 subjects that belong to 2 groups (25 in each group). For each subject
I
have replicated measurements (in the range of 15-30 replicates per
subject)
of the number of trails as well as the number successes out of these
trails. Each single trial in each single replicate for each subject is
assumed to be a Bernoulli experiment, therefore the number of successes
in
each replicate is assumed to be a binomially distributed random variable.
To complicate things even more, for each subject from each group I have
not
1 but 3 series of these replicated measurements, from 3 different
environments (the 3 environments are the same for all subjects).

A few example lines of how my data looks like:
subject group environment     number.trails  number.successes

    S1    G1          E1 10100,13209,17621  8500,11622,14684

    S1    G1          E2 13322,12225,10024  10351,10222,7951

    S1    G1          E3  9812,13500,13214  7245,10622,10684

    .

    .

    .

    S50    G2          E1 60547,75345,68412 12014,13214,15244

    S50    G2          E2 50897,62541,61247 10141,13251,30645

    S50    G2          E3 59214,84264,74156 17245,16425,11123

Where the comma separated number of successes correspond to the comma
separated number of trials in each replicate.

I want to test whether the variances of the success probabilities of
subjects from group G1 are significantly smaller than the variances of
the
success probabilities of subjects from group G2.

But, as you say in point 1. below (and in another reply), if this is really
binomial, then you already know that if probabilities in G1 are different
from probabilities in G2 your variances will differ.

So I am not sure I even understand this hypothesis/question.

In other words, for each subject in each replicate a success probability
can be estimated from the observed data. Then the variance of those
success
probabilities over all replicates from a specific environment can be
estimated. I would like to know whether these variances are smaller in
subjects from G1 vs. subjects from G2.

The issues are:

   1. The variance of a success probability depends on the success
   probability itself (natural to binomially distributed data). I need to
   account for this effect.
   2. The number of trails (although at least in the hundreds for all
   subjects) varies several orders of magnitude among subjects. That by
itself
   affects the accuracy of estimating the success probability, but in
   addition, I cannot rule out an inherent dependency between the number
of
   observed trials and the success probability (there is no control over
the
   number of trials in the experiment - it is observed). In other words,
   having a lower number of trails by itself may affect the success
   probability.

That last issue seems very problematic to me. If I understand correctly, we
could think of a perverse situation where, say, you can never observe low
success probabilities if those are related to number of trials around 0.

Can you do a simple plot of probability of success (or its logit) vs number
of trials, to rule out perverse scenarios? Is probability of success always
larger than 0? And number of trials always sufficiently different from 0?

And you say "varies by subject", so the number of trials is always the same
for a subject? And the potential relationship prob <-> number of trials is
due to among-subject variation? So you could approach the problem via a
random effect for subject? If you compare G1 and G2 for number of trials:
are there differences?

I though a simple way to test my hypothesis is to compute the success
probability in each replicate as number.successes/number.trails, and then
the sample mean and sample variance of these success probabilities over
all
replicates, as well as the mean number of trials over all replicates
(added
to my data as sample.mean, sample.var and mean.trials fields). Then I can
use the lmer function in the lme4 package this way:

null.model = lmer(sample.var ~ sample.mean + mean.trials +
(1|environment),
my.data, REML = F)
alt.model = lmer(sample.var ~ sample.mean + mean.trials + group +
(1|environment), my.data, REML = F)

So fixed effects are: sample.mean, mean.trials and group, and random
effects is environment since the three environments in my data are
simply a
random sample from many other possible environments.

I do not see how this would answer your questions, given the above
concerns/issues and other possible approaches (just compare prob. between
G1 and G2, accounting for environment).

Best,

R.

To test whether group has a significant effect I'll do:

anova(null.model, alt.model).

My concerns are:

   1. Should I add subject as a random effect? My intuition is not to
since
   subjects are unique to groups, which is the effect I'm after.
However, I do
   want to account for the fact that sample variances of the same
subject from
   the three different environments are correlated.
   2. The response - sample variances, cannot be assumed to follow a
normal
   distribution (they only take positive values to begin with and as far
as I
   know (n-1)*sample.variance/population.variance ~ chi.square with df =
n-1)
   and therefore the assumptions of a linear mixed effects model do not
   strictly hold. Is there a generalized linear model that would be
   appropriate for this?

Thanks a lot,
nimrod

      [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Aut?noma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz

Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Aut?noma de Madrid 
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20140328/f59af234/attachment.pl>
I'm reattaching my data as an RData object. Here's how it looks like:
head(df)
subject group environment     n     k
1      S1    G1          E1  1419  1169
2      S1    G1          E2  5992   734
3      S1    G1          E3  5133   519
4      S2    G1          E1 23187 16038
5      S2    G1          E2 93255 46592
6      S2    G1          E3 35382 24486

This is probably much easire to work with now.

After defining the response as:
y=cbind(df$k,(df$n-df$k)) (following "The R Book" by Michael Crawley, Ch.
16)

I tried fitting this model:
model.null=glmer(y~log(n)+(1|subject)+(1|environment),data=df,family=binomial(link="logit"))

but I'm getting these warning messages:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.0937283 (tol = 0.001)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

Are there any parameters that can be tuned in order to achieve convergence?

Also, what should I rescale? Taking log(n) instead of n as a fixed effect
removed this warning:
1: In checkScaleX(X, ctrl = control) :
  Some predictor variables are on very different scales: consider rescaling

Any idea what else should I rescale?

Also, if my alternative model is:
alt.null=glmer(y~group+log(n)+(1|subject)+(1|environment),data=df,family=binomial(link="logit"))

Is there a way I can test whether the over-dispersion in G1 is
significantly smaller than in G2

Thanks a lot

On Fri, Mar 28, 2014 at 10:56 AM, nimrod.rubinstein <

Hi Ramon and others,

Thanks a lot for the advice.

I think I need to rephrase my hypothesis, as follows:
The data are binomial - for each subject, from each group, in each
environment, I have replicates of trials and successes in these trials.
I would like to know:
1. Are the data for each group (proportion or probability of success being
the response, environment and perhaps subject begin random effects)
significantly over-dispersed?
2. If 1 is true, then is the over-dispersion in group G1 significantly
smaller than in group G2?

It seems that glmer might be able to achieve that but I'm not entirely
sure how.

Advice would be greatly appreciated.

On Thu, Mar 27, 2014 at 10:06 PM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com>wrote:

On Thu, 01-01-1970, at 01:00, nimrod.rubinstein <
nimrod.rubinstein at gmail.com> wrote:
Hi Ramon (and others reading this),

Thanks for the response.

I'm attaching my real data - it has these columns:
subject - subject name (S1, S2,...)
group - group name (G1, G2) - a subject can only belong to either of the
two groups
environment - environment name (E1, E2, E3).
the next 27 columns (named n.1-n.27) are the number of trails in each
replicate
the following 27 columns (named k.1-k.27) are the number of successes
out
of the number of trials in each corresponding replicate (so k.1 is the
number of success out of n.1 trials, and so on).

Note that there are NA values in corresponding n and k cells in the
table.
Also, not all subjects were sampled from all 3 environments due to
technical problems, so the data are not fully balanced.

I know this table is a bit different from my earlier description of the
data. It is so only and it is because I wanted to make it simple.

Let me try and clarify, and hopefully all of the below will answer your
questions.

What may affect the variances of the success probabilities
(proportions)?
1. The proportion itself (natural to binomially distributed data). See
attached Fig1 that plots sample variances of success probabilities vs.
their sample mean.
But what are those variances? How are you computing them? (See below,
though)

2. The number of trails in each replicate. As I said, the number of
trails
is inherent to the subject and the environment it was measured in, the
experimenter has no control over it. As the attached Fig2 shows,
subjects
with a high average number of trials have lower sample variances of
their
success probabilities. The opposite however is not clear. The sample
variances of subjects with a low average number of trials vary greatly.
So
obviously, the number of trails has an effect on the variance of the
proportion but there's still a sizable fraction of variance (of the
proportion variances) that are not explained by that.
3. The group - that's the effect I'm interested in. Fig1 and Fig2 show
that
the G1 dots have lower sample variances of success probabilities than G2
dots.
4. The environment. These environments were chosen at random and
therefore
I think they should be REs. (In all attached figures the environments
isn't
indicated)

I've also attached a figure of the success probabilities vs, the number
of
trials (Fig3).
Yes, I think the scenario I was concerned about is not here. Good.

I think the figures show pretty clearly that both number of trails,
success
probability, and group affect the variance of success probability, and
therefore the model I'm looking for should estimate all of these.

It is way too late here, but maybe we are using the same name (variance)
for two different things here: the variance of p, which is just a simple
function of the proportion, of the p itself, and the variance in the
collection of p's, where each p is not necessarily estimating the same
probability. And I think you care about the second problem, something like
saying that individuals in G1 and G2 have different dispersion in their
p's.  So you have a distribution for the p's of G1 and a distribution for
the p's of G2, and you want to know whether those distributions have the
same variance while also possibly controlling for other factors. I think
the book by Gelman and Hill has some examples of this kind (I do not have
it here now, though).

Best,

R.

Thanks a lot,
Nimrod

On Thu, Mar 27, 2014 at 4:39 AM, Ramon Diaz-Uriarte <rdiaz02 at gmail.com
wrote:

Dear Nimrod,

Just my 2 cents (even less than 2):

On Thu, 01-01-1970, at 01:00, nimrod.rubinstein <
nimrod.rubinstein at gmail.com> wrote:
Hi all,

I have the following data:

50 subjects that belong to 2 groups (25 in each group). For each
subject
I
have replicated measurements (in the range of 15-30 replicates per
subject)
of the number of trails as well as the number successes out of these
trails. Each single trial in each single replicate for each subject
is
assumed to be a Bernoulli experiment, therefore the number of
successes
in
each replicate is assumed to be a binomially distributed random
variable.
To complicate things even more, for each subject from each group I
have
not
1 but 3 series of these replicated measurements, from 3 different
environments (the 3 environments are the same for all subjects).

A few example lines of how my data looks like:
subject group environment     number.trails  number.successes

    S1    G1          E1 10100,13209,17621  8500,11622,14684

    S1    G1          E2 13322,12225,10024  10351,10222,7951

    S1    G1          E3  9812,13500,13214  7245,10622,10684

    .

    .

    .

    S50    G2          E1 60547,75345,68412 12014,13214,15244

    S50    G2          E2 50897,62541,61247 10141,13251,30645

    S50    G2          E3 59214,84264,74156 17245,16425,11123

Where the comma separated number of successes correspond to the comma
separated number of trials in each replicate.

I want to test whether the variances of the success probabilities of
subjects from group G1 are significantly smaller than the variances
of
the
success probabilities of subjects from group G2.

But, as you say in point 1. below (and in another reply), if this is
really
binomial, then you already know that if probabilities in G1 are
different
from probabilities in G2 your variances will differ.

So I am not sure I even understand this hypothesis/question.

In other words, for each subject in each replicate a success
probability
can be estimated from the observed data. Then the variance of those
success
probabilities over all replicates from a specific environment can be
estimated. I would like to know whether these variances are smaller
in
subjects from G1 vs. subjects from G2.

The issues are:

   1. The variance of a success probability depends on the success
   probability itself (natural to binomially distributed data). I
need to
   account for this effect.
   2. The number of trails (although at least in the hundreds for all
   subjects) varies several orders of magnitude among subjects. That
by
itself
   affects the accuracy of estimating the success probability, but in
   addition, I cannot rule out an inherent dependency between the
number
of
   observed trials and the success probability (there is no control
over
the
   number of trials in the experiment - it is observed). In other
words,
   having a lower number of trails by itself may affect the success
   probability.

That last issue seems very problematic to me. If I understand
correctly, we
could think of a perverse situation where, say, you can never observe
low
success probabilities if those are related to number of trials around
0.

Can you do a simple plot of probability of success (or its logit) vs
number
of trials, to rule out perverse scenarios? Is probability of success
always
larger than 0? And number of trials always sufficiently different from
0?

And you say "varies by subject", so the number of trials is always the
same
for a subject? And the potential relationship prob <-> number of
trials is
due to among-subject variation? So you could approach the problem via a
random effect for subject? If you compare G1 and G2 for number of
trials:
are there differences?

I though a simple way to test my hypothesis is to compute the success
probability in each replicate as number.successes/number.trails, and
then
the sample mean and sample variance of these success probabilities
over
all
replicates, as well as the mean number of trials over all replicates
(added
to my data as sample.mean, sample.var and mean.trials fields). Then
I can
use the lmer function in the lme4 package this way:

null.model = lmer(sample.var ~ sample.mean + mean.trials +
(1|environment),
my.data, REML = F)
alt.model = lmer(sample.var ~ sample.mean + mean.trials + group +
(1|environment), my.data, REML = F)

So fixed effects are: sample.mean, mean.trials and group, and random
effects is environment since the three environments in my data are
simply a
random sample from many other possible environments.

I do not see how this would answer your questions, given the above
concerns/issues and other possible approaches (just compare prob.
between
G1 and G2, accounting for environment).

Best,

R.

To test whether group has a significant effect I'll do:

anova(null.model, alt.model).

My concerns are:

   1. Should I add subject as a random effect? My intuition is not to
since
   subjects are unique to groups, which is the effect I'm after.
However, I do
   want to account for the fact that sample variances of the same
subject from
   the three different environments are correlated.
   2. The response - sample variances, cannot be assumed to follow a
normal
   distribution (they only take positive values to begin with and as
far
as I
   know (n-1)*sample.variance/population.variance ~ chi.square with
df =
n-1)
   and therefore the assumptions of a linear mixed effects model do
not
   strictly hold. Is there a generalized linear model that would be
   appropriate for this?

Thanks a lot,
nimrod

      [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Aut?noma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz

--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Aut?noma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz