How many groups is enough?

On 31/08/2009, at 8:00 AM, Kingsford Jones wrote:

Here are some thoughts, which are just conjecture (caveat emptor).
I'd be interested in hearing contrary facts or opinion.

Because of the flexibility of mixed models I think it's hard to come
up with rules of thumb here. To examine the various scenarios,
simulations would need to look at effects of number of groups, number
of observations within groups and the balance of those observations
between groups, error variance, group variance, ratio of error and
group variances, number of levels, types of random slopes, random
effects covariance structure, error covariance structure, fixed
effects structure, etc... ?Clearly permutations of the above could
lead to an awful lot of simulations, not to mention what happens when
you move away from normal errors and work with GLMMs.

My guess is that in general, for small numbers of groups (or just
small between group variance?) the sampling distribution of the
between group variance will have a long right tail and large spread.
Because the REML estimates are unbiased this would imply that when you
have few groups the majority (and perhaps a large majority) of the
estimates will be low, while some will be very high.

One point, is that for most analyses we are not interested in estimates of
the random effect variances. My impression is that other parameter estimates
are fairly robust to the random effects variance, so if the models fit
sensibly then it seems a reasonable approach. One problem with a small
number of groups may be the use of Empirical Bayes, as it ignores the
estimate uncertainty. I assume somebody has written a paper on this,
advocating full Bayesian analysis. People seem happy to do random effects
meta-analysis with only a few trials.
I agree that the precision of estimates of the variance components can
be poor and that this is not that much of a problem when one is
primarily interested in the estimates of the fixed-effects parameters.
 (By the way, the statement that "REML estimates are unbiased" is not
true in general.  Even in the simple, balanced cases where they are
unbiased, I don't think it is an important property because the
distribution of the estimator is so skewed that characterizing the
distribution by its mean is unrealistic.).

I produced some plots of the profiled likelihood of the variance
components for a simple, balanced example with 6 groups (a model for
the Dyestuff data).  They are rather sobering although one should
expect highly skewed patterns for a variance estimate (think of the
simplest case of the estimate of a variance from the mythical i.i.d.
Gaussian sample).  The plots are available in
http://lme4.r-forge.r-project.org/slides/2009-07-21-Seewiesen/4PrecisionD.pdf
Ken

So the question remains: "what is a 'small' number of groups?". ?I'm
not sure but the following may be suggestive, at least of the symmetry
of the sampling distribution (i.e. chi sq w/ df = # groups - 1):

ngroups <- c(4, 6, 10, 15, 20)
plot(0, type='n', xlim=c(0, 30), ylim=c(0, .3))
for (i in ngroups) {
?plot(function(x) dchisq(x, i - 1), 0, 60, add=TRUE)
}

Also, googling turned up the paper below, which for a sub-class of
mixed models suggests that >=50 groups is sufficient to get
group-level variances and standard errors that are unbiased (but not
necessarily low-variance, AFAICS).

@article{maas2005sufficient,
?title={{Sufficient sample sizes for multilevel modeling}},
?author={Maas, C.J.M. and Hox, J.J.},
?journal={Methodology},
?volume={1},
?number={3},
?pages={86--92},
?year={2005}
?abstract={An important problem in multilevel modeling is what
constitutes a sufficient sample size for accurate estimation. In
multilevel analysis, the major restriction is often the higher-level
sample size. In this paper, a simulation study is used to determine
the influence of different sample sizes at the group level on the
accuracy of the estimates (regression coefficients and variances)
and their standard errors. In addition, the influence of other
factors, such as the lowest-level sample size and different variance
distributions between the levels (different intraclass correlations),
is examined. The results show that only a small sample size
at level two (meaning a sample of 50 or less) leads to biased
estimates of the second-level standard errors. In all of the other
simulated conditions the estimates of the regression coefficients, the
variance components, and the standard errors are unbiased
and accurate.}
}

hth,

Kingsford Jones

On Sun, Aug 30, 2009 at 5:53 AM, Highland Statistics
Ltd.<highstat at highstat.com> wrote:

Alain Zuur's response to a recent posting raises an interesting
question.
To
use a random effects model what number

of groups is actually sufficient?

I have heard talk of a minimum of 20 groups but have seen numerous
examples
in books and published papers with

much less than this. Is there a definitive reference on this?

Graham,

Actually..it turned out that the data set for which the question was
asked,
had about 350 subjects I believe.

But anyway....that is not your question. In general you see the magic "5"
in
some textbooks.....but for what it is worth...I recently had to program a
ZIP for 2-way nested data in RBugs..and in order to do this, I started
with
1-way and 2-way GLMMs (just to build up the code). And to check whether
my
code was "correct", I compared the results with that of 3-4 R packages
(e.g.
glmmPQL, lmer, glmml). ?The data set consisted of multiple observations
per
animal, for 5-30 animals per colony, and 9 colonies. I noticed that the
estimated values for the variance for the random intercept colony
differed a
lot between these packages. But all came with similar estimates for the
animal-within-colony random intercept.

Not that it tells you that much (all packages giving the same result
doesn't
mean it is correct)....but it is a bit worrying. Perhaps a simulation
study
gives you a better answer. The data I use(d) are highly unbalanced..so
that
may have played a role as well.

Alain

--

Dr. Alain F. Zuur
First author of:

1. Analysing Ecological Data (2007).
Zuur, AF, Ieno, EN and Smith, GM. Springer. 680 p.
URL: www.springer.com/0-387-45967-7

2. Mixed effects models and extensions in ecology with R. (2009).
Zuur, AF, Ieno, EN, Walker, N, Saveliev, AA, and Smith, GM. Springer.
http://www.springer.com/life+sci/ecology/book/978-0-387-87457-9

3. A Beginner's Guide to R (2009).
Zuur, AF, Ieno, EN, Meesters, EHWG. Springer
http://www.springer.com/statistics/computational/book/978-0-387-93836-3

Other books: http://www.highstat.com/books.htm

Statistical consultancy, courses, data analysis and software
Highland Statistics Ltd.
6 Laverock road
UK - AB41 6FN Newburgh
Tel: 0044 1358 788177
Email: highstat at highstat.com
URL: www.highstat.com
URL: www.brodgar.com

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

How many groups is enough?

Thread (7 messages)