How many groups is enough?
On Mon, Aug 31, 2009 at 3:25 AM, Ken Beath<ken at kjbeath.com.au> wrote:
On 31/08/2009, at 8:00 AM, Kingsford Jones wrote:
Here are some thoughts, which are just conjecture (caveat emptor). I'd be interested in hearing contrary facts or opinion. Because of the flexibility of mixed models I think it's hard to come up with rules of thumb here. To examine the various scenarios, simulations would need to look at effects of number of groups, number of observations within groups and the balance of those observations between groups, error variance, group variance, ratio of error and group variances, number of levels, types of random slopes, random effects covariance structure, error covariance structure, fixed effects structure, etc... ?Clearly permutations of the above could lead to an awful lot of simulations, not to mention what happens when you move away from normal errors and work with GLMMs. My guess is that in general, for small numbers of groups (or just small between group variance?) the sampling distribution of the between group variance will have a long right tail and large spread. Because the REML estimates are unbiased this would imply that when you have few groups the majority (and perhaps a large majority) of the estimates will be low, while some will be very high.
One point, is that for most analyses we are not interested in estimates of the random effect variances. My impression is that other parameter estimates are fairly robust to the random effects variance, so if the models fit sensibly then it seems a reasonable approach. One problem with a small number of groups may be the use of Empirical Bayes, as it ignores the estimate uncertainty. I assume somebody has written a paper on this, advocating full Bayesian analysis. People seem happy to do random effects meta-analysis with only a few trials.
I agree that the precision of estimates of the variance components can be poor and that this is not that much of a problem when one is primarily interested in the estimates of the fixed-effects parameters. (By the way, the statement that "REML estimates are unbiased" is not true in general. Even in the simple, balanced cases where they are unbiased, I don't think it is an important property because the distribution of the estimator is so skewed that characterizing the distribution by its mean is unrealistic.). I produced some plots of the profiled likelihood of the variance components for a simple, balanced example with 6 groups (a model for the Dyestuff data). They are rather sobering although one should expect highly skewed patterns for a variance estimate (think of the simplest case of the estimate of a variance from the mythical i.i.d. Gaussian sample). The plots are available in http://lme4.r-forge.r-project.org/slides/2009-07-21-Seewiesen/4PrecisionD.pdf
Ken
So the question remains: "what is a 'small' number of groups?". ?I'm
not sure but the following may be suggestive, at least of the symmetry
of the sampling distribution (i.e. chi sq w/ df = # groups - 1):
ngroups <- c(4, 6, 10, 15, 20)
plot(0, type='n', xlim=c(0, 30), ylim=c(0, .3))
for (i in ngroups) {
?plot(function(x) dchisq(x, i - 1), 0, 60, add=TRUE)
}
Also, googling turned up the paper below, which for a sub-class of
mixed models suggests that >=50 groups is sufficient to get
group-level variances and standard errors that are unbiased (but not
necessarily low-variance, AFAICS).
@article{maas2005sufficient,
?title={{Sufficient sample sizes for multilevel modeling}},
?author={Maas, C.J.M. and Hox, J.J.},
?journal={Methodology},
?volume={1},
?number={3},
?pages={86--92},
?year={2005}
?abstract={An important problem in multilevel modeling is what
constitutes a sufficient sample size for accurate estimation. In
multilevel analysis, the major restriction is often the higher-level
sample size. In this paper, a simulation study is used to determine
the influence of different sample sizes at the group level on the
accuracy of the estimates (regression coefficients and variances)
and their standard errors. In addition, the influence of other
factors, such as the lowest-level sample size and different variance
distributions between the levels (different intraclass correlations),
is examined. The results show that only a small sample size
at level two (meaning a sample of 50 or less) leads to biased
estimates of the second-level standard errors. In all of the other
simulated conditions the estimates of the regression coefficients, the
variance components, and the standard errors are unbiased
and accurate.}
}
hth,
Kingsford Jones
On Sun, Aug 30, 2009 at 5:53 AM, Highland Statistics
Ltd.<highstat at highstat.com> wrote:
Alain Zuur's response to a recent posting raises an interesting question. To use a random effects model what number of groups is actually sufficient? I have heard talk of a minimum of 20 groups but have seen numerous examples in books and published papers with much less than this. Is there a definitive reference on this?
Graham, Actually..it turned out that the data set for which the question was asked, had about 350 subjects I believe. But anyway....that is not your question. In general you see the magic "5" in some textbooks.....but for what it is worth...I recently had to program a ZIP for 2-way nested data in RBugs..and in order to do this, I started with 1-way and 2-way GLMMs (just to build up the code). And to check whether my code was "correct", I compared the results with that of 3-4 R packages (e.g. glmmPQL, lmer, glmml). ?The data set consisted of multiple observations per animal, for 5-30 animals per colony, and 9 colonies. I noticed that the estimated values for the variance for the random intercept colony differed a lot between these packages. But all came with similar estimates for the animal-within-colony random intercept. Not that it tells you that much (all packages giving the same result doesn't mean it is correct)....but it is a bit worrying. Perhaps a simulation study gives you a better answer. The data I use(d) are highly unbalanced..so that may have played a role as well. Alain -- Dr. Alain F. Zuur First author of: 1. Analysing Ecological Data (2007). Zuur, AF, Ieno, EN and Smith, GM. Springer. 680 p. URL: www.springer.com/0-387-45967-7 2. Mixed effects models and extensions in ecology with R. (2009). Zuur, AF, Ieno, EN, Walker, N, Saveliev, AA, and Smith, GM. Springer. http://www.springer.com/life+sci/ecology/book/978-0-387-87457-9 3. A Beginner's Guide to R (2009). Zuur, AF, Ieno, EN, Meesters, EHWG. Springer http://www.springer.com/statistics/computational/book/978-0-387-93836-3 Other books: http://www.highstat.com/books.htm Statistical consultancy, courses, data analysis and software Highland Statistics Ltd. 6 Laverock road UK - AB41 6FN Newburgh Tel: 0044 1358 788177 Email: highstat at highstat.com URL: www.highstat.com URL: www.brodgar.com
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models