
[R-meta] Effect sizes for mixed-effects models

2 messages · Lena Schäfer, James Pustejovsky

Dear James, 

Thank you so much for the detailed response! I apologize for the delay in getting back to you; my graduate school applications got in the way of this. Your suggestion is exactly what we have been looking for and your blogpost has been very informative. I do have a couple of follow-up questions and would be curious to hear what you think:

Calculating Cohen's d and its variance for mixed-effects models

Initially, we planned to follow Brysbaert and Stevens' (2018) suggestion to calculate Cohen's d for mixed-effects models using:

d = difference in means / sqrt(sum of all variance components).

Hedges (2007) proposes three approaches to scaling the treatment effect in mixed-effects models: standardizing the mean difference by the total variance (i.e., the sum of the within- and between-cluster components), by the within-cluster variance, or by the between-cluster variance. Intuitively, I understood Brysbaert and Stevens' approach to also use the total variance to scale the treatment effect, since *all* variance components are summed. However, Hedges seems to use a different formula for deriving dTotal, namely:

dT = difference in means / sqrt(between-cluster components + ((n - 1) / n) * within-cluster components).

Can you help me understand in which cases it would make sense to scale the difference in means by sqrt(sum of all variance components), and in which cases it would be more reasonable to use sqrt(between-cluster components + ((n - 1) / n) * within-cluster components)?

You also provided information on an alternative approach to calculating the variance of Cohen's d using:

Vd = (SEb / S)^2 + d^2 / (2 v)

For our mixed-effects models, I could derive SEb directly from the lme4 output, and I could substitute the standardizer used for calculating Cohen's d for S (either sqrt(sum of all variance components) or sqrt(between-cluster components + ((n - 1) / n) * within-cluster components)). In an effort to be as conservative as possible, I would use the number of participants as the degrees of freedom (v). Does this make sense?
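To make sure I have the arithmetic right, here is a minimal sketch of this computation (in Python rather than R, purely for illustration; mean_diff, se_b, the variance components, and the participant count are made-up stand-ins for lme4 output, not real estimates):

```python
import math

# Hypothetical numbers standing in for a fitted lme4 model:
mean_diff = 0.40               # fixed-effect estimate of the treatment contrast
se_b = 0.08                    # its standard error from the model summary
var_components = [0.25, 0.50]  # e.g. participant intercept variance + residual variance
n_participants = 60

# Standardizer: square root of the summed variance components
S = math.sqrt(sum(var_components))
d = mean_diff / S

# Approximate sampling variance of d: Vd = (SEb / S)^2 + d^2 / (2 * v),
# with v conservatively set to the number of participants
v = n_participants
Vd = (se_b / S) ** 2 + d ** 2 / (2 * v)
se_d = math.sqrt(Vd)
```

With these made-up inputs, S = sqrt(0.75) ≈ 0.866, d ≈ 0.462, and Vd ≈ 0.0103.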

Comparability of effect sizes derived from between- and within-subjects designs

Finally, I wonder to what extent the alternative formulas suggested in the blogpost allow for comparison across different experimental designs. In our meta-analysis, we aim to include effect sizes derived from both between- and within-subjects designs. To be able to synthesize the results from both types of designs in one analysis, we make sure to meet the three criteria outlined in Morris and DeShon (2002): 1) all effect sizes are ultimately transformed into a common metric (the between-subjects metric); 2) the same effect of interest is measured in both types of studies; and 3) sampling variances for all effect sizes are estimated based on the original design of the study (Table 2). Comparing the variance formulas provided in the blogpost to the ones provided in Morris & DeShon, it seems like the latter are slightly larger (and thus more conservative, which seems fine). However, I am uncertain about mixing the Morris & DeShon formulas for within- and between-subjects designs (to allow for comparison) with the alternative formulas you provided for calculating Cohen's d and its variance for mixed-effects models. Do you think this might cause any problems for the comparability of our effect sizes? I wonder whether you have some intuition on whether effect sizes derived using the alternative formulas proposed in the blogpost can be compared across different study designs.

Thank you so much for your help. Your time and effort are very much appreciated!

Best wishes, 

Lena Schaefer

On behalf of a collaborative team that additionally includes Leah Somerville (head of the Affective Neuroscience and Development Laboratory), Katherine Powers (former postdoc in the Affective Neuroscience and Development Laboratory) and Bernd Figner (Radboud University).

3 days later
Hi Lena,

To your first question: the distinction between Brysbaert and Stevens
(2018) and Hedges (2007) has to do with estimation, rather than the
definition of the effect size. Both studies use the same definition of the
effect size parameter (assuming standardization by the total variance).
Brysbaert and Stevens assume that you are working with the results of a
fitted mixed effects model, where the variance components would be
estimated using restricted maximum likelihood (REML). In contrast, Hedges
(2007) uses moment estimators assuming a balanced design. In his notation,
S_B^2 and S_W^2 are sample variances between and within-clusters,
respectively, which are not exactly the same as the REML estimators. The (n
- 1) / n term arises because S_B^2 is an overestimate of sigma_B^2 (the
between-cluster population variance). See the explanation on p. 347 in the
section "Estimation of delta_B". In a balanced design (where all clusters
are the same size), the two approaches to calculation should yield
identical estimates of total variance, I think, and even with some
imbalance the total variance estimates (and resulting effect size
estimates) should come very close.
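This point is easy to verify numerically. The following is an illustrative simulation (in Python rather than R; the cluster count, cluster size, and variance components are arbitrary choices) showing that S_B^2 alone overestimates sigma_B^2, while the (n - 1)/n combination recovers the total variance:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a balanced clustered design: m clusters of size n,
# with between-cluster SD sigma_B and within-cluster SD sigma_W.
m, n = 2000, 10          # many clusters, so the estimates are stable
sigma_B, sigma_W = 0.5, 1.0
cluster_effects = rng.normal(0, sigma_B, size=m)
y = cluster_effects[:, None] + rng.normal(0, sigma_W, size=(m, n))

# Moment estimators in Hedges' (2007) notation:
# S_W^2 = pooled within-cluster sample variance (estimates sigma_W^2)
# S_B^2 = sample variance of cluster means (estimates sigma_B^2 + sigma_W^2 / n)
S_W2 = y.var(axis=1, ddof=1).mean()
S_B2 = y.mean(axis=1).var(ddof=1)

# Because S_B^2 picks up sigma_W^2 / n, simply summing S_B^2 + S_W^2 would
# overestimate the total variance; the (n - 1)/n weight on S_W^2 corrects this:
total_hat = S_B2 + (n - 1) / n * S_W2
# total_hat estimates sigma_B^2 + sigma_W^2 = 0.25 + 1.00 = 1.25
```

In expectation, S_B^2 + ((n - 1)/n) S_W^2 = (sigma_B^2 + sigma_W^2/n) + ((n - 1)/n) sigma_W^2 = sigma_B^2 + sigma_W^2, which is why the correction term appears in Hedges' formula.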

To your second question about how to get the degrees of freedom, yes I
think using the total number of participants is probably a good and
conservative approximation.

To your final question about comparability across between- and
within-subjects designs: comparability hinges on whether the variance
components used in the denominator of d are the same across both types of
designs. In principle, using the methods outlined in my blog post, you
should be able to define and estimate effect sizes that are comparable
across both types of designs. Of course, in practice there may be factors
that differ across the two types of designs. For example, how the
treatment is operationalized in a within-subjects design might be different
from how it is typically operationalized in a between-subjects design. Or
the scales used to assess the outcome might differ between the two types of
designs. Thus, I would recommend approaching this issue both conceptually
and empirically. Conceptually, try to obtain effect size estimates that are
comparable in principle. Then empirically, examine whether effect sizes
differ on average according to the type of design.

James

On Fri, Dec 13, 2019 at 8:00 AM Lena Schäfer <lenaschaefer2304 at gmail.com>
wrote: