Skip to content

[R-meta] Performance of metafor::vcalc() vs clubSandwich::impute_covariance_matrix()

7 messages · Tamar Novetsky, James Pustejovsky, Wolfgang Viechtbauer

#
Hello,

I am working on a script to run multiple meta-regressions on different
subsets of the same dataset, and have been
using clubSandwich::impute_covariance_matrix() to generate the
variance-covariance matrix necessary as an input to metafor::rma.mv().
However, I recently learned that impute_covariance_matrix() has been
superseded by metafor::vcalc(), so I have been working to replace my usage
of the former function with the latter. In that process, I discovered that
vcalc() seems to be much slower than impute_covariance_matrix() - about
150x slower in one use case that I benchmarked using the microbenchmark
package. Since I will be running this many times in a loop, performance
matters quite a lot to me in this context.

Can anyone help me understand why vcalc() would be so much slower? Is it
possible that I'm using it incorrectly?

Secondly/possibly relatedly, I found that the results from vcalc() are
always either exactly the same or exactly double the results from
impute_covariance_matrix(). Does anyone have a sense of why that would be?
Could that be related to the performance differences?

Thanks so much for your help,


*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone
#
Hi Tamar,

The difference in compute time is because of a difference in how the
default output of these functions is structured.
clubSandwich::impute_covariance_matrix() returns a block-diagonal by
default. metafor::vcalc() returns a full (dense) matrix by default. Say
that you have J studies and study j has kj effect sizes. The block-diagonal
matrix has sum(kj^2) entries, whereas the full matrix has sum(kj)^2
entries. If J is large and the kjs are mostly small, this can make for a
really big difference in object size. However, setting the option
vcalc(sparse = TRUE) will return a block-diagonal matrix and should lead to
performance comparable to impute_covariance_matrix().

Regarding your second question, I'm not sure what might be going on. Could
you provide a reproducible example?

James

On Tue, Aug 6, 2024 at 8:20?AM Tamar Novetsky via R-sig-meta-analysis <
r-sig-meta-analysis at r-project.org> wrote:

            

  
  
#
Thanks so much, James! Unfortunately, I didn't find a big enough
improvement in performance using vcalc(sparse = TRUE) - in the example
below, the default vcalc arguments take ~100x longer than
impute_covariance_matrix, while vcalc(sparse = TRUE) takes ~60x longer.

I couldn't reproduce the 2x values using non-proprietary data, so there
might just be something weird going on with my dataset!

Reproducible example (adapted from metafor's examples in the vcalc function
documentation):
```
library(tidyverse)
library(metafor)
library(clubSandwich)
library(microbenchmark)
set.seed(42)

# example data from metafor
dat <- dat.assink2016

# augment data so it has >1500 rows
new_rows <-
  tibble(
    study = 18:167,
    n_esid = sample(x = 1:max(dat$esid), size = 150, replace = TRUE)
  ) %>%
  uncount(n_esid) %>%
  group_by(study) %>%
  mutate(esid = row_number()) %>%
  ungroup() %>%
  mutate(
    id = row_number() + 100,
    yi = rnorm(nrow(.), mean(dat$yi), sd(dat$yi)),
    vi = rnorm(nrow(.), mean(dat$vi), sd(dat$vi)),
    vi = if_else(vi < 0, -1*vi, vi), # make sure vi is always positive
    pubstatus = sample(x = dat$pubstatus, size = nrow(.), replace = TRUE),
    year = sample(x = dat$year, size = nrow(.), replace = TRUE),
    deltype = sample(x = dat$deltype, size = nrow(.), replace = TRUE)
  )
dat_big <- bind_rows(dat, new_rows)

# benchmark performance with full matrix (this takes a minute to run)
res <- microbenchmark(
  "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6),
  "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
  times = 10
)
summary(res)

# benchmark performance with sparse matrix (also takes a minute to run)
res_sparse <- microbenchmark(
  "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6, sparse = TRUE),
  "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
  times = 10
)
summary(res_sparse)
```

Thanks again,


*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone
On Tue, Aug 6, 2024 at 10:20?AM James Pustejovsky <jepusto at gmail.com> wrote:

            

  
  
#
Hi Tamar,

Thanks for the reprex. Wow this surprised me. It looks like the biggest
source of compute time in vcalc() is assignments to the sparse matrix
representation of V. There might be a way to improve efficiency there, but
will have to confer with Wolfgang.

If you are able to reproduce the other issue with getting different results
between vcalc() and impute_covariance_matrix(), please do share.

Just to provide a little bit of additional context, I am deprecating
impute_covariance_matrix() because it is an outlier in the clubSandwich
package. It is essentially the only function in the package that is
specific to meta-analysis workflows. All other package functionality
involves computing cluster-robust variance estimators and associated
quantities (hypothesis tests, confidence intervals, etc.). In terms of
conceptual organization, it makes more sense for a function like
impute_covariance_matrix() to live in metafor. However,
impute_covariance_matrix() is not going to disappear entirely from
clubSandwich, so you could certainly continue using it in the short term
for purposes of computational efficiency.

James

On Tue, Aug 6, 2024 at 10:19?AM Tamar Novetsky <tamar at growprogress.ai>
wrote:

  
  
6 days later
#
Hi all,

Chiming in here after a while (coming back to the office after my summer break).

I didn't pay much attention to speed issues when writing vcalc(), since model fitting with large datasets is going to be the bottleneck, not the construction of the V matrix (whether constructing V takes 0.1 or 1 second hardly matters when model fitting can take minutes). Also, due to the additional functionality vcalc() provides, some of the internal computations are going to be slower. Not sure if there is a way to use Map() (as cleverly used in impute_covariance_matrix()) to avoid the double-loop I use in vcalc() -- pull requests welcome!

However, for the particular use case given here, one can speed things up further by distilling things down to the mere essence. For example:

vfast <- function(vi, study, rho) {

   ids <- unique(study)
   out <- lapply(ids, function(id) {
      sel <- study == id
      n <- sum(sel)
      R <- matrix(rho, n, n)
      diag(R) <- 1
      S <- diag(sqrt(vi[sel]), n, n)
      S %*% R %*% S
   })
   bldiag(out)

}

all.equal(vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
          vfast(dat_big$vi, dat_big$study, rho = 0.6))

# benchmark performance with full matrix
res <- microbenchmark(
  "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
  "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster = dat_big$study, r = 0.6, return_list = FALSE),
  "vfast" = vfast(dat_big$vi, dat_big$study, rho = 0.6),
  times = 10
)
summary(res)

This is now around 5 times faster than impute_covariance_matrix() (about 0.02 seconds on average) and I assume one could speed this up even further.

@James: The deprecation message (Please use `metaffor::vcalc()` instead.) has a typo (should be metafor::vcalc()).

Best,
Wolfgang
2 days later
#
On Tue, Aug 13, 2024 at 4:45?AM Viechtbauer, Wolfgang (NP) <
wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:

            
Shoot. Thanks for pointing this out. I/ve corrected it in the dev version
on Github.
#
My thanks to James, who made a pull request on Github (https://github.com/wviechtb/metafor/pull/85) with some tweaks that speed up the computations in vcalc() by a factor of 10 or so for large matrices. This has been merged into the devel version.

Best,
Wolfgang