[R-meta] Performance of metafor::vcalc() vs clubSandwich::impute_covariance_matrix()

Tue, Aug 13, 2024 2:45 AM

Hi all,

Chiming in here after a while (coming back to the office after my summer break).

I didn't pay much attention to speed issues when writing vcalc(), since model fitting with large datasets is going to be the bottleneck, not the construction of the V matrix (whether constructing V takes 0.1 or 1 second hardly matters when model fitting can take minutes). Also, due to the additional functionality vcalc() provides, some of the internal computations are going to be slower. Not sure if there is a way to use Map() (as cleverly used in impute_covariance_matrix()) to avoid the double-loop I use in vcalc() -- pull requests welcome!

However, for the particular use case given here, one can speed things up further by distilling things down to the mere essence. For example:

vfast <- function(vi, study, rho) {

   ids <- unique(study)
   out <- lapply(ids, function(id) {
      sel <- study == id
      n <- sum(sel)
      R <- matrix(rho, n, n)
      diag(R) <- 1
      S <- diag(sqrt(vi[sel]), n, n)
      S %*% R %*% S
   })
   bldiag(out)

}

all.equal(vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
          vfast(dat_big$vi, dat_big$study, rho = 0.6))

# benchmark performance with full matrix
res <- microbenchmark(
  "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
  "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster = dat_big$study, r = 0.6, return_list = FALSE),
  "vfast" = vfast(dat_big$vi, dat_big$study, rho = 0.6),
  times = 10
)
summary(res)

This is now around 5 times faster than impute_covariance_matrix() (about 0.02 seconds on average) and I assume one could speed this up even further.

@James: The deprecation message (Please use `metaffor::vcalc()` instead.) has a typo (should be metafor::vcalc()).

Best,
Wolfgang

-----Original Message-----
From: R-sig-meta-analysis <r-sig-meta-analysis-bounces at r-project.org> On Behalf
Of James Pustejovsky via R-sig-meta-analysis
Sent: Tuesday, August 6, 2024 20:50
To: Tamar Novetsky <tamar at growprogress.ai>
Cc: James Pustejovsky <jepusto at gmail.com>; R Special Interest Group for Meta-
Analysis <r-sig-meta-analysis at r-project.org>
Subject: Re: [R-meta] Performance of metafor::vcalc() vs
clubSandwich::impute_covariance_matrix()

Hi Tamar,

Thanks for the reprex. Wow this surprised me. It looks like the biggest
source of compute time in vcalc() is assignments to the sparse matrix
representation of V. There might be a way to improve efficiency there, but
will have to confer with Wolfgang.

If you are able to reproduce the other issue with getting different results
between vcalc() and impute_covariance_matrix(), please do share.

Just to provide a little bit of additional context, I am deprecating
impute_covariance_matrix() because it is an outlier in the clubSandwich
package. It is essentially the only function in the package that is
specific to meta-analysis workflows. All other package functionality
involves computing cluster-robust variance estimators and associated
quantities (hypothesis tests, confidence intervals, etc.). In terms of
conceptual organization, it makes more sense for a function like
impute_covariance_matrix() to live in metafor. However,
impute_covariance_matrix() is not going to disappear entirely from
clubSandwich, so you could certainly continue using it in the short term
for purposes of computational efficiency.

James

On Tue, Aug 6, 2024 at 10:19?AM Tamar Novetsky <tamar at growprogress.ai>
wrote:

Thanks so much, James! Unfortunately, I didn't find a big enough
improvement in performance using vcalc(sparse = TRUE) - in the example
below, the default vcalc arguments take ~100x longer than
impute_covariance_matrix, while vcalc(sparse = TRUE) takes ~60x longer.

I couldn't reproduce the 2x values using non-proprietary data, so there
might just be something weird going on with my dataset!

Reproducible example (adapted from metafor's examples in the vcalc
function documentation):
```
library(tidyverse)
library(metafor)
library(clubSandwich)
library(microbenchmark)
set.seed(42)

# example data from metafor
dat <- dat.assink2016

# augment data so it has >1500 rows
new_rows <-
  tibble(
    study = 18:167,
    n_esid = sample(x = 1:max(dat$esid), size = 150, replace = TRUE)
  ) %>%
  uncount(n_esid) %>%
  group_by(study) %>%
  mutate(esid = row_number()) %>%
  ungroup() %>%
  mutate(
    id = row_number() + 100,
    yi = rnorm(nrow(.), mean(dat$yi), sd(dat$yi)),
    vi = rnorm(nrow(.), mean(dat$vi), sd(dat$vi)),
    vi = if_else(vi < 0, -1*vi, vi), # make sure vi is always positive
    pubstatus = sample(x = dat$pubstatus, size = nrow(.), replace = TRUE),
    year = sample(x = dat$year, size = nrow(.), replace = TRUE),
    deltype = sample(x = dat$deltype, size = nrow(.), replace = TRUE)
  )
dat_big <- bind_rows(dat, new_rows)

# benchmark performance with full matrix (this takes a minute to run)
res <- microbenchmark(
  "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6),
  "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
  times = 10
)
summary(res)

# benchmark performance with sparse matrix (also takes a minute to run)
res_sparse <- microbenchmark(
  "metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6, sparse = TRUE),
  "clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
  times = 10
)
summary(res_sparse)
```

Thanks again,


*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone


On Tue, Aug 6, 2024 at 10:20?AM James Pustejovsky <jepusto at gmail.com>
wrote:

Hi Tamar,

The difference in compute time is because of a difference in how the
default output of these functions is structured.
clubSandwich::impute_covariance_matrix() returns a block-diagonal by
default. metafor::vcalc() returns a full (dense) matrix by default. Say
that you have J studies and study j has kj effect sizes. The block-diagonal
matrix has sum(kj^2) entries, whereas the full matrix has sum(kj)^2
entries. If J is large and the kjs are mostly small, this can make for a
really big difference in object size. However, setting the option
vcalc(sparse = TRUE) will return a block-diagonal matrix and should lead to
performance comparable to impute_covariance_matrix().

Regarding your second question, I'm not sure what might be going on.
Could you provide a reproducible example?

James

On Tue, Aug 6, 2024 at 8:20?AM Tamar Novetsky via R-sig-meta-analysis <
r-sig-meta-analysis at r-project.org> wrote:

Hello,

I am working on a script to run multiple meta-regressions on different
subsets of the same dataset, and have been
using clubSandwich::impute_covariance_matrix() to generate the
variance-covariance matrix necessary as an input to metafor::rma.mv().
However, I recently learned that impute_covariance_matrix() has been
superseded by metafor::vcalc(), so I have been working to replace my
usage
of the former function with the latter. In that process, I discovered
that
vcalc() seems to be much slower than impute_covariance_matrix() - about
150x slower in one use case that I benchmarked using the microbenchmark
package. Since I will be running this many times in a loop, performance
matters quite a lot to me in this context.

Can anyone help me understand why vcalc() would be so much slower? Is it
possible that I'm using it incorrectly?

Secondly/possibly relatedly, I found that the results from vcalc() are
always either exactly the same or exactly double the results from
impute_covariance_matrix(). Does anyone have a sense of why that would
be?
Could that be related to the performance differences?

Thanks so much for your help,

*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone

[R-meta] Performance of metafor::vcalc() vs clubSandwich::impute_covariance_matrix()

Thread (7 messages)