Hello, I am working on a script to run multiple meta-regressions on different subsets of the same dataset, and have been using clubSandwich::impute_covariance_matrix() to generate the variance-covariance matrix necessary as an input to metafor::rma.mv(). However, I recently learned that impute_covariance_matrix() has been superseded by metafor::vcalc(), so I have been working to replace my usage of the former function with the latter. In that process, I discovered that vcalc() seems to be much slower than impute_covariance_matrix() - about 150x slower in one use case that I benchmarked using the microbenchmark package. Since I will be running this many times in a loop, performance matters quite a lot to me in this context. Can anyone help me understand why vcalc() would be so much slower? Is it possible that I'm using it incorrectly? Secondly/possibly relatedly, I found that the results from vcalc() are always either exactly the same or exactly double the results from impute_covariance_matrix(). Does anyone have a sense of why that would be? Could that be related to the performance differences? Thanks so much for your help, *Tamar Novetsky* *(she/her)* Data Scientist I Eastern Time Zone
[R-meta] Performance of metafor::vcalc() vs clubSandwich::impute_covariance_matrix()
7 messages · Tamar Novetsky, James Pustejovsky, Wolfgang Viechtbauer
Hi Tamar, The difference in compute time is because of a difference in how the default output of these functions is structured. clubSandwich::impute_covariance_matrix() returns a block-diagonal by default. metafor::vcalc() returns a full (dense) matrix by default. Say that you have J studies and study j has kj effect sizes. The block-diagonal matrix has sum(kj^2) entries, whereas the full matrix has sum(kj)^2 entries. If J is large and the kjs are mostly small, this can make for a really big difference in object size. However, setting the option vcalc(sparse = TRUE) will return a block-diagonal matrix and should lead to performance comparable to impute_covariance_matrix(). Regarding your second question, I'm not sure what might be going on. Could you provide a reproducible example? James On Tue, Aug 6, 2024 at 8:20?AM Tamar Novetsky via R-sig-meta-analysis <
r-sig-meta-analysis at r-project.org> wrote:
Hello,
I am working on a script to run multiple meta-regressions on different
subsets of the same dataset, and have been
using clubSandwich::impute_covariance_matrix() to generate the
variance-covariance matrix necessary as an input to metafor::rma.mv().
However, I recently learned that impute_covariance_matrix() has been
superseded by metafor::vcalc(), so I have been working to replace my usage
of the former function with the latter. In that process, I discovered that
vcalc() seems to be much slower than impute_covariance_matrix() - about
150x slower in one use case that I benchmarked using the microbenchmark
package. Since I will be running this many times in a loop, performance
matters quite a lot to me in this context.
Can anyone help me understand why vcalc() would be so much slower? Is it
possible that I'm using it incorrectly?
Secondly/possibly relatedly, I found that the results from vcalc() are
always either exactly the same or exactly double the results from
impute_covariance_matrix(). Does anyone have a sense of why that would be?
Could that be related to the performance differences?
Thanks so much for your help,
*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone
[[alternative HTML version deleted]]
_______________________________________________ R-sig-meta-analysis mailing list @ R-sig-meta-analysis at r-project.org To manage your subscription to this mailing list, go to: https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis
Thanks so much, James! Unfortunately, I didn't find a big enough
improvement in performance using vcalc(sparse = TRUE) - in the example
below, the default vcalc arguments take ~100x longer than
impute_covariance_matrix, while vcalc(sparse = TRUE) takes ~60x longer.
I couldn't reproduce the 2x values using non-proprietary data, so there
might just be something weird going on with my dataset!
Reproducible example (adapted from metafor's examples in the vcalc function
documentation):
```
library(tidyverse)
library(metafor)
library(clubSandwich)
library(microbenchmark)
set.seed(42)
# example data from metafor
dat <- dat.assink2016
# augment data so it has >1500 rows
new_rows <-
tibble(
study = 18:167,
n_esid = sample(x = 1:max(dat$esid), size = 150, replace = TRUE)
) %>%
uncount(n_esid) %>%
group_by(study) %>%
mutate(esid = row_number()) %>%
ungroup() %>%
mutate(
id = row_number() + 100,
yi = rnorm(nrow(.), mean(dat$yi), sd(dat$yi)),
vi = rnorm(nrow(.), mean(dat$vi), sd(dat$vi)),
vi = if_else(vi < 0, -1*vi, vi), # make sure vi is always positive
pubstatus = sample(x = dat$pubstatus, size = nrow(.), replace = TRUE),
year = sample(x = dat$year, size = nrow(.), replace = TRUE),
deltype = sample(x = dat$deltype, size = nrow(.), replace = TRUE)
)
dat_big <- bind_rows(dat, new_rows)
# benchmark performance with full matrix (this takes a minute to run)
res <- microbenchmark(
"metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6),
"clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
times = 10
)
summary(res)
# benchmark performance with sparse matrix (also takes a minute to run)
res_sparse <- microbenchmark(
"metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6, sparse = TRUE),
"clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
times = 10
)
summary(res_sparse)
```
Thanks again,
*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone
On Tue, Aug 6, 2024 at 10:20?AM James Pustejovsky <jepusto at gmail.com> wrote:
Hi Tamar, The difference in compute time is because of a difference in how the default output of these functions is structured. clubSandwich::impute_covariance_matrix() returns a block-diagonal by default. metafor::vcalc() returns a full (dense) matrix by default. Say that you have J studies and study j has kj effect sizes. The block-diagonal matrix has sum(kj^2) entries, whereas the full matrix has sum(kj)^2 entries. If J is large and the kjs are mostly small, this can make for a really big difference in object size. However, setting the option vcalc(sparse = TRUE) will return a block-diagonal matrix and should lead to performance comparable to impute_covariance_matrix(). Regarding your second question, I'm not sure what might be going on. Could you provide a reproducible example? James On Tue, Aug 6, 2024 at 8:20?AM Tamar Novetsky via R-sig-meta-analysis < r-sig-meta-analysis at r-project.org> wrote:
Hello,
I am working on a script to run multiple meta-regressions on different
subsets of the same dataset, and have been
using clubSandwich::impute_covariance_matrix() to generate the
variance-covariance matrix necessary as an input to metafor::rma.mv().
However, I recently learned that impute_covariance_matrix() has been
superseded by metafor::vcalc(), so I have been working to replace my usage
of the former function with the latter. In that process, I discovered that
vcalc() seems to be much slower than impute_covariance_matrix() - about
150x slower in one use case that I benchmarked using the microbenchmark
package. Since I will be running this many times in a loop, performance
matters quite a lot to me in this context.
Can anyone help me understand why vcalc() would be so much slower? Is it
possible that I'm using it incorrectly?
Secondly/possibly relatedly, I found that the results from vcalc() are
always either exactly the same or exactly double the results from
impute_covariance_matrix(). Does anyone have a sense of why that would be?
Could that be related to the performance differences?
Thanks so much for your help,
*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone
[[alternative HTML version deleted]]
_______________________________________________ R-sig-meta-analysis mailing list @ R-sig-meta-analysis at r-project.org To manage your subscription to this mailing list, go to: https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis
Hi Tamar, Thanks for the reprex. Wow this surprised me. It looks like the biggest source of compute time in vcalc() is assignments to the sparse matrix representation of V. There might be a way to improve efficiency there, but will have to confer with Wolfgang. If you are able to reproduce the other issue with getting different results between vcalc() and impute_covariance_matrix(), please do share. Just to provide a little bit of additional context, I am deprecating impute_covariance_matrix() because it is an outlier in the clubSandwich package. It is essentially the only function in the package that is specific to meta-analysis workflows. All other package functionality involves computing cluster-robust variance estimators and associated quantities (hypothesis tests, confidence intervals, etc.). In terms of conceptual organization, it makes more sense for a function like impute_covariance_matrix() to live in metafor. However, impute_covariance_matrix() is not going to disappear entirely from clubSandwich, so you could certainly continue using it in the short term for purposes of computational efficiency. James On Tue, Aug 6, 2024 at 10:19?AM Tamar Novetsky <tamar at growprogress.ai> wrote:
Thanks so much, James! Unfortunately, I didn't find a big enough
improvement in performance using vcalc(sparse = TRUE) - in the example
below, the default vcalc arguments take ~100x longer than
impute_covariance_matrix, while vcalc(sparse = TRUE) takes ~60x longer.
I couldn't reproduce the 2x values using non-proprietary data, so there
might just be something weird going on with my dataset!
Reproducible example (adapted from metafor's examples in the vcalc
function documentation):
```
library(tidyverse)
library(metafor)
library(clubSandwich)
library(microbenchmark)
set.seed(42)
# example data from metafor
dat <- dat.assink2016
# augment data so it has >1500 rows
new_rows <-
tibble(
study = 18:167,
n_esid = sample(x = 1:max(dat$esid), size = 150, replace = TRUE)
) %>%
uncount(n_esid) %>%
group_by(study) %>%
mutate(esid = row_number()) %>%
ungroup() %>%
mutate(
id = row_number() + 100,
yi = rnorm(nrow(.), mean(dat$yi), sd(dat$yi)),
vi = rnorm(nrow(.), mean(dat$vi), sd(dat$vi)),
vi = if_else(vi < 0, -1*vi, vi), # make sure vi is always positive
pubstatus = sample(x = dat$pubstatus, size = nrow(.), replace = TRUE),
year = sample(x = dat$year, size = nrow(.), replace = TRUE),
deltype = sample(x = dat$deltype, size = nrow(.), replace = TRUE)
)
dat_big <- bind_rows(dat, new_rows)
# benchmark performance with full matrix (this takes a minute to run)
res <- microbenchmark(
"metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6),
"clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
times = 10
)
summary(res)
# benchmark performance with sparse matrix (also takes a minute to run)
res_sparse <- microbenchmark(
"metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6, sparse = TRUE),
"clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
times = 10
)
summary(res_sparse)
```
Thanks again,
*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone
On Tue, Aug 6, 2024 at 10:20?AM James Pustejovsky <jepusto at gmail.com>
wrote:
Hi Tamar, The difference in compute time is because of a difference in how the default output of these functions is structured. clubSandwich::impute_covariance_matrix() returns a block-diagonal by default. metafor::vcalc() returns a full (dense) matrix by default. Say that you have J studies and study j has kj effect sizes. The block-diagonal matrix has sum(kj^2) entries, whereas the full matrix has sum(kj)^2 entries. If J is large and the kjs are mostly small, this can make for a really big difference in object size. However, setting the option vcalc(sparse = TRUE) will return a block-diagonal matrix and should lead to performance comparable to impute_covariance_matrix(). Regarding your second question, I'm not sure what might be going on. Could you provide a reproducible example? James On Tue, Aug 6, 2024 at 8:20?AM Tamar Novetsky via R-sig-meta-analysis < r-sig-meta-analysis at r-project.org> wrote:
Hello,
I am working on a script to run multiple meta-regressions on different
subsets of the same dataset, and have been
using clubSandwich::impute_covariance_matrix() to generate the
variance-covariance matrix necessary as an input to metafor::rma.mv().
However, I recently learned that impute_covariance_matrix() has been
superseded by metafor::vcalc(), so I have been working to replace my
usage
of the former function with the latter. In that process, I discovered
that
vcalc() seems to be much slower than impute_covariance_matrix() - about
150x slower in one use case that I benchmarked using the microbenchmark
package. Since I will be running this many times in a loop, performance
matters quite a lot to me in this context.
Can anyone help me understand why vcalc() would be so much slower? Is it
possible that I'm using it incorrectly?
Secondly/possibly relatedly, I found that the results from vcalc() are
always either exactly the same or exactly double the results from
impute_covariance_matrix(). Does anyone have a sense of why that would
be?
Could that be related to the performance differences?
Thanks so much for your help,
*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone
[[alternative HTML version deleted]]
_______________________________________________ R-sig-meta-analysis mailing list @ R-sig-meta-analysis at r-project.org To manage your subscription to this mailing list, go to: https://stat.ethz.ch/mailman/listinfo/r-sig-meta-analysis
6 days later
Hi all,
Chiming in here after a while (coming back to the office after my summer break).
I didn't pay much attention to speed issues when writing vcalc(), since model fitting with large datasets is going to be the bottleneck, not the construction of the V matrix (whether constructing V takes 0.1 or 1 second hardly matters when model fitting can take minutes). Also, due to the additional functionality vcalc() provides, some of the internal computations are going to be slower. Not sure if there is a way to use Map() (as cleverly used in impute_covariance_matrix()) to avoid the double-loop I use in vcalc() -- pull requests welcome!
However, for the particular use case given here, one can speed things up further by distilling things down to the mere essence. For example:
vfast <- function(vi, study, rho) {
ids <- unique(study)
out <- lapply(ids, function(id) {
sel <- study == id
n <- sum(sel)
R <- matrix(rho, n, n)
diag(R) <- 1
S <- diag(sqrt(vi[sel]), n, n)
S %*% R %*% S
})
bldiag(out)
}
all.equal(vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
vfast(dat_big$vi, dat_big$study, rho = 0.6))
# benchmark performance with full matrix
res <- microbenchmark(
"metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho = 0.6),
"clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster = dat_big$study, r = 0.6, return_list = FALSE),
"vfast" = vfast(dat_big$vi, dat_big$study, rho = 0.6),
times = 10
)
summary(res)
This is now around 5 times faster than impute_covariance_matrix() (about 0.02 seconds on average) and I assume one could speed this up even further.
@James: The deprecation message (Please use `metaffor::vcalc()` instead.) has a typo (should be metafor::vcalc()).
Best,
Wolfgang
-----Original Message----- From: R-sig-meta-analysis <r-sig-meta-analysis-bounces at r-project.org> On Behalf Of James Pustejovsky via R-sig-meta-analysis Sent: Tuesday, August 6, 2024 20:50 To: Tamar Novetsky <tamar at growprogress.ai> Cc: James Pustejovsky <jepusto at gmail.com>; R Special Interest Group for Meta- Analysis <r-sig-meta-analysis at r-project.org> Subject: Re: [R-meta] Performance of metafor::vcalc() vs clubSandwich::impute_covariance_matrix() Hi Tamar, Thanks for the reprex. Wow this surprised me. It looks like the biggest source of compute time in vcalc() is assignments to the sparse matrix representation of V. There might be a way to improve efficiency there, but will have to confer with Wolfgang. If you are able to reproduce the other issue with getting different results between vcalc() and impute_covariance_matrix(), please do share. Just to provide a little bit of additional context, I am deprecating impute_covariance_matrix() because it is an outlier in the clubSandwich package. It is essentially the only function in the package that is specific to meta-analysis workflows. All other package functionality involves computing cluster-robust variance estimators and associated quantities (hypothesis tests, confidence intervals, etc.). In terms of conceptual organization, it makes more sense for a function like impute_covariance_matrix() to live in metafor. However, impute_covariance_matrix() is not going to disappear entirely from clubSandwich, so you could certainly continue using it in the short term for purposes of computational efficiency. James On Tue, Aug 6, 2024 at 10:19?AM Tamar Novetsky <tamar at growprogress.ai> wrote:
Thanks so much, James! Unfortunately, I didn't find a big enough
improvement in performance using vcalc(sparse = TRUE) - in the example
below, the default vcalc arguments take ~100x longer than
impute_covariance_matrix, while vcalc(sparse = TRUE) takes ~60x longer.
I couldn't reproduce the 2x values using non-proprietary data, so there
might just be something weird going on with my dataset!
Reproducible example (adapted from metafor's examples in the vcalc
function documentation):
```
library(tidyverse)
library(metafor)
library(clubSandwich)
library(microbenchmark)
set.seed(42)
# example data from metafor
dat <- dat.assink2016
# augment data so it has >1500 rows
new_rows <-
tibble(
study = 18:167,
n_esid = sample(x = 1:max(dat$esid), size = 150, replace = TRUE)
) %>%
uncount(n_esid) %>%
group_by(study) %>%
mutate(esid = row_number()) %>%
ungroup() %>%
mutate(
id = row_number() + 100,
yi = rnorm(nrow(.), mean(dat$yi), sd(dat$yi)),
vi = rnorm(nrow(.), mean(dat$vi), sd(dat$vi)),
vi = if_else(vi < 0, -1*vi, vi), # make sure vi is always positive
pubstatus = sample(x = dat$pubstatus, size = nrow(.), replace = TRUE),
year = sample(x = dat$year, size = nrow(.), replace = TRUE),
deltype = sample(x = dat$deltype, size = nrow(.), replace = TRUE)
)
dat_big <- bind_rows(dat, new_rows)
# benchmark performance with full matrix (this takes a minute to run)
res <- microbenchmark(
"metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6),
"clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
times = 10
)
summary(res)
# benchmark performance with sparse matrix (also takes a minute to run)
res_sparse <- microbenchmark(
"metafor" = vcalc(vi, cluster = study, obs = esid, data = dat_big, rho =
0.6, sparse = TRUE),
"clubSandwich" = impute_covariance_matrix(vi = dat_big$vi, cluster =
dat_big$study, r = 0.6, return_list = FALSE),
times = 10
)
summary(res_sparse)
```
Thanks again,
*Tamar Novetsky* *(she/her)*
Data Scientist I
Eastern Time Zone
On Tue, Aug 6, 2024 at 10:20?AM James Pustejovsky <jepusto at gmail.com>
wrote:
Hi Tamar, The difference in compute time is because of a difference in how the default output of these functions is structured. clubSandwich::impute_covariance_matrix() returns a block-diagonal by default. metafor::vcalc() returns a full (dense) matrix by default. Say that you have J studies and study j has kj effect sizes. The block-diagonal matrix has sum(kj^2) entries, whereas the full matrix has sum(kj)^2 entries. If J is large and the kjs are mostly small, this can make for a really big difference in object size. However, setting the option vcalc(sparse = TRUE) will return a block-diagonal matrix and should lead to performance comparable to impute_covariance_matrix(). Regarding your second question, I'm not sure what might be going on. Could you provide a reproducible example? James On Tue, Aug 6, 2024 at 8:20?AM Tamar Novetsky via R-sig-meta-analysis < r-sig-meta-analysis at r-project.org> wrote:
Hello, I am working on a script to run multiple meta-regressions on different subsets of the same dataset, and have been using clubSandwich::impute_covariance_matrix() to generate the variance-covariance matrix necessary as an input to metafor::rma.mv(). However, I recently learned that impute_covariance_matrix() has been superseded by metafor::vcalc(), so I have been working to replace my usage of the former function with the latter. In that process, I discovered that vcalc() seems to be much slower than impute_covariance_matrix() - about 150x slower in one use case that I benchmarked using the microbenchmark package. Since I will be running this many times in a loop, performance matters quite a lot to me in this context. Can anyone help me understand why vcalc() would be so much slower? Is it possible that I'm using it incorrectly? Secondly/possibly relatedly, I found that the results from vcalc() are always either exactly the same or exactly double the results from impute_covariance_matrix(). Does anyone have a sense of why that would be? Could that be related to the performance differences? Thanks so much for your help, *Tamar Novetsky* *(she/her)* Data Scientist I Eastern Time Zone
2 days later
On Tue, Aug 13, 2024 at 4:45?AM Viechtbauer, Wolfgang (NP) <
wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:
@James: The deprecation message (Please use `metaffor::vcalc()` instead.) has a typo (should be metafor::vcalc()).
Shoot. Thanks for pointing this out. I/ve corrected it in the dev version on Github.
My thanks to James, who made a pull request on Github (https://github.com/wviechtb/metafor/pull/85) with some tweaks that speed up the computations in vcalc() by a factor of 10 or so for large matrices. This has been merged into the devel version. Best, Wolfgang
-----Original Message----- From: James Pustejovsky <jepusto at gmail.com> Sent: Friday, August 16, 2024 04:23 To: Viechtbauer, Wolfgang (NP) <wolfgang.viechtbauer at maastrichtuniversity.nl> Cc: R Special Interest Group for Meta-Analysis <r-sig-meta-analysis at r- project.org>; Tamar Novetsky <tamar at growprogress.ai> Subject: Re: [R-meta] Performance of metafor::vcalc() vs clubSandwich::impute_covariance_matrix() On Tue, Aug 13, 2024 at 4:45?AM Viechtbauer, Wolfgang (NP) <mailto:wolfgang.viechtbauer at maastrichtuniversity.nl> wrote: @James: The deprecation message (Please use `metaffor::vcalc()` instead.) has a typo (should be metafor::vcalc()). Shoot. Thanks for pointing this out. I/ve corrected it in the dev version on Github.