[R-meta] TR: cooks.distance.rma.mv is slow on complex models

3 messages · Viechtbauer Wolfgang (STAT), Martineau, Roger

#
Dear Wolfgang,

Thanks a lot for the follow-up and the improvement to cooks.distance.rma.mv().

I just re-computed Cook's distance values on the 4-level model and the function now finishes in 1 min 11 sec instead of the previous 14 min 5 sec. I can certainly live with that. It is a great improvement, and I especially like the option of computing Cook's distance values for entire groups/clusters of estimates.

Adding the argument progbar = TRUE to the function works well.

Thanks again,

Roger

-----Original Message-----
From: Viechtbauer Wolfgang (SP) [mailto:wolfgang.viechtbauer at maastrichtuniversity.nl]
Sent: July 31, 2017 16:56
To: r-sig-meta-analysis at r-project.org
Cc: Martineau, Roger
Subject: RE: cooks.distance.rma.mv is slow on complex models

Just a follow-up on this:

If you install the devel version of metafor (http://www.metafor-project.org/doku.php/installation#development_version), you will find a lot of improvements to cooks.distance.rma.mv(). In particular, it:

1) now can do parallel processing,
2) generally should run (quite a bit?) faster (starting values for the repeated model fits are set to the parameter estimates from the 'full' model using all data -- which are likely to be much better starting values than the default ones),
3) offers the possibility to compute approximate Cook's distance values where the variance/correlation components are not re-estimated for each model fit (reestimate=FALSE); doing so only yields an approximation to the Cook's distances that ignores the influence on the variance/correlation components, but is considerably faster (and often yields similar results), and
4) has a 'cluster' argument that allows computing Cook's distances not just for individual estimates, but for entire groups/clusters of estimates.
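To make the new options concrete, here is a rough sketch combining them (argument names as in the devel version at the time of writing -- double-check against the current help page for cooks.distance.rma.mv; the Konstantopoulos 2011 data are just a stand-in example):

```r
library(metafor)

### fit a multilevel model (two nesting levels as an example)
dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)

### approximate Cook's distances without re-estimating the
### variance components, using two cores
cd <- cooks.distance(res, reestimate=FALSE, parallel="snow", ncpus=2)

### Cook's distances for entire clusters (here: districts)
cd.cluster <- cooks.distance(res, cluster=dat$district)
```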

In Roger's analysis: Fit model 'tmp.casdiet' with 'random = ~1|laboratory/experiment/study' and then use:

cooks.distance(tmp.casdiet) ### default is Cook's distance for all estimates
cooks.distance(tmp.casdiet, cluster=tmp.dat.MTPY.new$laboratory)
cooks.distance(tmp.casdiet, cluster=tmp.dat.MTPY.new$experiment)

Not entirely sure about the last one -- it depends on how you have coded 'experiment'; if it is not unique across the levels of 'laboratory', you would want to use:

cooks.distance(tmp.casdiet, cluster=interaction(tmp.dat.MTPY.new$laboratory, tmp.dat.MTPY.new$experiment))

Best,
Wolfgang

-- 
Wolfgang Viechtbauer, Ph.D., Statistician | Department of Psychiatry and    
Neuropsychology | Maastricht University | P.O. Box 616 (VIJV1) | 6200 MD    
Maastricht, The Netherlands | +31 (43) 388-4170 | http://www.wvbauer.com    

-----Original Message-----
From: R-sig-meta-analysis [mailto:r-sig-meta-analysis-bounces at r-project.org] On Behalf Of Viechtbauer Wolfgang (SP)
Sent: Tuesday, July 25, 2017 19:54
To: r-sig-meta-analysis at r-project.org
Cc: Martineau, Roger
Subject: Re: [R-meta] cooks.distance.rma.mv is slow on complex models

Hi Roger,

In order to compute Cook's distance for the ith estimate, we have to refit the model with the ith estimate removed. For simple models like those fit by lm(), there are shortcuts that we can use to avoid actually having to refit the model. See:

https://en.wikipedia.org/wiki/Cook%27s_distance

For more complex models, those shortcuts do not exist (at least, I am not aware of them). So, right now, cooks.distance() crunches through k model fits (k being the number of estimates), which can take a considerable amount of time for complex models and large datasets. Yes, that kind of sucks.

On my to-do list is to add parallel processing to cooks.distance(). profile() and gosh() are similar functions that do repeated model fits, and those can already run in parallel; I plan to do the same for cooks.distance(). At least then, you can cut down on the time if you have a computer with multiple cores.

In the meantime, you could give Microsoft R Open (MRO) a try. There is nothing fundamentally different about MRO, except that it directly comes with Intel's Math Kernel Library (MKL) and that is likely to speed up model fitting quite a bit. See:

http://www.metafor-project.org/doku.php/tips:speeding_up_model_fitting

Alternatively, you could consider computing a sort of 'pseudo Cook's distance' based on the same idea as how Cook's distance is computed for standard linear models. In essence, it's just some kind of squared standardized residual multiplied by h/(1-h)^2, where h is the hatvalue. A simple example:

dat <- get(data(dat.bcg))
dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
res <- rma(yi, vi, data=dat)

sav1 <- cooks.distance(res)
sav2 <- rstandard(res)$z^2 * hatvalues(res) / (1-hatvalues(res))^2

plot(sav1, sav2)
cor(sav1, sav2)

Not exactly the same thing, but quite close. A more complex example:

dat <- get(data(dat.konstantopoulos2011))
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)

sav1 <- cooks.distance(res)
sav2 <- rstandard(res)$z^2 * hatvalues(res) / (1-hatvalues(res))^2

plot(sav1, sav2)
cor(sav1, sav2)

Okay, now things no longer align so nicely, but this may still be okay as a rough check.

Best,
Wolfgang
#
Good to hear that.

By the way, there is now also rstudent() for 'rma.mv' objects (also with a 'cluster' argument, option for parallel processing, and 'reestimate' argument). Also, rstandard() now has a cluster argument. When using the cluster argument with rstandard() and rstudent(), the functions also compute cluster-level multivariate (internally or externally) standardized residuals. So, all of the tools are there for proper outlier diagnostics in 'rma.mv' models (i.e., one can check for outlying estimates and clusters).

For checking for influential estimates/clusters, there is cooks.distance() and I also just added dfbetas() for 'rma.mv' models (but I usually find Cook's distances sufficient). I will add an influence.rma.mv() function soon that will also provide covariance ratios (those can be a useful addition to Cook's distances).
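To illustrate on the earlier example (a sketch only; the function names are as described above, so double-check the current docs for the exact arguments):

```r
library(metafor)

dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)

### externally standardized residuals, per estimate and per cluster
rstudent(res)
rstudent(res, cluster=dat$district)

### influence of each estimate on the individual model coefficients
dfbetas(res)
```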

Best,
Wolfgang

-----Original Message-----
From: Martineau, Roger [mailto:Roger.Martineau at AGR.GC.CA] 
Sent: Wednesday, August 16, 2017 18:38
To: Viechtbauer Wolfgang (SP)
Cc: r-sig-meta-analysis at r-project.org
Subject: TR: cooks.distance.rma.mv is slow on complex models

[quoted text repeated above omitted]
#
Dear Wolfgang,

Thanks a lot.

These are major improvements for analyzing complex models.

Best,

Roger

-----Original Message-----
From: Viechtbauer Wolfgang (SP) [mailto:wolfgang.viechtbauer at maastrichtuniversity.nl]
Sent: August 16, 2017 12:52
To: Martineau, Roger
Cc: r-sig-meta-analysis at r-project.org
Subject: RE: cooks.distance.rma.mv is slow on complex models

[quoted text repeated above omitted]