As we know, the default BiocParallel backends are currently set to MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to some extent because a new user running, say, bplapply() without additional arguments or set-up would expect some kind of parallelization. However, from a developer?s perspective, I would argue that it makes more sense to use SerialParam() by default. 1. It avoids problems with MulticoreParam stalling (especially on Macs) when the randomly chosen port is in already use. This used to be a major problem, to the point that all my BiocParallel-using functions in scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as package default would ensure BiocParallel functions run properly in the first place; if the code stalls due to switching to MulticoreParam, then it?s obvious where the problem lies (and how to fix it). 2. It avoids the alteration of the random seed when the MulticoreParam instance is constructed for the first time. library(BiocParallel) # new R session set.seed(100) invisible(bplapply(1:5, identity)) rnorm(1) # 0.1315312 set.seed(100) invisible(bplapply(1:5, identity)) rnorm(1) # -0.5021924 This is because the first bplapply() call calls bpparam(), which constructs a MulticoreParam() for the first time; this calls the PRNG to choose a random port number. Ensuing random numbers are altered, as seen above. To avoid this, I need to define the MulticoreParam() object prior to set.seed(), which undermines the utility of a default-defined bpparam(). 3. Job dispatch via SnowParam() is quite slow, which potentially makes Windows package builds run slower by default. A particularly bad example is that of scran::fastMNN(), which has a few matrix multiplications that use DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting in SNOW parallelization on Windows. This slowed down fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the faster option here. A related problem is MulticoreParam()?s tendency to copy the environment, which may result in problems from inflated memory consumption. So, can we default to SerialParam() on all platforms? And by this I mean the BiocParallel in-built default - I don?t want to have to instruct all my users to put a ?register(SerialParam())? at the start of their analysis scripts. I feel that BiocParallel?s job is to provide downstream code with the potential for parallelization. If end-users want actual parallelization, they had better be prepared to specify an appropriate scheme via *Param() objects. -A
[Bioc-devel] Using SerialParam() as the registered back-end for all platforms
9 messages · Martin Morgan, Aaron Lun, Henrik Bengtsson +2 more
I hope for 1. to have a 'local socket' (i.e., not using ports) implementation shortly. I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We now have
library(BiocParallel) set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538 at the expensive of using the generator when the package is loaded.
set.seed(1); rnorm(1)
[1] -0.6264538
set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433 Is that bad? It will be consistent across platforms. This behavior
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] 0.9624337 0.8925947
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597 0.1102093 seems wrong, but is consistent with mclapply
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527 0.40721777
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.8239765 1.2957928 The documented behavior is to us the RNGseed= argument to *Param, but I think it could be made consistent (by default, obey the global random number seed on workers) at least on a single machine (where the default number of cores is constant). I have not (yet?) changed the default behavior to SerialParam. I guess the cost of SerialParam is from the dependent packages that need to be loaded
system.time(suppressPackageStartupMessages(library(DelayedArray)))
user system elapsed
3.068 0.082 3.150
If fastMNN() makes several calls to bplapply(), it might make sense to start the default cluster at the top of the function once
if (!isup(bpparam())) {
bpstart(bpparam())
on.exit(bpstop(bpparam()))
}
Martin
?On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <bioc-devel-bounces at r-project.org on behalf of infinite.monkeys.with.keyboards at gmail.com> wrote:
As we know, the default BiocParallel backends are currently set to MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to some extent because a new user running, say, bplapply() without additional arguments or set-up would expect some kind of parallelization. However, from a developer?s perspective, I would argue that it makes more sense to use SerialParam() by default.
1. It avoids problems with MulticoreParam stalling (especially on Macs) when the randomly chosen port is in already use. This used to be a major problem, to the point that all my BiocParallel-using functions in scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as package default would ensure BiocParallel functions run properly in the first place; if the code stalls due to switching to MulticoreParam, then it?s obvious where the problem lies (and how to fix it).
2. It avoids the alteration of the random seed when the MulticoreParam instance is constructed for the first time.
library(BiocParallel) # new R session
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # 0.1315312
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # -0.5021924
This is because the first bplapply() call calls bpparam(), which constructs a MulticoreParam() for the first time; this calls the PRNG to choose a random port number. Ensuing random numbers are altered, as seen above. To avoid this, I need to define the MulticoreParam() object prior to set.seed(), which undermines the utility of a default-defined bpparam().
3. Job dispatch via SnowParam() is quite slow, which potentially makes Windows package builds run slower by default. A particularly bad example is that of scran::fastMNN(), which has a few matrix multiplications that use DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting in SNOW parallelization on Windows. This slowed down fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the faster option here. A related problem is MulticoreParam()?s tendency to copy the environment, which may result in problems from inflated memory consumption.
So, can we default to SerialParam() on all platforms? And by this I mean the BiocParallel in-built default - I don?t want to have to instruct all my users to put a ?register(SerialParam())? at the start of their analysis scripts. I feel that BiocParallel?s job is to provide downstream code with the potential for parallelization. If end-users want actual parallelization, they had better be prepared to specify an appropriate scheme via *Param() objects.
-A
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
I hope for 1. to have a 'local socket' (i.e., not using ports) implementation shortly.
Yes, that would be helpful.
I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We now have
library(BiocParallel) set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538 at the expensive of using the generator when the package is loaded.
set.seed(1); rnorm(1)
[1] -0.6264538
set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433 Is that bad? It will be consistent across platforms.
Hm. I guess the changed behaviour is? better, in the sense that the second scenario (setting the seed before loading the package) is less likely in real analysis code. Even so, there are probably some edge cases where this could cause issues, e.g., when: set.seed(1) MyPackage::SomeFun() ? where MyPackage causes BiocParallel to be attached, which presumably changes the seed. Having thought about it for a while, the fact that bpparam() changes the random seed is only a secondary issue. The main issue is that it doesn?t change the seed reproducibly. So I wouldn?t mind *as much* if repeated calls of: set.seed(1) bpparam() rnorm(1) ? gave the same result, even if it were different from just running ?set.seed(1); rnorm(1)?. (Mind you, I?d still mind a little, but it wouldn?t be so bad.) The biggest problem with the current state of affairs is that the first call gives different results to all subsequent calls, which really interferes with debugging attempts.
This behavior
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] 0.9624337 0.8925947
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597 0.1102093 seems wrong, but is consistent with mclapply
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527 0.40721777
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.8239765 1.2957928 The documented behavior is to us the RNGseed= argument to *Param, but I think it could be made consistent (by default, obey the global random number seed on workers) at least on a single machine (where the default number of cores is constant).
I?m less concerned with that behaviour, given it?s inherently hard to take randomization code written for serial execution and make it give the same results on multiple cores (as we discussed elsewhere).
I have not (yet?) changed the default behavior to SerialParam. I guess the cost of SerialParam is from the dependent packages that need to be loaded
system.time(suppressPackageStartupMessages(library(DelayedArray)))
user system elapsed 3.068 0.082 3.150
Does calling "SerialParam()? cause DelayedArray to be attached? That seems odd.
If fastMNN() makes several calls to bplapply(), it might make sense to start the default cluster at the top of the function once
if (!isup(bpparam())) {
bpstart(bpparam())
on.exit(bpstop(bpparam()))
}
This is probably a good idea to do in general to all of my parallelized functions, though I don?t know how much this will solve the time problem. Perhaps I should just do it and see. -A
?On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <bioc-devel-bounces at r-project.org on behalf of infinite.monkeys.with.keyboards at gmail.com> wrote: As we know, the default BiocParallel backends are currently set to MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to some extent because a new user running, say, bplapply() without additional arguments or set-up would expect some kind of parallelization. However, from a developer?s perspective, I would argue that it makes more sense to use SerialParam() by default. 1. It avoids problems with MulticoreParam stalling (especially on Macs) when the randomly chosen port is in already use. This used to be a major problem, to the point that all my BiocParallel-using functions in scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as package default would ensure BiocParallel functions run properly in the first place; if the code stalls due to switching to MulticoreParam, then it?s obvious where the problem lies (and how to fix it). 2. It avoids the alteration of the random seed when the MulticoreParam instance is constructed for the first time. library(BiocParallel) # new R session set.seed(100) invisible(bplapply(1:5, identity)) rnorm(1) # 0.1315312 set.seed(100) invisible(bplapply(1:5, identity)) rnorm(1) # -0.5021924 This is because the first bplapply() call calls bpparam(), which constructs a MulticoreParam() for the first time; this calls the PRNG to choose a random port number. Ensuing random numbers are altered, as seen above. To avoid this, I need to define the MulticoreParam() object prior to set.seed(), which undermines the utility of a default-defined bpparam(). 3. Job dispatch via SnowParam() is quite slow, which potentially makes Windows package builds run slower by default. A particularly bad example is that of scran::fastMNN(), which has a few matrix multiplications that use DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting in SNOW parallelization on Windows. This slowed down fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the faster option here. A related problem is MulticoreParam()?s tendency to copy the environment, which may result in problems from inflated memory consumption. So, can we default to SerialParam() on all platforms? And by this I mean the BiocParallel in-built default - I don?t want to have to instruct all my users to put a ?register(SerialParam())? at the start of their analysis scripts. I feel that BiocParallel?s job is to provide downstream code with the potential for parallelization. If end-users want actual parallelization, they had better be prepared to specify an appropriate scheme via *Param() objects. -A [[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
I don't know if this is helpful for BiocParallel, but there's an extension for the foreach package that ensures reproducible RNG behavior for all parallel backends: https://cran.r-project.org/web/packages/doRNG/index.html Perhaps some of the principles from that package can be re-used? On Mon, Jan 7, 2019 at 9:37 AM Aaron Lun <
infinite.monkeys.with.keyboards at gmail.com> wrote:
I hope for 1. to have a 'local socket' (i.e., not using ports)
implementation shortly. Yes, that would be helpful.
I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We
now have
library(BiocParallel) set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538 at the expensive of using the generator when the package is loaded.
set.seed(1); rnorm(1)
[1] -0.6264538
set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433 Is that bad? It will be consistent across platforms.
Hm. I guess the changed behaviour is? better, in the sense that the second scenario (setting the seed before loading the package) is less likely in real analysis code. Even so, there are probably some edge cases where this could cause issues, e.g., when: set.seed(1) MyPackage::SomeFun() ? where MyPackage causes BiocParallel to be attached, which presumably changes the seed. Having thought about it for a while, the fact that bpparam() changes the random seed is only a secondary issue. The main issue is that it doesn?t change the seed reproducibly. So I wouldn?t mind *as much* if repeated calls of: set.seed(1) bpparam() rnorm(1) ? gave the same result, even if it were different from just running ?set.seed(1); rnorm(1)?. (Mind you, I?d still mind a little, but it wouldn?t be so bad.) The biggest problem with the current state of affairs is that the first call gives different results to all subsequent calls, which really interferes with debugging attempts.
This behavior
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] 0.9624337 0.8925947
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597 0.1102093 seems wrong, but is consistent with mclapply
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527 0.40721777
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.8239765 1.2957928 The documented behavior is to us the RNGseed= argument to *Param, but I
think it could be made consistent (by default, obey the global random number seed on workers) at least on a single machine (where the default number of cores is constant). I?m less concerned with that behaviour, given it?s inherently hard to take randomization code written for serial execution and make it give the same results on multiple cores (as we discussed elsewhere).
I have not (yet?) changed the default behavior to SerialParam. I guess
the cost of SerialParam is from the dependent packages that need to be loaded
system.time(suppressPackageStartupMessages(library(DelayedArray)))
user system elapsed 3.068 0.082 3.150
Does calling "SerialParam()? cause DelayedArray to be attached? That seems odd.
If fastMNN() makes several calls to bplapply(), it might make sense to
start the default cluster at the top of the function once
if (!isup(bpparam())) {
bpstart(bpparam())
on.exit(bpstop(bpparam()))
}
This is probably a good idea to do in general to all of my parallelized functions, though I don?t know how much this will solve the time problem. Perhaps I should just do it and see. -A
?On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <
bioc-devel-bounces at r-project.org on behalf of infinite.monkeys.with.keyboards at gmail.com> wrote:
As we know, the default BiocParallel backends are currently set to
MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to some extent because a new user running, say, bplapply() without additional arguments or set-up would expect some kind of parallelization. However, from a developer?s perspective, I would argue that it makes more sense to use SerialParam() by default.
1. It avoids problems with MulticoreParam stalling (especially on
Macs) when the randomly chosen port is in already use. This used to be a major problem, to the point that all my BiocParallel-using functions in scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as package default would ensure BiocParallel functions run properly in the first place; if the code stalls due to switching to MulticoreParam, then it?s obvious where the problem lies (and how to fix it).
2. It avoids the alteration of the random seed when the
MulticoreParam instance is constructed for the first time.
library(BiocParallel) # new R session set.seed(100) invisible(bplapply(1:5, identity)) rnorm(1) # 0.1315312 set.seed(100) invisible(bplapply(1:5, identity)) rnorm(1) # -0.5021924 This is because the first bplapply() call calls bpparam(), which
constructs a MulticoreParam() for the first time; this calls the PRNG to choose a random port number. Ensuing random numbers are altered, as seen above. To avoid this, I need to define the MulticoreParam() object prior to set.seed(), which undermines the utility of a default-defined bpparam().
3. Job dispatch via SnowParam() is quite slow, which potentially
makes Windows package builds run slower by default. A particularly bad example is that of scran::fastMNN(), which has a few matrix multiplications that use DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting in SNOW parallelization on Windows. This slowed down fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the faster option here. A related problem is MulticoreParam()?s tendency to copy the environment, which may result in problems from inflated memory consumption.
So, can we default to SerialParam() on all platforms? And by this I
mean the BiocParallel in-built default - I don?t want to have to instruct all my users to put a ?register(SerialParam())? at the start of their analysis scripts. I feel that BiocParallel?s job is to provide downstream code with the potential for parallelization. If end-users want actual parallelization, they had better be prepared to specify an appropriate scheme via *Param() objects.
-A
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
The main problem I?ve described refers to changes in the random seed due to the MulticoreParam() constructor, prior to dispatch to workers. For the related-but-separate problem of obtaining consistent random results within each worker, we?ve been discussing the possible solutions on another Bioc-devel thread (https://stat.ethz.ch/pipermail/bioc-devel/2019-January/014498.html <https://stat.ethz.ch/pipermail/bioc-devel/2019-January/014498.html>). -A
On 7 Jan 2019, at 15:03, Ryan Thompson <rct at thompsonclan.org> wrote: I don't know if this is helpful for BiocParallel, but there's an extension for the foreach package that ensures reproducible RNG behavior for all parallel backends: https://cran.r-project.org/web/packages/doRNG/index.html <https://cran.r-project.org/web/packages/doRNG/index.html> Perhaps some of the principles from that package can be re-used? On Mon, Jan 7, 2019 at 9:37 AM Aaron Lun <infinite.monkeys.with.keyboards at gmail.com <mailto:infinite.monkeys.with.keyboards at gmail.com>> wrote:
I hope for 1. to have a 'local socket' (i.e., not using ports) implementation shortly.
Yes, that would be helpful.
I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We now have
library(BiocParallel) set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538 at the expensive of using the generator when the package is loaded.
set.seed(1); rnorm(1)
[1] -0.6264538
set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433 Is that bad? It will be consistent across platforms.
Hm. I guess the changed behaviour is? better, in the sense that the second scenario (setting the seed before loading the package) is less likely in real analysis code. Even so, there are probably some edge cases where this could cause issues, e.g., when: set.seed(1) MyPackage::SomeFun() ? where MyPackage causes BiocParallel to be attached, which presumably changes the seed. Having thought about it for a while, the fact that bpparam() changes the random seed is only a secondary issue. The main issue is that it doesn?t change the seed reproducibly. So I wouldn?t mind *as much* if repeated calls of: set.seed(1) bpparam() rnorm(1) ? gave the same result, even if it were different from just running ?set.seed(1); rnorm(1)?. (Mind you, I?d still mind a little, but it wouldn?t be so bad.) The biggest problem with the current state of affairs is that the first call gives different results to all subsequent calls, which really interferes with debugging attempts.
This behavior
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] 0.9624337 0.8925947
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597 0.1102093 seems wrong, but is consistent with mclapply
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527 0.40721777
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.8239765 1.2957928 The documented behavior is to us the RNGseed= argument to *Param, but I think it could be made consistent (by default, obey the global random number seed on workers) at least on a single machine (where the default number of cores is constant).
I?m less concerned with that behaviour, given it?s inherently hard to take randomization code written for serial execution and make it give the same results on multiple cores (as we discussed elsewhere).
I have not (yet?) changed the default behavior to SerialParam. I guess the cost of SerialParam is from the dependent packages that need to be loaded
system.time(suppressPackageStartupMessages(library(DelayedArray)))
user system elapsed 3.068 0.082 3.150
Does calling "SerialParam()? cause DelayedArray to be attached? That seems odd.
If fastMNN() makes several calls to bplapply(), it might make sense to start the default cluster at the top of the function once
if (!isup(bpparam())) {
bpstart(bpparam())
on.exit(bpstop(bpparam()))
}
This is probably a good idea to do in general to all of my parallelized functions, though I don?t know how much this will solve the time problem. Perhaps I should just do it and see. -A
?On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <bioc-devel-bounces at r-project.org <mailto:bioc-devel-bounces at r-project.org> on behalf of infinite.monkeys.with.keyboards at gmail.com <mailto:infinite.monkeys.with.keyboards at gmail.com>> wrote:
As we know, the default BiocParallel backends are currently set to MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to some extent because a new user running, say, bplapply() without additional arguments or set-up would expect some kind of parallelization. However, from a developer?s perspective, I would argue that it makes more sense to use SerialParam() by default.
1. It avoids problems with MulticoreParam stalling (especially on Macs) when the randomly chosen port is in already use. This used to be a major problem, to the point that all my BiocParallel-using functions in scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as package default would ensure BiocParallel functions run properly in the first place; if the code stalls due to switching to MulticoreParam, then it?s obvious where the problem lies (and how to fix it).
2. It avoids the alteration of the random seed when the MulticoreParam instance is constructed for the first time.
library(BiocParallel) # new R session
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # 0.1315312
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # -0.5021924
This is because the first bplapply() call calls bpparam(), which constructs a MulticoreParam() for the first time; this calls the PRNG to choose a random port number. Ensuing random numbers are altered, as seen above. To avoid this, I need to define the MulticoreParam() object prior to set.seed(), which undermines the utility of a default-defined bpparam().
3. Job dispatch via SnowParam() is quite slow, which potentially makes Windows package builds run slower by default. A particularly bad example is that of scran::fastMNN(), which has a few matrix multiplications that use DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting in SNOW parallelization on Windows. This slowed down fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the faster option here. A related problem is MulticoreParam()?s tendency to copy the environment, which may result in problems from inflated memory consumption.
So, can we default to SerialParam() on all platforms? And by this I mean the BiocParallel in-built default - I don?t want to have to instruct all my users to put a ?register(SerialParam())? at the start of their analysis scripts. I feel that BiocParallel?s job is to provide downstream code with the potential for parallelization. If end-users want actual parallelization, they had better be prepared to specify an appropriate scheme via *Param() objects.
-A
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
_______________________________________________ Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
On Mon, Jan 7, 2019 at 4:09 AM Martin Morgan <mtmorgan.bioc at gmail.com> wrote:
I hope for 1. to have a 'local socket' (i.e., not using ports) implementation shortly. I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We now have
library(BiocParallel) set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538 at the expensive of using the generator when the package is loaded.
set.seed(1); rnorm(1)
[1] -0.6264538
set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433 Is that bad? It will be consistent across platforms.
Contrary to resampling methods etc., the RNG for generating random ports does _not_ have to be statistically sound. Because of this, you should be able to generate a random port while *preserving the RNG state*, i.e. undoing .GlobalEnv$.Random.seed (including removing if in a fresh R session). I decided to add this approach for future::makeClusterPSOCK() [https://github.com/HenrikBengtsson/future/blob/073e1d1/R/makeClusterPSOCK.R#L76-L82]. BTW, here's a useful development tool for detecting when the .Random.seed has been updated in an interactive session: addTaskCallback(local({ last <- .GlobalEnv$.Random.seed function(...) { curr <- .GlobalEnv$.Random.seed if (!identical(curr, last)) { warning(".Random.seed changed", call. = FALSE, immediate. = TRUE) last <<- curr } TRUE } }), name = "RNG tracker")
This behavior
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] 0.9624337 0.8925947
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597 0.1102093 seems wrong, but is consistent with mclapply
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527 0.40721777
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.8239765 1.2957928
FWIW, without having invested much time tracking down why the design
is what it is, I'm getting concerned about how parallel::mclapply()
does parallel RNG out of the box when RNGkind("L'Ecuyer-CMRG") is in
use. For example:
library(parallel)
RNGkind("L'Ecuyer-CMRG")
unlist(mclapply(1:2, function(i) rnorm(1))) ## mc.set.seed = TRUE
[1] -0.71727 0.56171
unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.71727 0.56171
unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.71727 0.56171 I wonder how many resampling/bootstrap studies are bitten by this behavior. Same if you use mc.set.seed = FALSE;
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311 Here is how I think about parallel RNG right now: 1. To achieve fully numerically reproducible RNGs in way that is *invariant to the number of workers* (amount of chunking), I think the only solution is to pregenerated RNG seeds (using parallel::nextRNGStream()) for each individual iteration (element). In other words, if a worker will process K elements, then the main R process needs to generate K RNG seeds and pass those along to the work. I use this approach for future.apply::future_lapply(..., future.seed = TRUE/<initial_seed>), which then produce identical RNG results regardless of backend and amount of chunking. In the past, I think I've seen Martin suggesting something similar as a manual approach to some users. 2. The above approach is obviously expensive, especially when there are a large number of elements to iterate over. Because of this I'm thinking providing an option to use only one RNG seed per worker (which is the common approach used elsewhere) [https://github.com/HenrikBengtsson/future.apply/issues/20]. This won't be invariant to the number of workers, but it "should" still be statistically sound. This approach will give reproducible RNG results given the same initial seed and the same amount of chunking. 3. For algorithms which do not rely on RNG, we can ignore both of the above. The problem is that it's not always known to the user/developer which methods depend on RNG or not. The above 'RNG tracker' helps to identify some, but things might also change over time. I believe there's room for automating this in one way or the other. For instance, having a way to declare a function being dependent on RNG or not could help. Static code inspection could also do it, e.g. when an R package is built and it could be part of the R CMD checks to validate. 4. Are there other approaches? /Henrik
The documented behavior is to us the RNGseed= argument to *Param, but I think it could be made consistent (by default, obey the global random number seed on workers) at least on a single machine (where the default number of cores is constant). I have not (yet?) changed the default behavior to SerialParam. I guess the cost of SerialParam is from the dependent packages that need to be loaded
system.time(suppressPackageStartupMessages(library(DelayedArray)))
user system elapsed
3.068 0.082 3.150
If fastMNN() makes several calls to bplapply(), it might make sense to start the default cluster at the top of the function once
if (!isup(bpparam())) {
bpstart(bpparam())
on.exit(bpstop(bpparam()))
}
Martin
?On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <bioc-devel-bounces at r-project.org on behalf of infinite.monkeys.with.keyboards at gmail.com> wrote:
As we know, the default BiocParallel backends are currently set to MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to some extent because a new user running, say, bplapply() without additional arguments or set-up would expect some kind of parallelization. However, from a developer?s perspective, I would argue that it makes more sense to use SerialParam() by default.
1. It avoids problems with MulticoreParam stalling (especially on Macs) when the randomly chosen port is in already use. This used to be a major problem, to the point that all my BiocParallel-using functions in scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as package default would ensure BiocParallel functions run properly in the first place; if the code stalls due to switching to MulticoreParam, then it?s obvious where the problem lies (and how to fix it).
2. It avoids the alteration of the random seed when the MulticoreParam instance is constructed for the first time.
library(BiocParallel) # new R session
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # 0.1315312
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # -0.5021924
This is because the first bplapply() call calls bpparam(), which constructs a MulticoreParam() for the first time; this calls the PRNG to choose a random port number. Ensuing random numbers are altered, as seen above. To avoid this, I need to define the MulticoreParam() object prior to set.seed(), which undermines the utility of a default-defined bpparam().
3. Job dispatch via SnowParam() is quite slow, which potentially makes Windows package builds run slower by default. A particularly bad example is that of scran::fastMNN(), which has a few matrix multiplications that use DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting in SNOW parallelization on Windows. This slowed down fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the faster option here. A related problem is MulticoreParam()?s tendency to copy the environment, which may result in problems from inflated memory consumption.
So, can we default to SerialParam() on all platforms? And by this I mean the BiocParallel in-built default - I don?t want to have to instruct all my users to put a ?register(SerialParam())? at the start of their analysis scripts. I feel that BiocParallel?s job is to provide downstream code with the potential for parallelization. If end-users want actual parallelization, they had better be prepared to specify an appropriate scheme via *Param() objects.
-A
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
To add to Henrik's comments, it is also worthwhile to recognize that
mclapply() does not deliver statistically sound random numbers even within
the apply "loop" unless you use RNGkind("L'Ecuyer-CMRG") which is not set
as default. This is because mclapply will initialize random streams with
different seeds and most RNGs does not guarantee that different seeds leads
to independent streams (but L'Ecuyer does).
Reproducibility is nice, but so is statistical correctness.
Best,
Kasper
On Mon, Jan 7, 2019 at 6:26 PM Henrik Bengtsson <henrik.bengtsson at gmail.com>
wrote:
On Mon, Jan 7, 2019 at 4:09 AM Martin Morgan <mtmorgan.bioc at gmail.com> wrote:
I hope for 1. to have a 'local socket' (i.e., not using ports)
implementation shortly.
I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We
now have
library(BiocParallel) set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538 at the expensive of using the generator when the package is loaded.
set.seed(1); rnorm(1)
[1] -0.6264538
set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433 Is that bad? It will be consistent across platforms.
Contrary to resampling methods etc., the RNG for generating random ports does _not_ have to be statistically sound. Because of this, you should be able to generate a random port while *preserving the RNG state*, i.e. undoing .GlobalEnv$.Random.seed (including removing if in a fresh R session). I decided to add this approach for future::makeClusterPSOCK() [ https://github.com/HenrikBengtsson/future/blob/073e1d1/R/makeClusterPSOCK.R#L76-L82 ]. BTW, here's a useful development tool for detecting when the .Random.seed has been updated in an interactive session: addTaskCallback(local({ last <- .GlobalEnv$.Random.seed function(...) { curr <- .GlobalEnv$.Random.seed if (!identical(curr, last)) { warning(".Random.seed changed", call. = FALSE, immediate. = TRUE) last <<- curr } TRUE } }), name = "RNG tracker")
This behavior
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] 0.9624337 0.8925947
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597 0.1102093 seems wrong, but is consistent with mclapply
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527 0.40721777
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.8239765 1.2957928
FWIW, without having invested much time tracking down why the design
is what it is, I'm getting concerned about how parallel::mclapply()
does parallel RNG out of the box when RNGkind("L'Ecuyer-CMRG") is in
use. For example:
library(parallel)
RNGkind("L'Ecuyer-CMRG")
unlist(mclapply(1:2, function(i) rnorm(1))) ## mc.set.seed = TRUE
[1] -0.71727 0.56171
unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.71727 0.56171
unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.71727 0.56171 I wonder how many resampling/bootstrap studies are bitten by this behavior. Same if you use mc.set.seed = FALSE;
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311 Here is how I think about parallel RNG right now: 1. To achieve fully numerically reproducible RNGs in way that is *invariant to the number of workers* (amount of chunking), I think the only solution is to pregenerated RNG seeds (using parallel::nextRNGStream()) for each individual iteration (element). In other words, if a worker will process K elements, then the main R process needs to generate K RNG seeds and pass those along to the work. I use this approach for future.apply::future_lapply(..., future.seed = TRUE/<initial_seed>), which then produce identical RNG results regardless of backend and amount of chunking. In the past, I think I've seen Martin suggesting something similar as a manual approach to some users. 2. The above approach is obviously expensive, especially when there are a large number of elements to iterate over. Because of this I'm thinking providing an option to use only one RNG seed per worker (which is the common approach used elsewhere) [https://github.com/HenrikBengtsson/future.apply/issues/20]. This won't be invariant to the number of workers, but it "should" still be statistically sound. This approach will give reproducible RNG results given the same initial seed and the same amount of chunking. 3. For algorithms which do not rely on RNG, we can ignore both of the above. The problem is that it's not always known to the user/developer which methods depend on RNG or not. The above 'RNG tracker' helps to identify some, but things might also change over time. I believe there's room for automating this in one way or the other. For instance, having a way to declare a function being dependent on RNG or not could help. Static code inspection could also do it, e.g. when an R package is built and it could be part of the R CMD checks to validate. 4. Are there other approaches? /Henrik
The documented behavior is to us the RNGseed= argument to *Param, but I
think it could be made consistent (by default, obey the global random number seed on workers) at least on a single machine (where the default number of cores is constant).
I have not (yet?) changed the default behavior to SerialParam. I guess
the cost of SerialParam is from the dependent packages that need to be loaded
system.time(suppressPackageStartupMessages(library(DelayedArray)))
user system elapsed 3.068 0.082 3.150 If fastMNN() makes several calls to bplapply(), it might make sense to
start the default cluster at the top of the function once
if (!isup(bpparam())) {
bpstart(bpparam())
on.exit(bpstop(bpparam()))
}
Martin
?On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <
bioc-devel-bounces at r-project.org on behalf of infinite.monkeys.with.keyboards at gmail.com> wrote:
As we know, the default BiocParallel backends are currently set to
MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to some extent because a new user running, say, bplapply() without additional arguments or set-up would expect some kind of parallelization. However, from a developer?s perspective, I would argue that it makes more sense to use SerialParam() by default.
1. It avoids problems with MulticoreParam stalling (especially on
Macs) when the randomly chosen port is in already use. This used to be a major problem, to the point that all my BiocParallel-using functions in scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as package default would ensure BiocParallel functions run properly in the first place; if the code stalls due to switching to MulticoreParam, then it?s obvious where the problem lies (and how to fix it).
2. It avoids the alteration of the random seed when the
MulticoreParam instance is constructed for the first time.
library(BiocParallel) # new R session
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # 0.1315312
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # -0.5021924
This is because the first bplapply() call calls bpparam(), which
constructs a MulticoreParam() for the first time; this calls the PRNG to choose a random port number. Ensuing random numbers are altered, as seen above. To avoid this, I need to define the MulticoreParam() object prior to set.seed(), which undermines the utility of a default-defined bpparam().
3. Job dispatch via SnowParam() is quite slow, which potentially
makes Windows package builds run slower by default. A particularly bad example is that of scran::fastMNN(), which has a few matrix multiplications that use DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting in SNOW parallelization on Windows. This slowed down fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the faster option here. A related problem is MulticoreParam()?s tendency to copy the environment, which may result in problems from inflated memory consumption.
So, can we default to SerialParam() on all platforms? And by this I
mean the BiocParallel in-built default - I don?t want to have to instruct all my users to put a ?register(SerialParam())? at the start of their analysis scripts. I feel that BiocParallel?s job is to provide downstream code with the potential for parallelization. If end-users want actual parallelization, they had better be prepared to specify an appropriate scheme via *Param() objects.
-A
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Good point Kasper, I hadn?t even considered the possibility of that. This has opened an unpleasant box of worms... I haven?t noticed any issues with MT, but I?ve just switched my DropletUtils C++ code from boost::random::mt19937 to dqrng?s pcg32 to avoid potential problems with overlaps between streams. And to avoid MT seeding issues (32-bit seed for a 19937-bit state). -A
On 8 Jan 2019, at 03:45, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
To add to Henrik's comments, it is also worthwhile to recognize that
mclapply() does not deliver statistically sound random numbers even within
the apply "loop" unless you use RNGkind("L'Ecuyer-CMRG") which is not set
as default. This is because mclapply will initialize random streams with
different seeds and most RNGs does not guarantee that different seeds leads
to independent streams (but L'Ecuyer does).
Reproducibility is nice, but so is statistical correctness.
Best,
Kasper
On Mon, Jan 7, 2019 at 6:26 PM Henrik Bengtsson <henrik.bengtsson at gmail.com>
wrote:
On Mon, Jan 7, 2019 at 4:09 AM Martin Morgan <mtmorgan.bioc at gmail.com> wrote:
I hope for 1. to have a 'local socket' (i.e., not using ports)
implementation shortly.
I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We
now have
library(BiocParallel) set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538 at the expensive of using the generator when the package is loaded.
set.seed(1); rnorm(1)
[1] -0.6264538
set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433 Is that bad? It will be consistent across platforms.
Contrary to resampling methods etc., the RNG for generating random ports does _not_ have to be statistically sound. Because of this, you should be able to generate a random port while *preserving the RNG state*, i.e. undoing .GlobalEnv$.Random.seed (including removing if in a fresh R session). I decided to add this approach for future::makeClusterPSOCK() [ https://github.com/HenrikBengtsson/future/blob/073e1d1/R/makeClusterPSOCK.R#L76-L82 ]. BTW, here's a useful development tool for detecting when the .Random.seed has been updated in an interactive session: addTaskCallback(local({ last <- .GlobalEnv$.Random.seed function(...) { curr <- .GlobalEnv$.Random.seed if (!identical(curr, last)) { warning(".Random.seed changed", call. = FALSE, immediate. = TRUE) last <<- curr } TRUE } }), name = "RNG tracker")
This behavior
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] 0.9624337 0.8925947
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597 0.1102093 seems wrong, but is consistent with mclapply
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527 0.40721777
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.8239765 1.2957928
FWIW, without having invested much time tracking down why the design
is what it is, I'm getting concerned about how parallel::mclapply()
does parallel RNG out of the box when RNGkind("L'Ecuyer-CMRG") is in
use. For example:
library(parallel)
RNGkind("L'Ecuyer-CMRG")
unlist(mclapply(1:2, function(i) rnorm(1))) ## mc.set.seed = TRUE
[1] -0.71727 0.56171
unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.71727 0.56171
unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.71727 0.56171 I wonder how many resampling/bootstrap studies are bitten by this behavior. Same if you use mc.set.seed = FALSE;
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311 Here is how I think about parallel RNG right now: 1. To achieve fully numerically reproducible RNGs in way that is *invariant to the number of workers* (amount of chunking), I think the only solution is to pregenerated RNG seeds (using parallel::nextRNGStream()) for each individual iteration (element). In other words, if a worker will process K elements, then the main R process needs to generate K RNG seeds and pass those along to the work. I use this approach for future.apply::future_lapply(..., future.seed = TRUE/<initial_seed>), which then produce identical RNG results regardless of backend and amount of chunking. In the past, I think I've seen Martin suggesting something similar as a manual approach to some users. 2. The above approach is obviously expensive, especially when there are a large number of elements to iterate over. Because of this I'm thinking providing an option to use only one RNG seed per worker (which is the common approach used elsewhere) [https://github.com/HenrikBengtsson/future.apply/issues/20]. This won't be invariant to the number of workers, but it "should" still be statistically sound. This approach will give reproducible RNG results given the same initial seed and the same amount of chunking. 3. For algorithms which do not rely on RNG, we can ignore both of the above. The problem is that it's not always known to the user/developer which methods depend on RNG or not. The above 'RNG tracker' helps to identify some, but things might also change over time. I believe there's room for automating this in one way or the other. For instance, having a way to declare a function being dependent on RNG or not could help. Static code inspection could also do it, e.g. when an R package is built and it could be part of the R CMD checks to validate. 4. Are there other approaches? /Henrik
The documented behavior is to us the RNGseed= argument to *Param, but I
think it could be made consistent (by default, obey the global random number seed on workers) at least on a single machine (where the default number of cores is constant).
I have not (yet?) changed the default behavior to SerialParam. I guess
the cost of SerialParam is from the dependent packages that need to be loaded
system.time(suppressPackageStartupMessages(library(DelayedArray)))
user system elapsed 3.068 0.082 3.150 If fastMNN() makes several calls to bplapply(), it might make sense to
start the default cluster at the top of the function once
if (!isup(bpparam())) {
bpstart(bpparam())
on.exit(bpstop(bpparam()))
}
Martin
?On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <
bioc-devel-bounces at r-project.org on behalf of infinite.monkeys.with.keyboards at gmail.com> wrote:
As we know, the default BiocParallel backends are currently set to
MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to some extent because a new user running, say, bplapply() without additional arguments or set-up would expect some kind of parallelization. However, from a developer?s perspective, I would argue that it makes more sense to use SerialParam() by default.
1. It avoids problems with MulticoreParam stalling (especially on
Macs) when the randomly chosen port is in already use. This used to be a major problem, to the point that all my BiocParallel-using functions in scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as package default would ensure BiocParallel functions run properly in the first place; if the code stalls due to switching to MulticoreParam, then it?s obvious where the problem lies (and how to fix it).
2. It avoids the alteration of the random seed when the
MulticoreParam instance is constructed for the first time.
library(BiocParallel) # new R session set.seed(100) invisible(bplapply(1:5, identity)) rnorm(1) # 0.1315312 set.seed(100) invisible(bplapply(1:5, identity)) rnorm(1) # -0.5021924 This is because the first bplapply() call calls bpparam(), which
constructs a MulticoreParam() for the first time; this calls the PRNG to choose a random port number. Ensuing random numbers are altered, as seen above. To avoid this, I need to define the MulticoreParam() object prior to set.seed(), which undermines the utility of a default-defined bpparam().
3. Job dispatch via SnowParam() is quite slow, which potentially
makes Windows package builds run slower by default. A particularly bad example is that of scran::fastMNN(), which has a few matrix multiplications that use DelayedArray:%*%. The %*% is parallelized with the default bpparam(), resulting in SNOW parallelization on Windows. This slowed down fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial execution is the faster option here. A related problem is MulticoreParam()?s tendency to copy the environment, which may result in problems from inflated memory consumption.
So, can we default to SerialParam() on all platforms? And by this I
mean the BiocParallel in-built default - I don?t want to have to instruct all my users to put a ?register(SerialParam())? at the start of their analysis scripts. I feel that BiocParallel?s job is to provide downstream code with the potential for parallelization. If end-users want actual parallelization, they had better be prepared to specify an appropriate scheme via *Param() objects.
-A
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel _______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
On Mon, Jan 7, 2019 at 3:26 PM Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote:
1. To achieve fully numerically reproducible RNGs in way that is *invariant to the number of workers* (amount of chunking), I think the only solution is to pregenerated RNG seeds (using parallel::nextRNGStream()) for each individual iteration (element). In other words, if a worker will process K elements, then the main R process needs to generate K RNG seeds and pass those along to the work. I use this approach for future.apply::future_lapply(..., future.seed = TRUE/<initial_seed>), which then produce identical RNG results regardless of backend and amount of chunking. In the past, I think I've seen Martin suggesting something similar as a manual approach to some users. 2. The above approach is obviously expensive, especially when there are a large number of elements to iterate over. Because of this I'm thinking providing an option to use only one RNG seed per worker (which is the common approach used elsewhere) [https://github.com/HenrikBengtsson/future.apply/issues/20]. This won't be invariant to the number of workers, but it "should" still be statistically sound. This approach will give reproducible RNG results given the same initial seed and the same amount of chunking. 3. For algorithms which do not rely on RNG, we can ignore both of the above. The problem is that it's not always known to the user/developer which methods depend on RNG or not. The above 'RNG tracker' helps to identify some, but things might also change over time. I believe there's room for automating this in one way or the other. For instance, having a way to declare a function being dependent on RNG or not could help. Static code inspection could also do it, e.g. when an R package is built and it could be part of the R CMD checks to validate. 4. Are there other approaches?
I don't suppose it's possible to quickly determine via static analysis whether a piece of code uses the RNG?