On Mon, Jan 7, 2019 at 4:09 AM Martin Morgan <mtmorgan.bioc at gmail.com>
wrote:
I hope for 1. to have a 'local socket' (i.e., not using ports)
I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We
library(BiocParallel)
set.seed(1); p = bpparam(); rnorm(1)
set.seed(1); p = bpparam(); rnorm(1)
[1] -0.6264538
at the expensive of using the generator when the package is loaded.
set.seed(1); library(BiocParallel); rnorm(1)
[1] 0.1836433
Is that bad? It will be consistent across platforms.
Contrary to resampling methods etc., the RNG for generating random
ports does _not_ have to be statistically sound. Because of this, you
should be able to generate a random port while *preserving the RNG
state*, i.e. undoing .GlobalEnv$.Random.seed (including removing if in
a fresh R session). I decided to add this approach for
future::makeClusterPSOCK()
[
https://github.com/HenrikBengtsson/future/blob/073e1d1/R/makeClusterPSOCK.R#L76-L82
].
BTW, here's a useful development tool for detecting when the
.Random.seed has been updated in an interactive session:
addTaskCallback(local({
last <- .GlobalEnv$.Random.seed
function(...) {
curr <- .GlobalEnv$.Random.seed
if (!identical(curr, last)) {
warning(".Random.seed changed", call. = FALSE, immediate. = TRUE)
last <<- curr
}
TRUE
}
}), name = "RNG tracker")
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
[1] -0.5703597 0.1102093
seems wrong, but is consistent with mclapply
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.02704527 0.40721777
set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
FWIW, without having invested much time tracking down why the design
is what it is, I'm getting concerned about how parallel::mclapply()
does parallel RNG out of the box when RNGkind("L'Ecuyer-CMRG") is in
use. For example:
library(parallel)
RNGkind("L'Ecuyer-CMRG")
unlist(mclapply(1:2, function(i) rnorm(1))) ## mc.set.seed = TRUE
unlist(mclapply(1:2, function(i) rnorm(1)))
unlist(mclapply(1:2, function(i) rnorm(1)))
[1] -0.71727 0.56171
I wonder how many resampling/bootstrap studies are bitten by this
behavior. Same if you use mc.set.seed = FALSE;
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
unlist(mclapply(1:2, function(i) rnorm(1), mc.set.seed = FALSE))
[1] -0.8518311 -0.8518311
Here is how I think about parallel RNG right now:
1. To achieve fully numerically reproducible RNGs in way that is
*invariant to the number of workers* (amount of chunking), I think the
only solution is to pregenerated RNG seeds (using
parallel::nextRNGStream()) for each individual iteration (element).
In other words, if a worker will process K elements, then the main R
process needs to generate K RNG seeds and pass those along to the
work. I use this approach for future.apply::future_lapply(...,
future.seed = TRUE/<initial_seed>), which then produce identical RNG
results regardless of backend and amount of chunking. In the past, I
think I've seen Martin suggesting something similar as a manual
approach to some users.
2. The above approach is obviously expensive, especially when there
are a large number of elements to iterate over. Because of this I'm
thinking providing an option to use only one RNG seed per worker
(which is the common approach used elsewhere)
[https://github.com/HenrikBengtsson/future.apply/issues/20]. This
won't be invariant to the number of workers, but it "should" still be
statistically sound. This approach will give reproducible RNG results
given the same initial seed and the same amount of chunking.
3. For algorithms which do not rely on RNG, we can ignore both of the
above. The problem is that it's not always known to the
user/developer which methods depend on RNG or not. The above 'RNG
tracker' helps to identify some, but things might also change over
time. I believe there's room for automating this in one way or the
other. For instance, having a way to declare a function being
dependent on RNG or not could help. Static code inspection could also
do it, e.g. when an R package is built and it could be part of the R
CMD checks to validate.
4. Are there other approaches?
/Henrik
The documented behavior is to us the RNGseed= argument to *Param, but I
think it could be made consistent (by default, obey the global random
number seed on workers) at least on a single machine (where the default
number of cores is constant).
I have not (yet?) changed the default behavior to SerialParam. I guess
the cost of SerialParam is from the dependent packages that need to be
loaded
system.time(suppressPackageStartupMessages(library(DelayedArray)))
user system elapsed
3.068 0.082 3.150
If fastMNN() makes several calls to bplapply(), it might make sense to
start the default cluster at the top of the function once
if (!isup(bpparam())) {
bpstart(bpparam())
on.exit(bpstop(bpparam()))
}
Martin
?On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" <
bioc-devel-bounces at r-project.org on behalf of
infinite.monkeys.with.keyboards at gmail.com> wrote:
As we know, the default BiocParallel backends are currently set to
MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to
some extent because a new user running, say, bplapply() without additional
arguments or set-up would expect some kind of parallelization. However,
from a developer?s perspective, I would argue that it makes more sense to
use SerialParam() by default.
1. It avoids problems with MulticoreParam stalling (especially on
Macs) when the randomly chosen port is in already use. This used to be a
major problem, to the point that all my BiocParallel-using functions in
scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as
package default would ensure BiocParallel functions run properly in the
first place; if the code stalls due to switching to MulticoreParam, then
it?s obvious where the problem lies (and how to fix it).
2. It avoids the alteration of the random seed when the
MulticoreParam instance is constructed for the first time.
library(BiocParallel) # new R session
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # 0.1315312
set.seed(100)
invisible(bplapply(1:5, identity))
rnorm(1) # -0.5021924
This is because the first bplapply() call calls bpparam(), which
constructs a MulticoreParam() for the first time; this calls the PRNG to
choose a random port number. Ensuing random numbers are altered, as seen
above. To avoid this, I need to define the MulticoreParam() object prior to
set.seed(), which undermines the utility of a default-defined bpparam().
3. Job dispatch via SnowParam() is quite slow, which potentially
makes Windows package builds run slower by default. A particularly bad
example is that of scran::fastMNN(), which has a few matrix multiplications
that use DelayedArray:%*%. The %*% is parallelized with the default
bpparam(), resulting in SNOW parallelization on Windows. This slowed down
fastMNN()?s examples from 4 seconds (unix) to ~100 seconds (windows).
Clearly, serial execution is the faster option here. A related problem is
MulticoreParam()?s tendency to copy the environment, which may result in
problems from inflated memory consumption.
So, can we default to SerialParam() on all platforms? And by this I
mean the BiocParallel in-built default - I don?t want to have to instruct
all my users to put a ?register(SerialParam())? at the start of their
analysis scripts. I feel that BiocParallel?s job is to provide downstream
code with the potential for parallelization. If end-users want actual
parallelization, they had better be prepared to specify an appropriate
scheme via *Param() objects.
-A
[[alternative HTML version deleted]]