
portable parallel seeds project: request for critiques

11 messages · Paul Gilbert, Thomas Lumley, Paul Johnson +1 more

#
I've got another edition of my simulation replication framework.  I'm
attaching 2 R files and pasting in the readme.

I would especially like to know if I'm doing anything that breaks
.Random.seed or other things that R's parallel uses in the
environment.

In case you don't want to wrestle with attachments, the same files are
online in our SVN

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/


## Paul E. Johnson CRMDA <pauljohn at ku.edu>
## Portable Parallel Seeds Project.
## 2012-02-18

Portable Parallel Seeds Project

This is how I'm going to recommend we work with random number seeds in
simulations. It enhances work that requires runs with random numbers,
whether runs are in a cluster computing environment or in a single
workstation.

It is a solution for two separate problems.

Problem 1. I scripted up 1000 R runs and need high quality,
unique, replicable random streams for each one. Each simulation
runs separately, but I need to be confident their streams are
not correlated or overlapping. For replication, I need to be able to
select any run, say 667, and restart it exactly as it was.

Problem 2. I've written a Parallel MPI (Message Passing Interface)
routine that launches 1000 runs, and I need to assure each has
a unique, replicable, random stream. I need to be able to
select any run, say 667, and restart it exactly as it was.

This project develops one approach to create replicable simulations.
It blends ideas about seed management from John M. Chambers
Software for Data Analysis (2008) with ideas from the snowFT
package by Hana Sevcikova and Tony R. Rossini.


Here's my proposal.

1. Run a preliminary program to generate an array of seeds

run1:      seed1.1     seed1.2     seed1.3
run2:      seed2.1     seed2.2     seed2.3
run3:      seed3.1     seed3.2     seed3.3
...        ...         ...
run1000:   seed1000.1  seed1000.2  seed1000.3

This example provides 3 separate streams of random numbers within each
run. Because we will use the L'Ecuyer "many separate streams"
approach, we are confident that there is no correlation or overlap
between any of the runs.

The projSeeds file has to have one row per run, but it is not a huge
file. I created seeds for 2000 runs of a project that requires 2 seeds
per run. The saved file is 104,443 bytes, which is very small. By
comparison, a 1400x1050 jpg image would usually be twice that size.
If you save 10,000 runs' worth of seeds, the size rises to 521,993
bytes, still pretty small.

Because the seeds are saved in a file, we are sure each
run can be replicated. We just have to teach each program
how to use the seeds. That is step two.


2. Inside each run, an initialization function runs that loads the
seeds file and takes the row of seeds that it needs.  As the
simulation progresses, the user can ask for random numbers from the
separate streams. When we need random draws from a particular stream,
we set the variable "currentStream" with the function useStream().

The function initSeedStreams creates several objects in
the global environment. It sets the integer currentStream,
as well as two list objects, startSeeds and currentSeeds.
At the outset of the run, startSeeds and currentSeeds
are the same thing. When we change the currentStream
to a different stream, the currentSeeds vector is
updated to remember where that stream was when we stopped
drawing numbers from it.
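The two functions live in the attached R files. As a rough sketch of the
mechanism (simplified, with no error checking; the real initSeedStreams()
and useStream() may differ in detail), they amount to swapping saved
.Random.seed vectors in and out of the global environment:

```r
## Sketch only: simplified stand-ins for initSeedStreams() and
## useStream(). Assumes projSeeds is a list with one element per run,
## each element a list of saved .Random.seed vectors (one per stream).
RNGkind("L'Ecuyer-CMRG")

initSeedStreams <- function(run) {
  startSeeds    <<- projSeeds[[run]]
  currentSeeds  <<- startSeeds
  currentStream <<- 1L
  assign(".Random.seed", currentSeeds[[1L]], envir = .GlobalEnv)
}

useStream <- function(n, origin = FALSE) {
  ## remember where the active stream left off before switching
  currentSeeds[[currentStream]] <<- .Random.seed
  currentStream <<- as.integer(n)
  newSeed <- if (origin) startSeeds[[n]] else currentSeeds[[n]]
  assign(".Random.seed", newSeed, envir = .GlobalEnv)
}

## a toy projSeeds with 1 run and 2 streams, built with base R tools
set.seed(234234)
s1 <- .Random.seed
projSeeds <- list(list(s1, parallel::nextRNGStream(s1)))
```

With that in place, initSeedStreams(1); runif(3) draws from stream 1,
useStream(2) switches streams, and useStream(1, origin = TRUE) rewinds
stream 1 to its starting point.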


Now, for the proof of concept. A working example.

Step 1. Create the Seeds. Review the R program

seedCreator.R

That creates the file "projSeeds.rda".


Step 2. Use one row of seeds per run.

Please review "controlledSeeds.R" to see an example usage
that I've tested on a cluster.

"controlledSeeds.R" can also be run on a single workstation for
testing purposes.  There is a variable "runningInMPI" which determines
whether the code is supposed to run on the Rmpi cluster or just on a
single workstation.


The code for each run of the model begins by loading the
required libraries and loading the seed file, if it exists, or
generating a new "projSeeds" object if it is not found.

library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)
if (file.exists("projSeeds.rda")) {
  load("projSeeds.rda")
} else {
  source("seedCreator.R")
}

## Suppose the "run" number is:
run <- 232
initSeedStreams(run)

After that, R's random generator functions will draw values
from the first random stream that was initialized
in projSeeds. When each repetition (run) occurs,
R looks up the right seed for that run and uses it.

If the user wants to begin drawing observations from the
second random stream, this command is used:

useStream(2)

If the user has drawn values from stream 1 already, but
wishes to begin again at the initial point in that stream,
use this command:

useStream(1, origin = TRUE)
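For readers without the attached files, the effect of origin = TRUE can
be imitated with base R alone: keep a copy of .Random.seed from the
stream's starting point and write it back to rewind (a sketch, not the
actual implementation):

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

origin <- .Random.seed                # the stream's starting state
first  <- runif(5)                    # draw some values
invisible(runif(100))                 # ... the state keeps moving on

assign(".Random.seed", origin, envir = .GlobalEnv)  # rewind to origin
again <- runif(5)
identical(first, again)               # the draws replicate exactly
```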


Question: Why is this approach better for parallel runs?

Answer: After a batch of simulations, we can re-start any
one of them and repeat it exactly. This builds on the idea
of the snowFT package, by Hana Sevcikova and A.J. Rossini.

That is different from the default approach of most R parallel
designs, including R's own parallel, RMPI and snow.

The ordinary way of controlling seeds in an R parallel design would
initialize the generators once on each of the (say) 50 nodes, and we
would lose control over the seeds because runs are repeatedly assigned
to nodes. The aim here is to make sure that each particular run has a
known starting point. After a batch of 10,000 runs, we can look and say
"something funny happened on run 1,323" and then bring that run back to
life later, easily.



Question: Why is this better than the simple old approach of
setting the seeds within each run with a formula like

set.seed(2345 + 10 * run)

Answer: That does allow replication, but it does not assure
that each run uses non-overlapping random number streams. It
offers absolutely no assurance whatsoever that the runs are
actually non-redundant.

Nevertheless, it is a method that is widely used and recommended
by some visible HOWTO guides.



Citations

Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant
 Simple Network of Workstations. R package version 1.2-0.
 http://CRAN.R-project.org/package=snowFT

John M Chambers (2008). SoDA: Functions and Examples for "Software
  for Data Analysis". R package version 1.0-3.

John M Chambers (2008) Software for Data Analysis. Springer.
#
Paul

I think (perhaps incorrectly) of the general problem as being that one 
wants to run a random experiment, on a single node, or two nodes, or ten 
nodes, or any number of nodes, and reliably be able to reproduce the 
experiment without concern about how many nodes it runs on when you 
re-run it.

From your description I don't have the impression your solution would 
do that. Am I misunderstanding?

A second problem is that you want to use a proven algorithm for 
generating the numbers. This is implicitly solved by the above, because 
you always get the same result as you do on one node with a well-proven 
RNG. If you generate a string of seeds and then numbers from those, do 
you have a proven RNG?

Paul
On 12-02-17 03:57 PM, Paul Johnson wrote:
#
On Fri, Feb 17, 2012 at 3:23 PM, Paul Gilbert <pgilbert902 at gmail.com> wrote:
Well, I think my approach does that!  Each time a function runs, it grabs
a pre-specified set of seed values and initializes the R .Random.seed
appropriately.

Since I take the pre-specified seeds from the L'Ecuyer et al approach
(cite below), I believe that
means each separate stream is dependably uncorrelated and non-overlapping, both
within a particular run and across runs.
Luckily, I think that part was solved by people other than me:

L'Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002) An
object-oriented random-number package with many long streams and
substreams. Operations Research 50, 1073-1075.
http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf

  
    
#
Ok, I guess I need to look more carefully.

Thanks,
Paul
On 12-02-17 04:44 PM, Paul Johnson wrote:
#
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
Hi.

Some of the random number generators allow as a seed a vector,
not only a single number. This can simplify generating the seeds.
There can be one seed for each of the 1000 runs and then,
the rows of the seed matrix can be

  c(seed1, 1), c(seed1, 2), ...
  c(seed2, 1), c(seed2, 2), ...
  c(seed3, 1), c(seed3, 2), ...
  ...

There could be even only one seed and the matrix can be generated as

  c(seed, 1, 1), c(seed, 1, 2), ...
  c(seed, 2, 1), c(seed, 2, 2), ...
  c(seed, 3, 1), c(seed, 3, 2), ...

If the initialization using the vector c(seed, i, j) is done
with a good quality hash function, the runs will be independent.

What is your opinion on this?

An advantage of seeding with a vector is also that there can
be significantly more initial states of the generator among
which we select by the seed than 2^32, which is the maximum
for a single integer seed.
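One way to make this idea concrete (a sketch only; the function names
fnv1a32 and vectorSeed are illustrative, and FNV-1a is used just to keep
the code self-contained -- it is NOT a cryptographic hash, whereas a real
version would use AES or a SHA-family function as discussed in this
thread):

```r
## Sketch of the idea: reduce the vector c(seed, i, j) to one scalar
## seed via a hash, so each (run i, stream j) gets its own seed.
fnv1a32 <- function(words) {
  xor32 <- function(a, b)  # 32-bit xor via 16-bit halves (doubles stay exact)
    bitwXor(a %/% 65536, b %/% 65536) * 65536 +
    bitwXor(a %%  65536, b %%  65536)
  mul32 <- function(h, p)  # (h * p) mod 2^32 without losing precision
    ((h %% 65536) * p + ((h %/% 65536 * p) %% 65536) * 65536) %% 2^32
  h <- 2166136261          # FNV-1a offset basis
  for (w in words %% 2^32) h <- mul32(xor32(h, w), 16777619)  # FNV prime
  h
}

## one scalar seed per (run i, stream j), all derived from one seed
vectorSeed <- function(seed, i, j) as.integer(fnv1a32(c(seed, i, j)) %% 2^31)

set.seed(vectorSeed(2345, 7, 1)); x <- runif(5)  # run 7, stream 1
set.seed(vectorSeed(2345, 7, 2)); y <- runif(5)  # run 7, stream 2
```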

Petr Savicky.
#
On Fri, Feb 17, 2012 at 5:06 PM, Petr Savicky <savicky at cs.cas.cz> wrote:
Yes, I understand.

The seeds I'm using are the 6 integer values from the L'Ecuyer
generator. If you run the example script, the verbose option causes
some to print out.  The first 3 seeds in a saved project seeds file
look like:
[[1]]
[1]         407   376488316  1939487821  1433925148 -1040698333   579503880
[7]  -624878918

[[2]]
[1]         407 -1107332181   854177397  1773099324  1774170776  -266687360
[7]   816955059

[[3]]
[1]         407   936506900 -1924631332 -1380363206  2109234517  1585239833
[7] -1559304513

The 407 in the first position is an integer R uses to note the type of
stream for which the seed is intended, in this case L'Ecuyer-CMRG.
I don't have any formal proof that a "good quality hash function"
would truly create seeds from which independent streams will be drawn.

There is, however, the proof in the L'Ecuyer paper that one can take
the long stream and divide it into sections.  That's the approach I'm
taking here. It's the same approach that R's parallel package follows,
and parallel frameworks like snow.

The different thing in my approach is that I'm saving one row of seeds
per simulation "run".  So each run can be replicated exactly.

I hope.

pj

  
    
#
On Fri, Feb 17, 2012 at 09:33:33PM -0600, Paul Johnson wrote:
[...]
The parallel streams use the MRG32k3a generator by P. L'Ecuyer, whose
original implementation stores its internal state as double precision
floating point numbers. The vector .Random.seed consists of integers.
Do you know whether a modified implementation of MRG32k3a is used,
which works with integers internally, or whether the conversion between
double and integer is done whenever the state is stored to .Random.seed?
One of the suggested ways to generate, say, 2^16 cryptographically
secure random numbers is to use a counter producing a sequence 1, 2, 3,
..., 2^16 and transform these values by a hash function. An example is
the Fortuna generator

  http://en.wikipedia.org/wiki/Fortuna_(PRNG)

which is also available on CRAN as "randaes". The length of the sequence
is limited, since the sequence contains no collisions. If the sequence
is too long, this could make it possible to distinguish it from truly
random numbers. After generating 2^16 numbers, the seed is recomputed
and another 2^16 numbers are generated.

Such a generator is a very good one. It is not used for simulations, since it
is slower (say by a factor of 5) than the linear generators like Mersenne
Twister, WELL or MRG family of generators. However, if it is used only for
initialization, then the speed is less important.
According to

  http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf

the sectioning of the stream into substreams is done by a jump-ahead
algorithm. The new seed is far enough ahead in the sequence from the
previous seed, so it is guaranteed that the sequence generated from the
previous seed is shorter than the jump and the subsequences do not
overlap. However, the new seed is computable as a function of the
previous seed. If we use this strategy to produce a matrix of seeds

  s_{1,1}, s_{1,2}, ...
  s_{2,1}, s_{2,2}, ...
  s_{3,1}, s_{3,2}, ...

then each s_{i,j} may be computed from s_{1,1} and i, j. In this case, it is
sufficient to store s_{1,1}. If we know for each run the indices i,j, then
we can compute s_{i,j} by an efficient algorithm.
Saving .Random.seed should be a safe strategy.
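Base R exposes exactly this jump-ahead machinery for the L'Ecuyer-CMRG
generator: parallel::nextRNGStream() advances a seed by 2^127 steps and
parallel::nextRNGSubStream() by 2^76, so a whole matrix of seeds s_{i,j}
can be generated from the single seed s_{1,1}. A sketch of how such a
seed matrix might be built (the attached seedCreator.R presumably does
something along these lines):

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(234234)                 # this determines s_{1,1}
s <- .Random.seed

nRuns <- 5; nStreams <- 3
seeds <- vector("list", nRuns)   # seeds[[i]][[j]] will hold s_{i,j}
for (i in seq_len(nRuns)) {
  row <- vector("list", nStreams)
  row[[1]] <- s
  for (j in seq_len(nStreams - 1))
    row[[j + 1]] <- nextRNGSubStream(row[[j]])  # jump 2^76 within a run
  seeds[[i]] <- row
  s <- nextRNGStream(s)                         # jump 2^127 to next run
}

## replicate run 3, stream 1, at any later time:
assign(".Random.seed", seeds[[3]][[1]], envir = .GlobalEnv)
draws <- runif(2)
```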

Petr Savicky.
1 day later
#
On Sat, Feb 18, 2012 at 4:33 PM, Paul Johnson <pauljohn32 at gmail.com> wrote:
That is essentially the *definition* of a good quality hash function,
at least in the cryptographic sense.  It maps the inputs into numbers
that are indistinguishable from uniform random except that the same
input always gives the same output.

What's harder is to prove that you *have* a good quality hash
function, but for these (non-adversarial) purposes even something like
MD4 would be fine, and certainly the SHA family.

    -thomas
1 day later
#
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
Hi.

In the description of your project in the file

  http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/README

you argue as follows

  Question: Why is this better than the simple old approach of
  setting the seeds within each run with a formula like
   
  set.seed(2345 + 10 * run)
   
  Answer: That does allow replication, but it does not assure
  that each run uses non-overlapping random number streams. It
  offers absolutely no assurance whatsoever that the runs are
  actually non-redundant.

The following demonstrates that the function set.seed() for the
default generator can indeed produce correlated streams.

  ## advance a seed by one step of the 69069 * x + 1 (mod 2^32)
  ## congruence used in R's seed scrambling, mapping between R's
  ## signed and unsigned 32-bit representations
  step <- function(x)
  {
      x[x < 0] <- x[x < 0] + 2^32
      x <- (69069 * x + 1) %% 2^32
      x[x > 2^31] <- x[x > 2^31] - 2^32
      x
  }

  n <- 1000
  seed1 <- 124370417 # any seed
  seed2 <- step(seed1)

  set.seed(seed1)
  x <- runif(n)
  set.seed(seed2)
  y <- runif(n)

  rbind(seed1, seed2)
  table(x[-1] == y[-n])

The output is

             [,1]
  seed1 124370417
  seed2 205739774
  
  FALSE  TRUE 
      5   994 

This means that if the streams x, y are generated from the two
seeds above, then y is almost exactly equal to x shifted by 1.

What is the current state of your project?

Petr Savicky.
1 day later
#
Greetings. Answers below.
On Tue, Feb 21, 2012 at 7:04 AM, Petr Savicky <savicky at cs.cas.cz> wrote:

            
Thanks for the example. That is exactly the point I've been making
about the old, ordinary way of setting seeds.

I am continuing to test my plan, and I believe it does work. The
example implementation is successful.  I have not received any
critiques that make me think it is theoretically wrong, but I also have
not received feedback from anybody who has very detailed knowledge of
R's parallel package to tell me that my approach is good.

In order for this to be easy for users, I need to put the init-streams
and set-current-stream functions into a package, and then streamline
the process of creating the seed array.  My opinion is that CRAN is
now overflowing with "one function" packages. I don't want to create
another one just for these two little functions, so I may roll them
into my general purpose regression package called "rockchalk".

If I worked in a company and had money riding on this, I'd need to
hire someone from R Core to stare at the way I'm setting, saving, and
re-setting the .Random.seed variable to be completely sure there are
no complications.  Because the parallel package is so new, there is
relatively little documentation on how it is supposed to work. I'm
just gazing at the source code and trying not to break anything it
does, and trying to stop it from breaking anything I do.

One technical issue that has been raised to me is that R parallel's
implementation of the L'Ecuyer generator is based on integer-valued
variables, whereas the original L'Ecuyer code uses double-precision
real variables.  But I'm trusting R Core on this: if they code the
generator in a way that is good enough for R itself, it is good enough
for me. (And if that's wrong, we'll all find out together :) ).

Josef Leydold (the rstream package author) has argued that R's
implementation runs more slowly than it ought to. We had some
correspondence and I tracked a few threads in forums. It appears the
approach suggested there is roadblocked by some characteristics deep
down in R and the way random streams are managed.  Packages have only
a limited, tenuous method to replace R's generators with their own
generators. Here is the comment that L'Ecuyer and Leydold make in the
document that is distributed with rstream, "rstream: Streams of
Random Numbers for Stochastic Simulation":

"There is one drawback in the implementation. R uses a static table to
implement access to different kinds of random number generators. Only
one slot (RNGkind(kind="user-supplied")) has been provided for users
who want to add her own generator. This slot relies on a pointer to a
function called user_unif_rand. Thus rstream has to abuse this slot to
add additional random number generators. However, there are other
packages on CRAN that use the same trick (randaes, rlecuyer, rsprng,
SuppDists) or load one of these packages (Rmpi, rpvm, rpart, snow,
varSelRF). Thus if a user loads two of these packages or adds her own
uniform random number generator, the behavior of at least one of the
packages is broken. This problem could be fixed by some changes in the
R core system, either by adding a package variable to
RNGkind(kind="user-supplied"), similar to the .Call function, or by
replacing the static table by a dynamic one."

I interpret all of this to mean that, although other approaches may
work as well or better, the only approach that can "defend itself" in
the ecology of a running R program is an approach that fiddles with
R's own generators, rather than trying to replace them. And that's
what I'm doing.

pj

ps

Just this moment, while googling for Leydold's comments, I found a
post I had not seen before. This is neat.

Parallel Random Number Generation in C++ and R Using RngStream
Andrew Karl, Randy Eubank, Dennis Young
http://math.la.asu.edu/~eubank/webpage/rngStreamPaper.pdf

I think that faces the same problem L'Ecuyer and Leydold described.
The package may hang its generator on one pointer inside the R random
scheme, but there's no way to be sure it stays there.
#
On Wed, Feb 22, 2012 at 12:17:25PM -0600, Paul Johnson wrote:
[...]
I am also preparing a solution to the problem. One is based on AES
used for initialization of the R base Mersenne-Twister generator,
so it only replaces the set.seed() function. Another solution is based
on the "rlecuyer" package. I suggest discussing the possible solutions
off-list before submitting to CRAN.
I do not know about any L'Ecuyer generator in R base. You probably
mean the authors of the extension packages with these generators.
In order to connect a user defined generator to R, there are two
obligatory entry points, "user_unif_rand" and "user_unif_init".
The first allows the generator to be called from runif() and similar
functions. The second connects the generator to the set.seed() function.
If there is only one extension package with a generator loaded
to an R session, then these entry points are good enough. If the
package provides several generators, like "randtoolbox", it is
possible to change between them easily using functions provided
by the package for this purpose. I think that having several
packages with generators simultaneously can be good for their
development, but this is not needed for their use in applications.

There are also two other entry points, "user_unif_nseed" and
"user_unif_seedloc", which make it possible to support the variable
".Random.seed". A problem with this is that R converts the internal
state of the generator to ".Random.seed" by reading a specific memory
location, but does not alert the package about this event. So, if the
state requires a transformation to integer before being stored to
".Random.seed", it is not possible to do this only when needed.

In the package "rngwell19937", I included some code that tries to
determine whether the user changed ".Random.seed" or not. The reason
is that most of the state is integer and is stored to ".Random.seed",
but the state also contains a function pointer, which is not stored.
It can be recomputed from ".Random.seed", and this recomputing is done
if the package detects a change of ".Random.seed". This is not a nice
solution. So in "randtoolbox" we decided not to support ".Random.seed".

I understand that in the area of parallel computing, working
with ".Random.seed" is a good paradigm, but if the generator
provides other tools for converting the state to an R object
and putting it back as the active state, then ".Random.seed" is not
strictly necessary.
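In base R the ".Random.seed" paradigm amounts to copying the state
object out and writing it back; a pair of tiny helpers (an illustrative
sketch, not from any of the packages mentioned) makes the pattern
explicit:

```r
## save and restore the generator state as an ordinary R object
saveRNGState <- function() {
  if (!exists(".Random.seed", envir = .GlobalEnv))
    invisible(runif(1))            # force initialization of the state
  get(".Random.seed", envir = .GlobalEnv)
}

restoreRNGState <- function(state)
  assign(".Random.seed", state, envir = .GlobalEnv)

st <- saveRNGState()
x  <- runif(3)
restoreRNGState(st)
identical(runif(3), x)             # the state round-trips exactly
```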
Thank you very much for this link.

All the best, Petr.