Quick rsprng questions
I'm far from an authority, but I'll try to answer.
On Thu, 2009-07-30 at 16:35 -0400, Thomas Hampton wrote:
Hello Ross, We recently installed rsprng on our beowulf cluster. My recollection was that R routines like sample() gave weird results before we did this -- namely, you could sample as many times as you like and you would get the same result, as if you were setting the seed to some fixed value (even when you did not). I am not observing this behavior now. My questions are these. First, is it a normal feature of clusters to show odd random number properties if you do not have something like rsprng in there?
I would not expect the behavior you described above. The only misbehavior that seems likely is that each node/process in the cluster gets the same, or at least not fully independent, streams of random numbers.
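To make that failure mode concrete, here is a minimal sketch in Python (chosen only because it is short and self-contained; in R the analogue would be every process calling set.seed() with the same value before runif()). The function worker_draws and the seed 12345 are made up for illustration and are not part of rsprng or snow.

```python
import random

def worker_draws(seed, n=3):
    """Simulate one worker: seed a private generator and draw n uniforms."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Four hypothetical workers that all start from the same generator state.
same_seed = [worker_draws(12345) for _ in range(4)]

# Every worker reports exactly the same "random" numbers.
all_identical = all(d == same_seed[0] for d in same_seed)
print(all_identical)
```

Pooling those four results would look like twelve draws but carry the information of only three.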
Second, if you install rsprng, does the problem just magically go away, or do you need to make special calls in your R code to take advantage of rsprng?
First, you need to initialize rsprng properly, and second, you need to be able to access it. Unless something else initializes it for you (e.g., snow provides clusterSetupSPRNG), you need to do so yourself by calling init.sprng with appropriate parameters (which include the total number of processes and the rank of the process executing the initialization). This will create independent streams. The second issue is getting access to these random numbers. The uniform random number generator and anything derived from it should work. I'm not sure whether the normal random number generator will use SPRNG; I suspect it will. If you're trying to access the random number stream from C code, it's tricky. There are more details on the web page I announced: http://wiki.r-project.org/rwiki/doku.php?id=packages:cran:rsprng.
Finally, why (roughly) is random number generation different in the parallel environment to begin with?
In the simplest case, you might get the same random number stream in each parallel process. This means the extra runs are pointless and, if you use them naively, you will think you have a much bigger sample than you really do. A more complex problem is that the random number streams could be dependent, but in a more subtle way. A simple strategy is to generate a list of random integers to serve as seeds, ship a different seed to each process, and then set the seed in each process. This works with non-parallel RNGs and is probably good enough in most cases (it's a popular move in the biostat dept here). I suspect there are some issues with it, though, because otherwise there'd be no need for explicit parallel random number generators like SPRNG.
Ross
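P.S. A rough sketch of the seed-shipping strategy, again in Python only to keep it self-contained (in R you would draw the seeds on the master and call set.seed() on each worker). The names make_worker_seeds and worker_draws, the master seed 2009, and the choice of four workers are all invented for the example. The workers are simulated in a single process here; on a real cluster each seed would be sent to its own node.

```python
import random

def make_worker_seeds(master_seed, n_workers):
    """Draw one distinct integer seed per worker from a master generator."""
    master = random.Random(master_seed)
    # Sampling without replacement guarantees no two workers share a seed.
    return master.sample(range(2**31 - 1), n_workers)

def worker_draws(seed, n=3):
    """Simulate one worker: set its own seed, then draw n uniforms."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

seeds = make_worker_seeds(master_seed=2009, n_workers=4)
draws = [worker_draws(s) for s in seeds]

for rank, (s, d) in enumerate(zip(seeds, draws)):
    print(rank, s, d)
```

Fixing the master seed makes the whole run reproducible. Note that distinct seeds only guarantee the streams start differently; they say nothing about subtler dependence between the streams, which is the gap I suspect generators like SPRNG are meant to close.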
Thanks very much, Tom