Bias in R's random integers?

35 messages · Iñaki Ucar, David Hugh-Jones, Ben Bolker +10 more

Messages 1–25 of 35

#
Dear list,

It looks to me that R samples random integers using an intuitive but biased
algorithm: it converts a random number on [0,1) from the PRNG into a random
integer, e.g.
https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808

Many other languages use various rejection-sampling approaches that
provide an unbiased method for sampling, as in Go, Python, and others
described here: https://arxiv.org/abs/1805.10941 (I believe the biased
algorithm currently used in R is also described there).  I'm not an expert
in this area, but does it make sense for R to adopt one of the unbiased
random sampling algorithms outlined there and used in other languages?
Would a patch providing such an algorithm be welcome?  What concerns would
need to be addressed first?
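(For anyone who wants the concrete idea: the rejection approach can be
sketched in a few lines of C.  This is only an illustration of the
technique, not anything from R or the papers above; the xorshift32
stand-in generator and the function names are my own assumptions.)

```c
#include <stdint.h>

/* Throwaway xorshift32 standing in for a real 32-bit PRNG; R itself
   only exposes a double in (0,1), which is part of the difficulty. */
static uint32_t rng_state = 2463534242u;
static uint32_t next32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}

/* Classic rejection sampling for an unbiased integer in [0, n), n > 0:
   discard raw values below 2^32 mod n, so the surviving range has size
   divisible by n and r % n is exactly uniform.  This is the approach
   taken by OpenBSD's arc4random_uniform(). */
uint32_t unbiased_below(uint32_t n)
{
    uint32_t threshold = -n % n;   /* (2^32 - n) % n == 2^32 mod n */
    uint32_t r;
    do {
        r = next32();
    } while (r < threshold);       /* rejects less than 1 draw on average */
    return r % n;
}
```

The loop almost never iterates more than once, since the rejection
region has at most n - 1 of the 2^32 raw values.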

I believe this issue was also raised by Kellie & Philip in
http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more
recently in
https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
pointing to the python implementation for comparison:
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265

Thanks!

Carl
#
On 18/09/2018 5:46 PM, Carl Boettiger wrote:
I think the analyses are correct, but I doubt if a change to the default 
is likely to be accepted as it would make it more difficult to reproduce 
older results.

On the other hand, a contribution of a new function like sample() but 
not suffering from the bias would be good.  The normal way to make such 
a contribution is in a user contributed package.

By the way, R code illustrating the bias is probably not very hard to 
put together.  I believe the bias manifests itself in sample() producing 
values with two different probabilities (instead of all equal 
probabilities).  Those may differ by as much as one part in 2^32.  It's 
very difficult to detect a probability difference that small, but if you 
define the partition of values into the high probability values vs the 
low probability values, you can probably detect the difference in a 
feasible simulation.

Duncan Murdoch
#
On Wed, 19 Sep 2018 at 14:43, Duncan Murdoch
(<murdoch.duncan at gmail.com>) wrote:
According to Kellie and Philip, in the attachment of the thread
referenced by Carl, "The maximum ratio of selection probabilities can
get as large as 1.5 if n is just below 2^31".

Iñaki
#
On Wed, 19 Sep 2018 at 13:43, Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:
I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a
statistical language. As a consumer of R I'd like to think that e.g. my
bootstrapped p values are correct.
Surely if the old results depend on the biased algorithm, then they are
false results?
#
On 2018-09-19 09:40 AM, David Hugh-Jones wrote:
Balancing backward compatibility and correctness is a tough problem
here.  If this goes into base R, what's the best way to do it?  What was
the protocol for migrating away from the "buggy Kinderman-Ramage"
generator, back in the day?   (Version 1.7 was sometime between 2001 and
2004).

  I couldn't find the exact commit in the GitHub mirror: this is related ...

https://github.com/wch/r-source/commit/7ad3044639fd1fe093c655e573fd1a67aa7f55f6#diff-dbcad570d4fb9b7005550ff630543b37



===
'normal.kind' can be '"Kinderman-Ramage"', '"Buggy
     Kinderman-Ramage"' (not for 'set.seed'), '"Ahrens-Dieter"',
     '"Box-Muller"', '"Inversion"' (the default), or '"user-supplied"'.
     (For inversion, see the reference in 'qnorm'.)  The
     Kinderman-Ramage generator used in versions prior to 1.7.0 (now
     called '"Buggy"') had several approximation errors and should only
     be used for reproduction of old results.
#
On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
Sorry, I didn't write very well.  I meant to say that the difference in 
probabilities would be 2^-32, not that the ratio of probabilities would 
be 1 + 2^-32.

By the way, I don't see the statement giving the ratio as 1.5, but maybe 
I was looking in the wrong place.  In Theorem 1 of the paper I was 
looking in the ratio was "1 + m 2^{-w + 1}".  In that formula m is your 
n.  If it is near 2^31, R uses w = 57 random bits, so the ratio would be 
very, very small (one part in 2^25).

The worst case for R would happen when m  is just below  2^25, where w 
is at least 31 for the default generators.  In that case the ratio could 
be about 1.03.

Duncan Murdoch
#
The 53 bits only encode at most 2^{32} possible values, because the source
of the float is the output of a 32-bit PRNG (the obsolete version of MT).
53 bits isn't the relevant number here.

The selection ratios can get close to 2. Computer scientists don't do it
the way R does, for a reason.

Regards,
Philip

On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:

#
On 19/09/2018 9:40 AM, David Hugh-Jones wrote:
All Monte Carlo results contain Monte Carlo error.  Using the biased 
function will have some additional error, but for almost all 
simulations, it will be negligible compared to the Monte Carlo error.  I 
suspect the only simulations where the bias was anywhere near the same 
order of magnitude as the Monte Carlo error would be ones designed with 
this specific code in mind.

Duncan Murdoch
#
That depends on the number of replications, among other things.

Moreover, because of the bias, the usual formulae for uncertainty in
estimates based on random samples, etc., are incorrect: sample() does not
give a simple random sample.

On Wed, Sep 19, 2018 at 9:15 AM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:

#
On 19/09/2018 12:09 PM, Philip B. Stark wrote:
No, two calls to unif_rand() are used.  There are two 32 bit values, but 
some of the bits are thrown away.

Duncan Murdoch
#
No, the 2nd call only happens when m > 2**31. Here's the code:

(RNG.c, lines 793ff)

double R_unif_index(double dn)
{
    double cut = INT_MAX;

    switch(RNG_kind) {
    case KNUTH_TAOCP:
    case USER_UNIF:
    case KNUTH_TAOCP2:
        cut = 33554431.0; /* 2^25 - 1 */
        break;
    default:
        break;
    }

    double u = dn > cut ? ru() : unif_rand();
    return floor(dn * u);
}

On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:

#
On 19/09/2018 12:23 PM, Philip B. Stark wrote:
Yes, you're right. Sorry!

So the ratio really does come close to 2.  However, the difference in 
probabilities between outcomes is still at most 2^-32 when m is less 
than that cutoff.  That's not feasible to detect; the only detectable 
difference would happen if some event was constructed to hold an 
abundance of outcomes with especially low (or especially high) probability.

As I said in my original post, it's probably not hard to construct such 
a thing, but as I've said more recently, it probably wouldn't happen by 
chance.  Here's one attempt to do it:

Call the values from unif_rand() "the unif_rand() outcomes".  Call the 
values from sample() the sample outcomes.

It would be easiest to see the error if half of the sample() outcomes 
used two unif_rand() outcomes, and half used just one.  That would mean 
m should be (2/3) * 2^32, but that's too big and would trigger the other 
version.

So how about half use 2 unif_rands(), and half use 3?  That means m = 
(2/5) * 2^32 ≈ 1717986918.  A good guess is that sample() outcomes would 
alternate between the two possibilities, so our event could be even 
versus odd outcomes.

Let's try it:

 > m <- (2/5)*2^32
 > m > 2^31
[1] FALSE
 > x <- sample(m, 1000000, replace = TRUE)
 > table(x %% 2)

      0      1
399850 600150

Since m is an even number, the true proportions of evens and odds should 
be exactly 0.5.  That's some pretty strong evidence of the bug in the 
generator.  (Note that the ratio of the observed probabilities is about 
1.5, so I may not be the first person to have done this.)

I'm still not convinced that there has ever been a simulation run with 
detectable bias compared to Monte Carlo error unless it (like this one) 
was designed specifically to show the problem.

Duncan Murdoch
#
Hi Duncan--

Nice simulation!

The absolute difference in probabilities is small, but the maximum relative
difference grows from something negligible to almost 2 as m approaches
2**31.

Because the L_1 distance between the uniform distribution on {1, ..., m}
and what you actually get is large, there have to be test functions whose
expectations are quite different under the two distributions. Whether those
correspond to commonly used statistics or not, I have no idea.

Regarding backwards compatibility: as a user, I'd rather the default
sample() do the best possible thing, and take an extra step to use
something like sample(..., legacy=TRUE) if I want to reproduce old results.

Regards,
Philip

On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:

#
One more thing, apropos this:

I'm still not convinced that there has ever been a simulation run with
detectable bias compared to Monte Carlo error unless it (like this one) was
designed specifically to show the problem.


I often use random permutations to simulate p-values to calibrate
permutation tests. If I'm trying to simulate the probability of a
low-probability event, this could matter a lot.

Best wishes,
Philip

On Wed, Sep 19, 2018 at 12:52 PM Philip B. Stark <stark at stat.berkeley.edu>
wrote:

#
On 19/09/2018 3:52 PM, Philip B. Stark wrote:
That is a mathematically true statement, but I suspect it is not very 
relevant.  Pseudo-random number generators always have test functions 
whose sample averages are quite different from the expectation under the 
true distribution.  Remember Von Neumann's "state of sin" quote.  The 
bug in sample() just means it is easier to find such a function than it 
would otherwise be.

The practical question is whether such a function is likely to arise in 
practice or not.

 > Whether those correspond to commonly used statistics or not, I have no
 > idea.

I am pretty confident that this bug rarely matters.
I suspect there's a good chance the bug I discovered today (non-integer 
x values not being truncated) will be declared to be a feature, and the 
documentation will be changed.  Then the rejection sampling approach 
would need to be quite a bit more complicated.

I think a documentation warning about the accuracy of sampling 
probabilities would also be a sufficient fix here, and would be quite a 
bit less trouble than changing the default sample().  But as I said in 
my original post, a contribution of a function without this bug would be 
a nice addition.

Duncan Murdoch
#
It doesn't seem too hard to come up with plausible ways in which this could
give bad results. Suppose I sample rows from a large dataset, maybe for
bootstrapping. Suppose the rows are non-randomly ordered, e.g. odd rows are
males, even rows are females. Oops! Very non-representative sample,
bootstrap p values are garbage.

David

On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:
#
On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
That would only happen if your dataset was exactly 1717986918 elements 
in size. (And in fact, it will be less extreme than I posted:  I had x 
set to 1717986918.4, as described in another thread.  If you use an 
integer value you need a different pattern; add or subtract an element 
or two and the pattern needed to see a problem changes drastically.)

But if you're sampling from a dataset of that exact size, then you 
should worry about this bug. Don't use sample().  Use the algorithm that 
Carl described.

Duncan Murdoch
#
A quick point of order here: arguing with Duncan in this forum is
helpful to expose ideas, but probably neither side will convince the
other; eventually, if you want this adopted in core R, you'll need to
convince an R-core member to pursue this fix.

  In the meantime, a good, well-tested implementation in a
user-contributed package (presumably written in C for speed) would be
enormously useful.  Volunteers ... ?
On 2018-09-19 04:19 PM, Duncan Murdoch wrote:
#
For a well-tested C algorithm, based on my reading of Lemire, the unbiased
"algorithm 3" in https://arxiv.org/abs/1805.10941 is part already of the C
standard library in OpenBSD and macOS (as arc4random_uniform), and in the
GNU standard library.  Lemire also provides C++ code in the appendix of his
piece for both this and the faster "nearly divisionless" algorithm.
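For concreteness, here is the "nearly divisionless" method as I
understand it from the paper; this is my own sketch, not Lemire's code
or a proposed patch, and the xorshift32 stand-in is an assumption for
illustration:

```c
#include <stdint.h>

/* Throwaway xorshift32 standing in for a full-word 32-bit generator. */
static uint32_t rng_state = 88172645u;
static uint32_t next32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}

/* Map a random 32-bit word into [0, s), s > 0, by taking the high bits
   of a 64-bit product; fall back to the expensive % only when the low
   bits land in the small region where bias is possible. */
uint32_t nearly_divisionless(uint32_t s)
{
    uint64_t m = (uint64_t)next32() * s;
    uint32_t l = (uint32_t)m;            /* low 32 bits of the product */
    if (l < s) {                         /* bias possible: check properly */
        uint32_t t = -s % s;             /* 2^32 mod s */
        while (l < t) {                  /* reject and redraw */
            m = (uint64_t)next32() * s;
            l = (uint32_t)m;
        }
    }
    return (uint32_t)(m >> 32);          /* high bits: uniform in [0, s) */
}
```

The attraction over plain modulo-rejection is that the common path does
one multiply and one comparison, with no division at all.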

It would be excellent if any R core members were interested in considering
bindings to these algorithms as a patch, or might express expectations for
how that patch would have to operate (e.g. re Duncan's comment about
non-integer arguments to sample size).  Otherwise, an R package binding
seems like a good starting point, but I'm not the right volunteer.
On Wed, Sep 19, 2018 at 4:22 PM Ben Bolker <bbolker at gmail.com> wrote:

#
On 9/20/18 1:43 AM, Carl Boettiger wrote:
It is difficult to do this in a package, since R does not provide access
to the random bits generated by the RNG. Only a float in (0,1) is
available via unif_rand(). However, if one is willing to use an external
RNG, it is of course possible. After reading about Lemire's work [1], I
had planned to integrate such an unbiased sampling scheme into the dqrng
package, which I have now started. [2]

Using Duncan's example, the results look much better:
0      1
500252 499748

Currently I am taking the other interpretation of "truncated":
0      1
499894 500106

I will adjust this to whatever is decided for base R.


However, there is currently neither long vector nor weighted sampling
support. And the performance without replacement is quite bad compared
to R's algorithm with hashing.

cheerio
ralf

[1] via http://www.pcg-random.org/posts/bounded-rands.html
[2] https://github.com/daqana/dqrng/tree/feature/sample
#
On 20/09/2018 6:59 AM, Ralf Stubner wrote:
I believe it is safe to multiply the unif_rand() value by 2^32, and take 
the whole number part as an unsigned 32 bit integer.  Depending on the 
RNG in use, that will give at least 25 random bits.  (The low order bits 
are the questionable ones.  25 is just a guess, not a guarantee.)

Another useful diagnostic is

   plot(density(y[y %% 2 == 0]))

Obviously that should give a more or less uniform density, but for 
values near m, the default sample() gives some nice pretty pictures of 
quite non-uniform densities.

By the way, there are actually quite a few examples of very large m 
besides m = (2/5)*2^32 where performance of sample() is noticeably bad. 
You'll see problems in y %% 2 for m = 2/(1 + 2a) * 2^32 with any integer 
a > 1, problems in y %% 3 for m = 3/(1 + 3a) * 2^32 or m = 3/(2 + 3a) * 2^32, 
etc.

So perhaps I'm starting to be convinced that the default sample() should 
be fixed.

Duncan Murdoch
#
On 09/19/2018 10:03 AM, Ben Bolker wrote:
...
I think improvements in the RNG is a situation where backward 
compatibility is not really going to be lost, because people can specify 
the old generator, they just will not get it by default. My opinion is 
that the default needs to generally be the best option available because 
too many people will be expecting that, or not know better, in which 
case that is what they should get.

There are only two small problems that occur to me:

1/ Researchers that want to have reproducible results (all I hope) need 
to be aware the change has happened. In theory they should have recorded 
the RNG they were using, along with the seed (and, BTW, the number of 
nodes if they generate with a parallel generator). If they have not done 
that then they can figure out the RNG from knowing what version of R 
they used. If they haven't recorded that then they can figure it out by 
some experimentation and knowing roughly when they did the research. If 
none of this works then the research probably should be lost.

As an exercise, researchers might also want to experiment with whether 
the new default qualitatively changes their results. That might lead to 
publishable research, so no one should complain.

2/ Package maintainers that have used the default RNG to generate tests 
may need to change their tests to specify the old generator, or modify 
results used for comparisons in the tests. Since package testing is 
usually for code checking rather than statistical results, not using the 
best available generator is not usually an issue.

Most of my own package testing already specifies the generator; much of it 
uses "buggy Kinderman-Ramage" because tests were set up a long time ago. 
I will have to change package setRNG which warns when the default 
generator changes. (This warning is intentional because I was bitten 
badly by a small change in the S generator circa 1990.)
I think there may have been a change in R 0.99 too. At least my notes 
suggest that the code I changed for  R 1.7.0 had worked with the default 
generator from R 0.99 to 1.6.2.

I don't recall the protocol, I think it just happened and was announced 
in the NEWS. (Has this protocol changed?) The ramification for me was 
that I had to go through all of my packages' testing and change the name 
of the explicitly specified RNG to "buggy Kinderman-Ramage".

Perhaps there does need to be a protocol for testing before release. 
When my package setRNG fails then many of my other packages will also 
fail because they depend on it. This is a simple fix but reverse 
dependencies may make it look like lots of things are broken.

Paul Gilbert
#
Hi all,
On Thu, Sep 20, 2018 at 9:30 AM, Paul Gilbert <pgilbert902 at gmail.com> wrote:
I was going to suggest helper/convenience functions for this but looking at
?RNG I see that someone has already put in RNGversion which *sets* the RNG
kind to what was used by default in an older version. I do wonder if there
is still value in a function that would *return* it, e.g. for comparisons.
Perhaps RNGversionstr?

Also, would R-core members be interested in a small patch to sessionInfo()
and print.sessionInfo() which makes it so that the current RNGkind is
captured and displayed (respectively) by the sessionInfo machinery? I can
prepare one if so.

Best,
~G
#
Hi,

Note that it wouldn't be the first time that sample() changes behavior
in a non-backward compatible way:

   https://stat.ethz.ch/pipermail/r-devel/2012-October/065049.html

Cheers,
H.
On 09/20/2018 08:15 AM, Duncan Murdoch wrote:

#
Hello,
On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
I find this discussion fascinating. I normally test random numbers in 
different languages every now and again using various methods. One simple 
check that I do is to use Michal Zalewski's method when he studied Strange 
Attractors and Initial TCP/IP Sequence Numbers:

http://lcamtuf.coredump.cx/newtcp/
https://pdfs.semanticscholar.org/adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf

The technique works by mapping the dynamics of the generated numbers into a 
three-dimensional phase space. This is then plotted in a graph so that you 
can visually see if something odd is going on.

I used   runif(10000, min = 0, max = 65535)  to get a set of numbers. This is 
the resulting plot that was generated from R's numbers using this technique:

http://people.redhat.com/sgrubb/files/r-random.jpg

And for comparison this was generated by collecting the same number of 
samples from the bash shell:

http://people.redhat.com/sgrubb/files/bash-random.jpg

The net result is that it shows some banding in the R generated random 
numbers where bash has uniform random numbers with no discernible pattern.

Best Regards,
-Steve