Bias in R's random integers? - R-devel

Fri, Sep 21, 2018 7:15 AM #

OK.  Thanks.  I see there is a real problem.

One option to fix it while mostly retaining backwards-compatibility
would be to add extra bits from a second RNG call only when m is large
- eg, larger than 2^27.  That would retain reproducibility for most
analyses of small to moderate size data sets.  Of course, there would
still be some small, detectable error for values a bit less than 2^27,
but perhaps that's tolerable.  (The 2^27 threshold could obviously be
debated.)

R Core made a similar decision in the case of sampling with
replacement when implementing a new hashing algorithm that produces
different results.  It is enabled by default only when m > 1e7 and no
more than half the values are to be sampled, as was noted:

That incompatibility could have been avoided.  A year ago I posted
a fast hashing algorithm that produces the same results as the simple
algorithm, here:

  https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html

The latest version of this will be in the soon-to-be new release of
pqR, and will of course enabled automatically whenever it seems
desirable, for a considerable speed gain in many cases.

  Radford Neal