Prev 55894 / 63424 Next

Bias in R's random integers?

Philip B. Stark

Thu, Sep 20, 2018 7:58 PM

The same issue occurs in walker_ProbSampleReplace() in random.c, lines
386-387:

rU = unif_rand() * n;
k = (int) rU;

Cheers,
Philip

On Wed, Sep 19, 2018 at 3:08 PM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:

On 19/09/2018 5:57 PM, David Hugh-Jones wrote:

It doesn't seem too hard to come up with plausible ways in which this
could give bad results. Suppose I sample rows from a large dataset,
maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g.
odd rows are males, even rows are females. Oops! Very non-representative
sample, bootstrap p values are garbage.

That would only happen if your dataset was exactly 1717986918 elements
in size. (And in fact, it will be less extreme than I posted:  I had x
set to 1717986918.4, as described in another thread.  If you use an
integer value you need a different pattern; add or subtract an element
or two and the pattern needed to see a problem changes drastically.)

But if you're sampling from a dataset of that exact size, then you
should worry about this bug. Don't use sample().  Use the algorithm that
Carl described.

Duncan Murdoch

David

On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>> wrote:

    On 19/09/2018 3:52 PM, Philip B. Stark wrote:

     > Hi Duncan--
     >
     >

    That is a mathematically true statement, but I suspect it is not very
    relevant.  Pseudo-random number generators always have test functions
    whose sample averages are quite different from the expectation under
    the
    true distribution.  Remember Von Neumann's "state of sin" quote.  The
    bug in sample() just means it is easier to find such a function than

it

    would otherwise be.

    The practical question is whether such a function is likely to arise

in

    practice or not.

      > Whether those correspond to commonly used statistics or not, I

    have no

      > idea.

    I am pretty confident that this bug rarely matters.

     > Regarding backwards compatibility: as a user, I'd rather the

default

     > sample() do the best possible thing, and take an extra step to use
     > something like sample(..., legacy=TRUE) if I want to reproduce

    old results.

    I suspect there's a good chance the bug I discovered today

(non-integer

    x values not being truncated) will be declared to be a feature, and

the

    documentation will be changed.  Then the rejection sampling approach
    would need to be quite a bit more complicated.

    I think a documentation warning about the accuracy of sampling
    probabilities would also be a sufficient fix here, and would be

quite a

    bit less trouble than changing the default sample().  But as I said

in

    my original post, a contribution of a function without this bug
    would be
    a nice addition.

    Duncan Murdoch

     >
     > Regards,
     > Philip
     >
     > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch

    <murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>

     > <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>>> wrote:

     >
     >     On 19/09/2018 12:23 PM, Philip B. Stark wrote:

     >      > No, the 2nd call only happens when m > 2**31. Here's the

code:

     >
     >     Yes, you're right. Sorry!
     >
     >     So the ratio really does come close to 2.  However, the

    difference in

     >     probabilities between outcomes is still at most 2^-32 when m

    is less

     >     than that cutoff.  That's not feasible to detect; the only

    detectable

     >     difference would happen if some event was constructed to hold

an

     >     abundance of outcomes with especially low (or especially high)
     >     probability.
     >
     >     As I said in my original post, it's probably not hard to

    construct such

     >     a thing, but as I've said more recently, it probably wouldn't

    happen by

     >     chance.  Here's one attempt to do it:
     >
     >     Call the values from unif_rand() "the unif_rand() outcomes".

    Call the

     >     values from sample() the sample outcomes.
     >
     >     It would be easiest to see the error if half of the sample()

    outcomes

     >     used two unif_rand() outcomes, and half used just one.  That

    would mean

     >     m should be (2/3) * 2^32, but that's too big and would

    trigger the

     >     other
     >     version.
     >
     >     So how about half use 2 unif_rands(), and half use 3?  That

    means m =

     >     (2/5) * 2^32 = 1717986918.  A good guess is that sample()

    outcomes

     >     would
     >     alternate between the two possibilities, so our event could

    be even

     >     versus odd outcomes.
     >
     >     Let's try it:
     >

     >       > m <- (2/5)*2^32
     >       > m > 2^31

     >     [1] FALSE

     >       > x <- sample(m, 1000000, replace = TRUE)
     >       > table(x %% 2)

     >
     >            0      1
     >     399850 600150
     >
     >     Since m is an even number, the true proportions of evens and

odds

     >     should
     >     be exactly 0.5.  That's some pretty strong evidence of the

    bug in the

     >     generator.  (Note that the ratio of the observed

    probabilities is about

     >     1.5, so I may not be the first person to have done this.)
     >
     >     I'm still not convinced that there has ever been a simulation

    run with

     >     detectable bias compared to Monte Carlo error unless it (like

    this one)

     >     was designed specifically to show the problem.
     >
     >     Duncan Murdoch
     >

     >      >
     >      > (RNG.c, lines 793ff)
     >      >
     >      > double R_unif_index(double dn)
     >      > {
     >      >      double cut = INT_MAX;
     >      >
     >      >      switch(RNG_kind) {
     >      >      case KNUTH_TAOCP:
     >      >      case USER_UNIF:
     >      >      case KNUTH_TAOCP2:
     >      > cut = 33554431.0; /* 2^25 - 1 */
     >      > break;
     >      >      default:
     >      > break;
     >      >     }
     >      >
     >      >      double u = dn > cut ? ru() : unif_rand();
     >      >      return floor(dn * u);
     >      > }
     >      >
     >      > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch

     >     <murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>

    <mailto:murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>>

     >      > <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>

     >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>>>> wrote:

     >      >
     >      >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:

     >      >      > The 53 bits only encode at most 2^{32} possible

values,

     >     because the

     >      >      > source of the float is the output of a 32-bit PRNG

(the

     >     obsolete

     >      >     version

     >      >      > of MT). 53 bits isn't the relevant number here.

     >      >
     >      >     No, two calls to unif_rand() are used.  There are two

    32 bit

     >     values,

     >      >     but
     >      >     some of the bits are thrown away.
     >      >
     >      >     Duncan Murdoch
     >      >

     >      >      >
     >      >      > The selection ratios can get close to 2. Computer

    scientists

     >      >     don't do it

     >      >      > the way R does, for a reason.
     >      >      >
     >      >      > Regards,
     >      >      > Philip
     >      >      >
     >      >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch

     >      >     <murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com> <mailto:murdoch.duncan at gmail.com
    <mailto:murdoch.duncan at gmail.com>>

     >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com> <mailto:murdoch.duncan at gmail.com
    <mailto:murdoch.duncan at gmail.com>>>

     >      >      > <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>

     >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>>

     >      >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>

     >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>>>>> wrote:

     >      >      >
     >      >      >     On 19/09/2018 9:09 AM, I?aki Ucar wrote:

     >      >      >      > El mi?., 19 sept. 2018 a las 14:43, Duncan

    Murdoch

     >      >      >      > (<murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>

     >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>>

     >      >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>

     >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>>> <mailto:murdoch.duncan at gmail.com
    <mailto:murdoch.duncan at gmail.com>

     >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>>

     >      >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>

     >     <mailto:murdoch.duncan at gmail.com

    <mailto:murdoch.duncan at gmail.com>>>>>)

     >      >      >     escribi?:

     >      >      >      >>
     >      >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:

     >      >      >      >>> Dear list,
     >      >      >      >>>
     >      >      >      >>> It looks to me that R samples random

integers

     >     using an

     >      >      >     intuitive but biased

     >      >      >      >>> algorithm by going from a random number on

    [0,1) from

     >      >     the PRNG

     >      >      >     to a random

     >      >      >      >>> integer, e.g.
     >      >      >      >>>

     >      >      >

https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808

     >      >      >      >>>
     >      >      >      >>> Many other languages use various rejection

    sampling

     >      >     approaches

     >      >      >     which

     >      >      >      >>> provide an unbiased method for sampling,

    such as

     >     in Go,

     >      >     python,

     >      >      >     and others

     >      >      >      >>> described here:

    https://arxiv.org/abs/1805.10941 (I

     >      >     believe the

     >      >      >     biased

     >      >      >      >>> algorithm currently used in R is also

    described

     >     there).  I'm

     >      >      >     not an expert

     >      >      >      >>> in this area, but does it make sense for

    the R to

     >     adopt

     >      >     one of

     >      >      >     the unbiased

     >      >      >      >>> random sample algorithms outlined there

    and used

     >     in other

     >      >      >     languages?  Would

     >      >      >      >>> a patch providing such an algorithm be

    welcome? What

     >      >     concerns

     >      >      >     would need to

     >      >      >      >>> be addressed first?
     >      >      >      >>>
     >      >      >      >>> I believe this issue was also raised by

    Killie &

     >     Philip in

     >      >      >      >>>

     >      >

    http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and

     >      >      >     more

     >      >      >      >>> recently in
     >      >      >      >>>

     >      >      >

     >      >

    https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
    <

https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>

https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>

     >      >

https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>

     >      >      >

     >      >

https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,

     >      >      >      >>> pointing to the python implementation for

    comparison:

     >      >      >      >>>

     >      >      >

     >      >

https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265

     >      >      >      >>
     >      >      >      >> I think the analyses are correct, but I

    doubt if a

     >     change

     >      >     to the

     >      >      >     default

     >      >      >      >> is likely to be accepted as it would make

    it more

     >      >     difficult to

     >      >      >     reproduce

     >      >      >      >> older results.
     >      >      >      >>
     >      >      >      >> On the other hand, a contribution of a new

     >     function like

     >      >      >     sample() but

     >      >      >      >> not suffering from the bias would be good.

The

     >     normal way to

     >      >      >     make such

     >      >      >      >> a contribution is in a user contributed

    package.

     >      >      >      >>
     >      >      >      >> By the way, R code illustrating the bias is

     >     probably not very

     >      >      >     hard to

     >      >      >      >> put together.  I believe the bias manifests

    itself in

     >      >     sample()

     >      >      >     producing

     >      >      >      >> values with two different probabilities

    (instead

     >     of all equal

     >      >      >      >> probabilities).  Those may differ by as

much as

     >     one part in

     >      >      >     2^32.  It's

     >      >      >      >
     >      >      >      > According to Kellie and Philip, in the

    attachment

     >     of the

     >      >     thread

     >      >      >      > referenced by Carl, "The maximum ratio of

    selection

     >      >     probabilities can

     >      >      >      > get as large as 1.5 if n is just below 2^31".

     >      >      >
     >      >      >     Sorry, I didn't write very well.  I meant to

    say that the

     >      >     difference in

     >      >      >     probabilities would be 2^-32, not that the

ratio of

     >      >     probabilities would

     >      >      >     be 1 + 2^-32.
     >      >      >
     >      >      >     By the way, I don't see the statement giving

    the ratio as

     >      >     1.5, but

     >      >      >     maybe
     >      >      >     I was looking in the wrong place.  In Theorem 1

    of the

     >     paper

     >      >     I was

     >      >      >     looking in the ratio was "1 + m 2^{-w + 1}".

    In that

     >     formula

     >      >     m is your

     >      >      >     n.  If it is near 2^31, R uses w = 57 random

    bits, so

     >     the ratio

     >      >      >     would be
     >      >      >     very, very small (one part in 2^25).
     >      >      >
     >      >      >     The worst case for R would happen when m  is

    just below

     >      >     2^25, where w

     >      >      >     is at least 31 for the default generators.  In

that

     >     case the

     >      >     ratio

     >      >      >     could
     >      >      >     be about 1.03.
     >      >      >
     >      >      >     Duncan Murdoch
     >      >      >
     >      >      >
     >      >      >
     >      >      > --
     >      >      > Philip B. Stark | Associate Dean, Mathematical and

    Physical

     >      >     Sciences |

     >      >      > Professor,  Department of Statistics |
     >      >      > University of California
     >      >      > Berkeley, CA 94720-3860 | 510-394-5077 |

     >      > statistics.berkeley.edu/~stark

    <http://statistics.berkeley.edu/%7Estark>

     >     <http://statistics.berkeley.edu/%7Estark>

     >      >     <http://statistics.berkeley.edu/%7Estark>

     >      >      > <http://statistics.berkeley.edu/%7Estark> |
     >      >      > @philipbstark
     >      >      >

     >      >
     >      >
     >      >
     >      > --
     >      > Philip B. Stark | Associate Dean, Mathematical and Physical

     >     Sciences |

     >      > Professor,  Department of Statistics |
     >      > University of California
     >      > Berkeley, CA 94720-3860 | 510-394-5077 |

     > statistics.berkeley.edu/~stark

    <http://statistics.berkeley.edu/%7Estark>

     >     <http://statistics.berkeley.edu/%7Estark>

     >      > <http://statistics.berkeley.edu/%7Estark> |
     >      > @philipbstark
     >      >

     >
     >
     >
     > --
     > Philip B. Stark | Associate Dean, Mathematical and Physical

    Sciences |

     > Professor,  Department of Statistics |
     > University of California
     > Berkeley, CA 94720-3860 | 510-394-5077 |

    statistics.berkeley.edu/~stark
    <http://statistics.berkeley.edu/%7Estark>

     > <http://statistics.berkeley.edu/%7Estark> |
     > @philipbstark
     >

    ______________________________________________
    R-devel at r-project.org <mailto:R-devel at r-project.org> mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel

--
Sent from Gmail Mobile

Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark

	[[alternative HTML version deleted]]

Thread (35 messages)

Carl Boettiger Bias in R's random integers? Sep 18 Duncan Murdoch Bias in R's random integers? Sep 19 Iñaki Ucar Bias in R's random integers? Sep 19 David Hugh-Jones Bias in R's random integers? Sep 19 Ben Bolker Bias in R's random integers? Sep 19 Duncan Murdoch Bias in R's random integers? Sep 19 Philip B. Stark Bias in R's random integers? Sep 19 Duncan Murdoch Bias in R's random integers? Sep 19 Philip B. Stark Bias in R's random integers? Sep 19 Duncan Murdoch Bias in R's random integers? Sep 19 Philip B. Stark Bias in R's random integers? Sep 19 Duncan Murdoch Bias in R's random integers? Sep 19 Philip B. Stark Bias in R's random integers? Sep 19 Philip B. Stark Bias in R's random integers? Sep 19 Duncan Murdoch Bias in R's random integers? Sep 19 David Hugh-Jones Bias in R's random integers? Sep 19 Duncan Murdoch Bias in R's random integers? Sep 19 Ben Bolker Bias in R's random integers? Sep 19 Carl Boettiger Bias in R's random integers? Sep 19 Ralf Stubner Bias in R's random integers? Sep 20 Duncan Murdoch Bias in R's random integers? Sep 20 Paul Gilbert Bias in R's random integers? Sep 20 Gabriel Becker Bias in R's random integers? Sep 20 Hervé Pagès Bias in R's random integers? Sep 20 Steve Grubb Bias in R's random integers? Sep 20 Philip B. Stark Bias in R's random integers? Sep 20 Ralf Stubner Bias in R's random integers? Sep 21 Steve Grubb Bias in R's random integers? Sep 21 Dirk Eddelbuettel Bias in R's random integers? Sep 21 Dirk Eddelbuettel Bias in R's random integers? Sep 21 Luke Tierney Bias in R's random integers? Sep 21 Ralf Stubner Bias in R's random integers? Sep 21 Steve Grubb Bias in R's random integers? Sep 21 Steve Grubb Bias in R's random integers? Sep 21 Ralf Stubner Bias in R's random integers? Sep 27