David
On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>> wrote:
On 19/09/2018 3:52 PM, Philip B. Stark wrote:
That is a mathematically true statement, but I suspect it is not very
relevant.? Pseudo-random number generators always have test functions
whose sample averages are quite different from the expectation under
the
true distribution.? Remember Von Neumann's "state of sin" quote.? The
bug in sample() just means it is easier to find such a function than it
would otherwise be.
The practical question is whether such a function is likely to arise in
practice or not.
?> Whether those correspond to commonly used statistics or not, I
have no
?> idea.
I am pretty confident that this bug rarely matters.
> Regarding backwards compatibility: as a user, I'd rather the default
> sample() do the best possible thing, and take an extra step to use
> something like sample(..., legacy=TRUE) if I want to reproduce
old results.
I suspect there's a good chance the bug I discovered today (non-integer
x values not being truncated) will be declared to be a feature, and the
documentation will be changed.? Then the rejection sampling approach
would need to be quite a bit more complicated.
I think a documentation warning about the accuracy of sampling
probabilities would also be a sufficient fix here, and would be quite a
bit less trouble than changing the default sample().? But as I said in
my original post, a contribution of a function without this bug
would be
a nice addition.
Duncan Murdoch
>
> Regards,
> Philip
>
> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
<murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>
> <mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>> wrote:
>
>? ? ?On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>? ? ? > No, the 2nd call only happens when m > 2**31. Here's the code:
>
>? ? ?Yes, you're right. Sorry!
>
>? ? ?So the ratio really does come close to 2.? However, the
>? ? ?probabilities between outcomes is still at most 2^-32 when m
>? ? ?than that cutoff.? That's not feasible to detect; the only
>? ? ?difference would happen if some event was constructed to hold an
>? ? ?abundance of outcomes with especially low (or especially high)
>? ? ?probability.
>
>? ? ?As I said in my original post, it's probably not hard to
>? ? ?a thing, but as I've said more recently, it probably wouldn't
>? ? ?chance.? Here's one attempt to do it:
>
>? ? ?Call the values from unif_rand() "the unif_rand() outcomes".
>? ? ?values from sample() the sample outcomes.
>
>? ? ?It would be easiest to see the error if half of the sample()
>? ? ?used two unif_rand() outcomes, and half used just one.? That
>? ? ?m should be (2/3) * 2^32, but that's too big and would
>? ? ?other
>? ? ?version.
>
>? ? ?So how about half use 2 unif_rands(), and half use 3?? That
>? ? ?(2/5) * 2^32 = 1717986918.? A good guess is that sample()
>? ? ?would
>? ? ?alternate between the two possibilities, so our event could
>? ? ?versus odd outcomes.
>
>? ? ?Let's try it:
>
>? ? ? ?> m <- (2/5)*2^32
>? ? ? ?> m > 2^31
>? ? ?[1] FALSE
>? ? ? ?> x <- sample(m, 1000000, replace = TRUE)
>? ? ? ?> table(x %% 2)
>
>? ? ? ? ? ? 0? ? ? 1
>? ? ?399850 600150
>
>? ? ?Since m is an even number, the true proportions of evens and odds
>? ? ?should
>? ? ?be exactly 0.5.? That's some pretty strong evidence of the
>? ? ?generator.? (Note that the ratio of the observed
>? ? ?1.5, so I may not be the first person to have done this.)
>
>? ? ?I'm still not convinced that there has ever been a simulation
>? ? ?detectable bias compared to Monte Carlo error unless it (like
>? ? ?was designed specifically to show the problem.
>
>? ? ?Duncan Murdoch
>
>? ? ? >
>? ? ? > (RNG.c, lines 793ff)
>? ? ? >
>? ? ? > double R_unif_index(double dn)
>? ? ? > {
>? ? ? >? ? ? double cut = INT_MAX;
>? ? ? >
>? ? ? >? ? ? switch(RNG_kind) {
>? ? ? >? ? ? case KNUTH_TAOCP:
>? ? ? >? ? ? case USER_UNIF:
>? ? ? >? ? ? case KNUTH_TAOCP2:
>? ? ? > cut = 33554431.0; /* 2^25 - 1 */
>? ? ? > break;
>? ? ? >? ? ? default:
>? ? ? > break;
>? ? ? >? ? ?}
>? ? ? >
>? ? ? >? ? ? double u = dn > cut ? ru() : unif_rand();
>? ? ? >? ? ? return floor(dn * u);
>? ? ? > }
>? ? ? >
>? ? ? > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
>? ? ?<murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>
<mailto:murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>>
>? ? ? > <mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>
>? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>>> wrote:
>? ? ? >
>? ? ? >? ? ?On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>? ? ? >? ? ? > The 53 bits only encode at most 2^{32} possible values,
>? ? ?because the
>? ? ? >? ? ? > source of the float is the output of a 32-bit PRNG (the
>? ? ?obsolete
>? ? ? >? ? ?version
>? ? ? >? ? ? > of MT). 53 bits isn't the relevant number here.
>? ? ? >
>? ? ? >? ? ?No, two calls to unif_rand() are used.? There are two
>? ? ?values,
>? ? ? >? ? ?but
>? ? ? >? ? ?some of the bits are thrown away.
>? ? ? >
>? ? ? >? ? ?Duncan Murdoch
>? ? ? >
>? ? ? >? ? ? >
>? ? ? >? ? ? > The selection ratios can get close to 2. Computer
>? ? ? >? ? ?don't do it
>? ? ? >? ? ? > the way R does, for a reason.
>? ? ? >? ? ? >
>? ? ? >? ? ? > Regards,
>? ? ? >? ? ? > Philip
>? ? ? >? ? ? >
>? ? ? >? ? ? > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>? ? ? >? ? ?<murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com> <mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>
>? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com> <mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>>
>? ? ? >? ? ? > <mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>
>? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>
>? ? ? >? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>
>? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>>>> wrote:
>? ? ? >? ? ? >
>? ? ? >? ? ? >? ? ?On 19/09/2018 9:09 AM, I?aki Ucar wrote:
>? ? ? >? ? ? >? ? ? > El mi?., 19 sept. 2018 a las 14:43, Duncan
>? ? ? >? ? ? >? ? ? > (<murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>
>? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>
>? ? ? >? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>
>? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>> <mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>
>? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>
>? ? ? >? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>
>? ? ?<mailto:murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>>>>>)
>? ? ? >? ? ? >? ? ?escribi?:
>? ? ? >? ? ? >? ? ? >>
>? ? ? >? ? ? >? ? ? >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>? ? ? >? ? ? >? ? ? >>> Dear list,
>? ? ? >? ? ? >? ? ? >>>
>? ? ? >? ? ? >? ? ? >>> It looks to me that R samples random integers
>? ? ?using an
>? ? ? >? ? ? >? ? ?intuitive but biased
>? ? ? >? ? ? >? ? ? >>> algorithm by going from a random number on
>? ? ? >? ? ?approaches
>? ? ? >? ? ? >? ? ?which
>? ? ? >? ? ? >? ? ? >>> provide an unbiased method for sampling,
>? ? ?in Go,
>? ? ? >? ? ?python,
>? ? ? >? ? ? >? ? ?and others
>? ? ? >? ? ? >? ? ? >>> described here:
>? ? ? >? ? ?believe the
>? ? ? >? ? ? >? ? ?biased
>? ? ? >? ? ? >? ? ? >>> algorithm currently used in R is also
>? ? ?there).? I'm
>? ? ? >? ? ? >? ? ?not an expert
>? ? ? >? ? ? >? ? ? >>> in this area, but does it make sense for
>? ? ?adopt
>? ? ? >? ? ?one of
>? ? ? >? ? ? >? ? ?the unbiased
>? ? ? >? ? ? >? ? ? >>> random sample algorithms outlined there
>? ? ?in other
>? ? ? >? ? ? >? ? ?languages?? Would
>? ? ? >? ? ? >? ? ? >>> a patch providing such an algorithm be
>? ? ? >? ? ?concerns
>? ? ? >? ? ? >? ? ?would need to
>? ? ? >? ? ? >? ? ? >>> be addressed first?
>? ? ? >? ? ? >? ? ? >>>
>? ? ? >? ? ? >? ? ? >>> I believe this issue was also raised by
>? ? ?Philip in
>? ? ? >? ? ? >? ? ? >>>
>? ? ? >
>? ? ? >? ? ? >? ? ?more
>? ? ? >? ? ? >? ? ? >>> recently in
>? ? ? >? ? ? >? ? ? >>>
>? ? ? >? ? ? >
>? ? ? >
>
>? ? ? >? ? ? >
>? ? ? >
>
>? ? ? >? ? ? >? ? ? >>> pointing to the python implementation for
>? ? ? >? ? ? >? ? ? >>>
>? ? ? >? ? ? >
>? ? ? >
>
>? ? ? >? ? ? >? ? ? >>
>? ? ? >? ? ? >? ? ? >> I think the analyses are correct, but I
>? ? ?change
>? ? ? >? ? ?to the
>? ? ? >? ? ? >? ? ?default
>? ? ? >? ? ? >? ? ? >> is likely to be accepted as it would make
>? ? ? >? ? ?difficult to
>? ? ? >? ? ? >? ? ?reproduce
>? ? ? >? ? ? >? ? ? >> older results.
>? ? ? >? ? ? >? ? ? >>
>? ? ? >? ? ? >? ? ? >> On the other hand, a contribution of a new
>? ? ?function like
>? ? ? >? ? ? >? ? ?sample() but
>? ? ? >? ? ? >? ? ? >> not suffering from the bias would be good.? The
>? ? ?normal way to
>? ? ? >? ? ? >? ? ?make such
>? ? ? >? ? ? >? ? ? >> a contribution is in a user contributed
>? ? ? >? ? ? >? ? ? >>
>? ? ? >? ? ? >? ? ? >> By the way, R code illustrating the bias is
>? ? ?probably not very
>? ? ? >? ? ? >? ? ?hard to
>? ? ? >? ? ? >? ? ? >> put together.? I believe the bias manifests
>? ? ? >? ? ?sample()
>? ? ? >? ? ? >? ? ?producing
>? ? ? >? ? ? >? ? ? >> values with two different probabilities
>? ? ?of all equal
>? ? ? >? ? ? >? ? ? >> probabilities).? Those may differ by as much as
>? ? ?one part in
>? ? ? >? ? ? >? ? ?2^32.? It's
>? ? ? >? ? ? >? ? ? >
>? ? ? >? ? ? >? ? ? > According to Kellie and Philip, in the
>? ? ?of the
>? ? ? >? ? ?thread
>? ? ? >? ? ? >? ? ? > referenced by Carl, "The maximum ratio of
>? ? ? >? ? ?probabilities can
>? ? ? >? ? ? >? ? ? > get as large as 1.5 if n is just below 2^31".
>? ? ? >? ? ? >
>? ? ? >? ? ? >? ? ?Sorry, I didn't write very well.? I meant to
>? ? ? >? ? ?difference in
>? ? ? >? ? ? >? ? ?probabilities would be 2^-32, not that the ratio of
>? ? ? >? ? ?probabilities would
>? ? ? >? ? ? >? ? ?be 1 + 2^-32.
>? ? ? >? ? ? >
>? ? ? >? ? ? >? ? ?By the way, I don't see the statement giving
>? ? ? >? ? ?1.5, but
>? ? ? >? ? ? >? ? ?maybe
>? ? ? >? ? ? >? ? ?I was looking in the wrong place.? In Theorem 1
>? ? ?paper
>? ? ? >? ? ?I was
>? ? ? >? ? ? >? ? ?looking in the ratio was "1 + m 2^{-w + 1}".
>? ? ?formula
>? ? ? >? ? ?m is your
>? ? ? >? ? ? >? ? ?n.? If it is near 2^31, R uses w = 57 random
>? ? ?the ratio
>? ? ? >? ? ? >? ? ?would be
>? ? ? >? ? ? >? ? ?very, very small (one part in 2^25).
>? ? ? >? ? ? >
>? ? ? >? ? ? >? ? ?The worst case for R would happen when m? is
>? ? ? >? ? ?2^25, where w
>? ? ? >? ? ? >? ? ?is at least 31 for the default generators.? In that
>? ? ?case the
>? ? ? >? ? ?ratio
>? ? ? >? ? ? >? ? ?could
>? ? ? >? ? ? >? ? ?be about 1.03.
>? ? ? >? ? ? >
>? ? ? >? ? ? >? ? ?Duncan Murdoch
>? ? ? >? ? ? >
>? ? ? >? ? ? >
>? ? ? >? ? ? >
>? ? ? >? ? ? > --
>? ? ? >? ? ? > Philip B. Stark | Associate Dean, Mathematical and
>? ? ? >? ? ?Sciences |
>? ? ? >? ? ? > Professor, ?Department of Statistics |
>? ? ? >? ? ? > University of California
>? ? ? >? ? ? > Berkeley, CA 94720-3860 | 510-394-5077 |
>? ? ? > statistics.berkeley.edu/~stark
> Professor, ?Department of Statistics |
> University of California
> Berkeley, CA 94720-3860 | 510-394-5077 |