Zipf random number generation

Sorry. 

c = sum(1.0 / pow((double) j, alpha)), j=1,2,...,N
Understood! Though I think you mean

  1/c = sum(1.0 / pow((double) j, alpha)), j=1,2,...,N

Anyway, provided alpha > 1 the series converges for the
infinite sum, but if you limit the range to 1...N then
you could have any value of alpha (even 0 or negative).

Be that as it may, I don't know of any R function for
sampling such a distribution directly, but you could
certainly try the suggestion I gave at the end of my
previous response:

  N <- 10000 ##For instance. You need N large enough
  i0 <- (1:N) 
  p <- 1/(i0^alpha) ; p <- p/sum(p)
  X <- sample (i0, n, replace=TRUE, prob=p)

Then X will be a random sample of size n from the
distribution you describe on the range 1:N.

When I tried the above with alpha=1.1, N=10000, n=1000
the largest value of X was 9928, i.e. very nearly
equal to N, so this suggested that N=10000 may not be
large enough! So I then tried it with N=50000 and got
max(X) = 48303; then with N=100000 and max(X) = 94914.
And so on (at this point I was hitting swap-space on
this little machine). In short: with alpha so near 1,
the sample will tend to extend over the whole range
you give it when you specify N.

This is an indication that the simple-minded method
of my suggestion may fail to provide sufficiently large
sampled values when alpha is near 1.

However, since the "common practices" in your area,
as you describe them below, do set a value of N, then
presumably there is also a "common view" of what limit
for N is acceptable -- i.e. in your area it does not
matter if you fail to sample values larger than N.

In that case the above method should be satisfactory.

If not, then some other approach would be needed.
You certainly don't want to use the above method with N
approaching the largest machine-representable integer!

One possibility might be to split the sampling.
Let alpha > 1. Choose N fairly large. You would need
a method for evaluating sum(i^alpha) from 1 to infinity.
Let P = sum[1:N](i^alpha)/sum[1:inf](i^alpha).

With probability P, sample on (1:N) with the above method.
With probability (1-P), use a continuous approximation
on [(N+1):inf] with density proportional to 1/(x^alpha).
For the latter, the CDF is easily evaluated analytically;
likewise its inverse function. Hence by sampling from a
uniform distribution you can sample X from the distribution
with density proportional to 1/(x^alpha) on [(N+1):inf].
Then get the sampled value of i as floor(X).

This is in interesting question! I hope other readers can
suggest good ideas.

Hoping this helps,
Ted.
 --- Weiguang Shi <wgshi2001 at yahoo.ca> wrote: 
Thanks very much Ted for the detailed explanation.

It might be flawed but some "common practices" in my
area use the following approach to generate a random
rank for some Zipf-like distribution with the
parameter alpha.

The key is to introduce the limit of ranks, N. With
N
and alpha, therefore, the probability of rank i is
determined by 
    c/pow((double) i, alpha)
where
    c = sum(1.0 / pow((double) i, alpha))

Any comment on that?

Thanks,
Weiguang

 --- Ted.Harding at nessie.mcc.ac.uk wrote: 
On 25-Jan-05 Weiguang Shi wrote:
Hi,

Is there a Zipf-like distribution RNG in R?

Thanks,
Weiguang
"Zipf's Law" (as originally formulated in studies
of
the
frequencies of words in texts) is to the effect
that
the
relative frequencies with which words occur once,
twice,
3 times, ... are in proportion to 1/1, 1/2, 1/3,
...
,
1/n, ... with no limit on n (i.e. the number of
different
words each represented n times is proportional to
1/n).

This is "improper" since sum(1/n) = infinity, so
does not
define a probability distribution. A respectable
analogue
is Fisher's logarithmic distribution p(x) where

  p(x) = ((t^x)/x)/(-log(1-t)), x = 1, 2, 3, ...

where t in (0,1) is the parameter of the
distribution.

As t -> 1, p(x+1)/p(x) -> x/(x+1) as in Zipf's
Law.
However, I've searched the R site and have found
only
one instance of a function directly related to
this
distribution, namely logff() in package VGAM,
which
is for estimating the parameter t.

So it looks as though there is no direct
implementation
of something like "rlogdist".

However, the logarithmic distribution is a
limiting
form of the negative binomial distribution
(conditional
on a positive value), and there are functions in R
for
random sampling from this distribution.

From my reading of "?rnbinom" in the base package,
this
function does not have the flexibitility you would
need
for this. But in the MASS package there is the
function
rnegbin() which does.

You would need to invoke

  rnegbin(N, mu=..., theta=... )

where the value ... of mu is large and the value
...
of
theta is small, and reject cases with value zero.
This
of course makes it awkward to generate a sample of
given
size.

Alternatively, you can try along the lines of

x0<-1:10000; t<-0.999;
p<-((t^x0)/x0)/(-log(1-t))
Y<-sample(x0,5000,replace=TRUE,prob=p)
hist(Y,breaks=100)
While this gives the logarithmic distribution over
the
range of x in x0, it is inexact in that it does
not
permit values greater than max(x0) to be sampled.

No doubt others can suggest something better than
this!

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding)
<Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New
number!]
Date: 25-Jan-05                                   

 Time: 10:20:57
------------------------------ XFMail
------------------------------

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 26-Jan-05                                       Time: 10:46:08
------------------------------ XFMail ------------------------------

Zipf random number generation

Thread (6 messages)