R function for percentrank

Wed, Dec 5, 2007 10:43 AM

On Wed, 2007-12-05 at 18:42 +0100, Martin Maechler wrote:

I'm coming late to this, but this *does* need a correction
just for the archives !

"MS" == Marc Schwartz <marc_schwartz at comcast.net>
    on Sat, 01 Dec 2007 13:33:21 -0600 writes:

    MS> On Sat, 2007-12-01 at 18:40 +0000, David Winsemius wrote:

    >> David Winsemius <dwinsemius at comcast.net> wrote in 
    >> news:Xns99F989B3A3057dNOTwinscomcast at 80.91.229.13:
    >>

    >> > "tom soyer" <tom.soyer at gmail.com> wrote in
    >> > news:65cc7bdf0712010951p451a993i70da89f285d801de at mail.gmail.com: 
    >> >

    >> >> John,
    >> >> 
    >> >> The Excel's percentrank function works like this: if one has a number,
    >> >> x for example, and one wants to know the percentile of this number in
    >> >> a given data set, dataset, one would type =percentrank(dataset,x) in
    >> >> Excel to calculate the percentile. So for example, if the data set is
    >> >> c(1:10), and one wants to know the percentile of 2.5 in the data set,
    >> >> then using the percentrank function one would get 0.166, i.e., 2.5 is
    >> >> in the 16.6th percentile. 
    >> >> 
    >> >> I am not sure how to program this function in R. I couldn't find it as
    >> >> a built-in function in R either. It seems to be an obvious choice for
    >> >> a built-in function. I am very surprised, but maybe we both missed it.

    >> >  
    >> > My nomination for a function with a similar result would be ecdf(), the 
    >> > empirical cumulative distribution function. It is of class "function"

    >> so

    >> > efforts to index ecdf(.)[.] failed for me.

I think you did not understand ecdf() !!!
It *returns* a function,
that you can then apply to old (or new) data; see below

    MS> You can use ls.str() to look into the function environment:

    >> ls.str(environment(ecdf(x)))

    MS> f :  num 0
    MS> method :  int 2
    MS> n :  int 25
    MS> x :  num [1:25] -2.215 -1.989 -0.836 -0.820 -0.626 ...
    MS> y :  num [1:25] 0.04 0.08 0.12 0.16 0.2 0.24 0.28 0.32 0.36 0.4 ...
    MS> yleft :  num 0
    MS> yright :  num 1



    MS> You can then use get() or mget() within the function environment to
    MS> return the requisite values. Something along the lines of the following
    MS> within the function percentrank():

    MS> percentrank <- function(x, val)
    MS> {
    MS> env.x <- environment(ecdf(x))
    MS> res <- mget(c("x", "y"), env.x)
    MS> Ind <- which(sapply(seq(length(res$x)),
    MS> function(i) isTRUE(all.equal(res$x[i], val))))
    MS> res$y[Ind]
    MS> }

sorry Marc, but "Yuck !!"

- this  percentrank() only works when you apply it to original x[i] values
- only works for 'val' of length 1
- is a complicated hack

and absolutely unneeded  (see below)

    MS> Thus:

    MS> set.seed(1)
    MS> x <- rnorm(25)

    >> x

    MS> [1] -0.62645381  0.18364332 -0.83562861  1.59528080  0.32950777
    MS> [6] -0.82046838  0.48742905  0.73832471  0.57578135 -0.30538839
    MS> [11]  1.51178117  0.38984324 -0.62124058 -2.21469989  1.12493092
    MS> [16] -0.04493361 -0.01619026  0.94383621  0.82122120  0.59390132
    MS> [21]  0.91897737  0.78213630  0.07456498 -1.98935170  0.61982575

    >> percentrank(x, 0.48742905)

    MS> [1] 0.56

[gives 0.52 in my version of R ]

Well, that is *THE SAME*  as using  ecdf() the way you 
should have used it :

  ecdf(x)(0.48742905)

{in two lines, that is

  mypercR <- ecdf(x)
  mypercR(0.48742905)

 which maybe easier to understand, if you have never used the
 nice concept that underlies all of

 approxfun(), splinefun() or ecdf()
}

You can also use

  ecdf(x)(x)

and indeed check that it is identical to the convoluted
percentrank() function above :

ecdf(x)(0.48742905)

[1] 0.52

ecdf(x)(x)

 [1] 0.20 0.44 0.12 1.00 0.48 0.16 0.56 0.72 0.60 0.28 0.96 0.52 0.24 0.04 0.92
[16] 0.32 0.36 0.88 0.80 0.64 0.84 0.76 0.40 0.08 0.68

all(ecdf(x)(x) == sapply(x, function(v) percentrank(x,v)))

[1] TRUE


Regards (and apologies for my apparent indignation ;-)
by the author of ecdf() ,

Martin Maechler, ETH Zurich

Martin,

Thanks for the corrections. In hindsight, now seeing the intended use of
ecdf() in the fashion you describe above, it is now clear that my
approach in response to David's query was un-needed and "over the top".
"Yuck" is quite appropriate... :-)

As I was going through this "exercise", it did seem overly complicated,
given R's usual elegant philosophy about such things. I suppose if I had
looked at the source for plot.stepfun(), it would have been more evident
as to how the y values are acquired.

In reviewing the examples in ?ecdf, I think that an example using
something along the lines of the discussion here more explicitly, would
be helpful. It is not crystal clear from the examples, that one can use
ecdf() in this fashion, though the use of "12 * Fn(tt)" hints at it.

Perhaps:

##-- Simple didactical  ecdf  example:
x <- rnorm(12)
Fn <- ecdf(x)
Fn
Fn(x)  # returns the percentiles for x
...


Thanks again Martin and no offense taken...  :-)

Regards,

Marc

R function for percentrank

Thread (13 messages)