Hello!
I want to plot a P-P plot. So I've implemented this function:
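ppplot <- function(x, dist, ...)
{
  # Despite its name, "pdf" holds the CDF (the p-function), e.g. pnorm for dist = "norm"
  pdf <- get(paste("p", dist, sep = ""), mode = "function")
  x <- sort(x)
  # Theoretical probabilities on the x-axis, empirical probabilities on the y-axis
  plot(pdf(x, ...), ecdf(x)(x))
}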
I have two questions:
1. Is it right to draw the following as a reference line:
xx <- pdf(x, ...)
yy <- ecdf(x)(x)
l <- lm(yy ~ xx)
abline(l)
or is there something better?
2. I found various versions of the P-P plot that, instead of the
"ecdf" function, use ((1:n)-0.5)/n.
After some investigation I found that there are different definitions of the ECDF
(note: "i" is the rank):
* Kaplan-Meier: i/n
* modified Kaplan-Meier: (i-0.5)/n
* Median Rank: (i-0.3)/(n+0.4)
* Herd-Johnson: i/(n+1)
* ...
Furthermore, similar expressions are used by "ppoints".
So,
2.1 For P-P plot, what shall I use?
2.2 In general why should I prefer one kind of CDF over another one?
(Note: this issue might also apply to the Q-Q plot; in fact, qqnorm uses
ppoints instead of ecdf.)
Thank you very much!!
Sincerely,
-- Marco
What ECDF function?
4 messages · Shiazy Fuzzy, Robert A LaBudde
At 12:57 PM 6/9/2007, Marco wrote:
<snip>
2. I found various versions of the P-P plot that, instead of the "ecdf" function, use ((1:n)-0.5)/n.
After some investigation I found that there are different definitions of the ECDF (note: "i" is the rank):
* Kaplan-Meier: i/n
* modified Kaplan-Meier: (i-0.5)/n
* Median Rank: (i-0.3)/(n+0.4)
* Herd-Johnson: i/(n+1)
* ...
Furthermore, similar expressions are used by "ppoints".
So,
2.1 For P-P plot, what shall I use?
2.2 In general why should I prefer one kind of CDF over another one?
<snip>
This is an age-old debate in statistics. There are many different formulas, some of which are optimal for particular distributions.
Using i/n (which I would call the Kolmogorov method), (i-1)/n or i/(n+1) is to be discouraged for general ECDF modeling. These correspond in quality to the rectangular rule method of integration of the bins, and assume only that the underlying density function is piecewise constant. There is no disadvantage to using these methods, however, if the pdf has multiple discontinuities.
I tend to use (i-0.5)/n, which corresponds to integrating with the "midpoint rule", which is a 1-point Gaussian quadrature, and which is exact for linear behavior with derivative continuous. It's simple, it's accurate, and it is near optimal for a wide range of continuous alternatives.
The formula (i- 3/8)/(n + 1/4) is optimal for the normal distribution. However, it is equal to (i-0.5)/n to order 1/n^3, so there is no real benefit to using it. Similarly, there is a formula (i-.44)/(N+.12) for a Gumbel distribution.
If you do know for sure (don't need to test) the form of the distribution, you're better off fitting that distribution function directly and not worrying about the edf.
Also remember that edfs are not very accurate, so the differences between these formulae are difficult to justify in practice.
================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.  URL: http://lcfltd.com/
824 Timberlake Drive  Tel: 757-467-0954
Virginia Beach, VA 23464-3239  Fax: 757-467-2947
"Vere scire est per causas scire"
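For readers who want to see the formulas from the message above side by side, here is a minimal R sketch that tabulates them for a small sample size. The helper name `plotting_positions`, its column names, and the choice n = 10 are assumptions made purely for illustration; they do not come from the thread.

```r
# Hypothetical helper (not from the original messages): tabulates the
# plotting positions mentioned in the thread for ranks i = 1..n.
plotting_positions <- function(n) {
  i <- seq_len(n)
  data.frame(
    i         = i,
    i_over_n  = i / n,                    # value of ecdf() at the sorted data (no ties)
    midpoint  = (i - 0.5) / n,            # the "midpoint rule" positions
    normal    = (i - 3/8) / (n + 1/4),    # formula quoted for the normal distribution
    gumbel    = (i - 0.44) / (n + 0.12),  # formula quoted for the Gumbel distribution
    i_over_n1 = i / (n + 1)
  )
}

pp <- plotting_positions(10)
print(round(pp, 4))

# Largest absolute gap between the midpoint and normal-case positions for this n
print(max(abs(pp$midpoint - pp$normal)))
```

Base R's `ppoints(n, a)` computes (i - a)/(n + 1 - 2a), so `ppoints(10, a = 0.5)` should reproduce the `midpoint` column and `ppoints(10, a = 3/8)` the `normal` column. Rerunning the sketch with a larger n is one way to check how quickly the columns converge.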
On 6/9/07, Robert A LaBudde <ral at lcfltd.com> wrote:
This is an age-old debate in statistics. There are many different formulas, some of which are optimal for particular distributions. Using i/n (which I would call the Kolmogorov method), (i-1)/n or i/(n+1) is to be discouraged for general ECDF modeling. These correspond in quality to the rectangular rule method of integration of the bins, and assume only that the underlying density function is piecewise constant. There is no disadvantage to using these methods, however, if the pdf has multiple discontinuities. I tend to use (i-0.5)/n, which corresponds to integrating with the "midpoint rule", which is a 1-point Gaussian quadrature, and which is exact for linear behavior with derivative continuous. It's simple, it's accurate, and it is near optimal for a wide range of continuous alternatives.
Hmmm I'm a bit confused, but very interested! So you don't use the R "ecdf", do you?
The formula (i- 3/8)/(n + 1/4) is optimal for the normal distribution. However, it is equal to (i-0.5)/n to order 1/n^3, so there is no real benefit to using it. Similarly, there is a formula (i-.44)/(N+.12) for a Gumbel distribution. If you do know for sure (don't need to test) the form of the distribution, you're better off fitting that distribution function directly and not worrying about the edf. Also remember that edfs are not very accurate, so the differences between these formulae are difficult to justify in practice.
I will bear that in mind! My first interpretation was that using something different from i/n (e.g. i/(n+1)) would make differences in the tails easier to identify (maybe...)
Regards,
-- Marco
At 06:36 PM 6/9/2007, Marco wrote:
On 6/9/07, Robert A LaBudde <ral at lcfltd.com> wrote:
At 12:57 PM 6/9/2007, Marco wrote:
<snip>
<snip> Hmmm I'm a bit confused, but very interested! So you don't use the R "ecdf", do you?
Only when an i/n edf is needed (some tests, such as ks.test() are based on this). Also, I frequently do modeling in Excel as well, where you need to enter your own formulas.
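As a small check of the point about ks.test(), the following R sketch compares the i/n values returned by ecdf() with a hand-computed Kolmogorov-Smirnov statistic. The simulated normal sample, the seed, and the variable names are assumptions for illustration only.

```r
set.seed(1)                 # arbitrary seed, used only for reproducibility
x <- sort(rnorm(20))        # simulated data; continuous, so no ties are expected
n <- length(x)
i <- seq_len(n)

# At the sorted data points, ecdf() returns i/n
Fn <- ecdf(x)
print(all.equal(Fn(x), i / n))

# Two-sided KS statistic computed by hand from the i/n step function:
# the largest gap between the hypothesized CDF and the steps i/n and (i-1)/n
F0 <- pnorm(x)
D_manual <- max(pmax(i / n - F0, F0 - (i - 1) / n))

# The same statistic as reported by ks.test()
D_kstest <- unname(ks.test(x, "pnorm")$statistic)
print(c(manual = D_manual, ks.test = D_kstest))
```

The two printed statistics should agree, which is consistent with the remark that tests of this kind are defined in terms of the i/n edf.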
<snip>
Also remember that edfs are not very accurate, so the differences between these formulae are difficult to justify in practice.
I will bear that in mind! My first interpretation was that using something different from i/n (e.g. i/(n+1)) would make differences in the tails easier to identify (maybe...)
The chief advantage to i/(n+1) is that you don't generate 1.0 as an abscissa, as you do with i/n. But the same is true of (i-0.5)/n, and it's more accurate.
Unless you need to do otherwise, just use ecdf(), because it matches the theory for most uses, and it almost always doesn't matter that it's slightly less accurate than other choices.
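To make the comparison concrete, here is a hedged R sketch of a P-P plot function that can use either ecdf() or the (i-0.5)/n positions. The function name `ppplot2`, the `positions` argument, and the axis labels are hypothetical. The dashed identity line is an assumption as well: it is a common reference for P-P plots, since the points should lie near y = x when the assumed distribution fits, but the thread itself does not recommend a particular reference line.

```r
# Hypothetical variant of the ppplot() function from the first message.
# positions = "ecdf" uses i/n via ecdf(); "midpoint" uses (i - 0.5)/n.
ppplot2 <- function(x, dist, ..., positions = c("ecdf", "midpoint")) {
  positions <- match.arg(positions)
  cdf <- get(paste0("p", dist), mode = "function")  # e.g. pnorm for dist = "norm"
  x <- sort(x)
  n <- length(x)
  emp <- if (positions == "ecdf") ecdf(x)(x) else (seq_len(n) - 0.5) / n
  theo <- cdf(x, ...)
  plot(theo, emp, xlim = c(0, 1), ylim = c(0, 1),
       xlab = "Theoretical probability", ylab = "Empirical probability")
  abline(0, 1, lty = 2)  # assumed reference: the identity line y = x
  invisible(data.frame(theoretical = theo, empirical = emp))
}

# Usage with simulated data (seed and sample size are arbitrary)
set.seed(1)
x <- rnorm(25)
res_ecdf <- ppplot2(x, "norm", positions = "ecdf")
res_mid  <- ppplot2(x, "norm", positions = "midpoint")

# With ecdf() the largest empirical value is 1; with (i - 0.5)/n it stays below 1
print(c(ecdf = max(res_ecdf$empirical), midpoint = max(res_mid$empirical)))
```

Because `positions` comes after `...`, it has to be passed by name; any other arguments (for example `mean` and `sd`) are forwarded to the distribution function.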