Dear All
I am trying to replicate a numerical application (not computed on R) from an
article. Using, ks.test() I computed the exact D value shown in the article
but the p-values I obtain are quite different from the one shown in the
article.
The tests are performed on a sample of 37 values (please see "[0] DATA"
below) for truncated Exponential, Pareto and truncated LogNormal (please see
"[1] FUNCTIONS" below). You can find below the commands I use to compute the
tests ("[2] COMMANDS") and the p-values presented in the article.
Can anyone explain the difference between these two set of p-values? Maybe
the explanation is linked to my second question: Can anyone confirm me that
p-values calculated in the Kolmogrov-Smirnov g-o-f are independent from the
family of the distribution assumed in H0?
Anderson-Darling test is also performed in the article. would anyone have
functions to calculate pvalues for exponential, lognormal, weibull, etc?
Many thanks for your help,
Franck Allaire
R 1.6.2
###############
# [0] DATA # truncation point is 30 i.e. all values are above 30
###############
39.7
582
439.9
47.1
83.9
41.2
893.1
192
106.2
216.7
1243.4
52.8
351.6
36.2
431.5
57.3
1602.1
822.2
260.1
58.7
6299.9
814.9
137.2
203.8
1263.5
53.7
1313
118.4
167.8
70.1
503.7
64.8
529.9
87.8
2465.4
317.9
2753.9
###############
# [1] FUNCTIONS
###############
#EXP
exptfit_function(x,b, plot.it = F, lty = 1){
if (mode(x) != "numeric")
stop("need numeric data")
x <- x[!is.na(x)]
x <- sort(x)
lambda <- length(x)/sum(x-b)
if(plot.it) { lines(x, dexpt(x, lambda = lambda, b = b), lty =
lty,col="red")}
result <- list(lambda = lambda, b = b)
class(result) <- "expt"
result}
dexpt<-function(x,lambda,b) dexp(x-b,rate=lambda)
pexpt<-function(x,lambda,b) pexp(x-b,rate=lambda)
qexpt<-function(p,lambda,b) qexp(p,rate=lambda)+b
rexpt<-function(n,lambda,b) rexp(n,rate=lambda)+b
#PARETO
paretofit<-function(x,b, plot.it = F, lty = 1){
x <- sort(x)
alpha <- length(x)/sum(log(x/b))
if(plot.it) { lines(x, dpareto(x, alpha = alpha, b = b), lty =
lty,col="red")}
result <- list(alpha = alpha, b = b)
class(result) <- "pareto"
result}
dpareto<-function(x, alpha,b) ifelse(x<b,0,alpha*(b^alpha)/(x^(alpha+1)))
ppareto<-function(x, alpha,b) ifelse(x<b,0,1-(b/x)^alpha)
qpareto<-function(p, alpha,b) b*exp(-log(1-p)/alpha)
rpareto<-function(n, alpha,b) qpareto(runif(n),alpha=alpha,b=b)
#LN
lnormtfit_function(x,b, plot.it = F, lty = 1){
if (mode(x) != "numeric")
stop("need numeric data")
x <- x[!is.na(x)]
x <- sort(x)
y <- log(x-b)
n <- length(y)
mu <- mean(y)
sigma <- sd(y)
if(plot.it) { lines(x, dlnormt(x, meanlog=mu,sdlog=sigma,b=b), lty
= lty,col="red")}
result <- list(mu=mu,sigma=sigma,b=b)
class(result) <- "lnormt"
result}
dlnormt<-function(x,mu,sigma,b) dlnorm(x-b,meanlog=mu,sdlog=sigma)
plnormt<-function(x,mu,sigma,b) plnorm(x-b,meanlog=mu,sdlog=sigma)
qlnormt<-function(p,mu,sigma,b) qlnorm(p,meanlog=mu,sdlog=sigma)+b
rlnormt<-function(n,mu,sigma,b) rlnorm(n,meanlog=mu,sdlog=sigma)+b
###############
# [2] COMMANDS
###############
# EXP
tmp <- exptfit(data,b=30)
ks.test(data,"pexpt",lambda=tmp$lambda,b=tmp$b)
# Article: D= 0.2599 pvalue<<0.005
# PARETO
tmp <- paretofit(data,b=30)
ks.test(data, "ppareto", alpha=tmp$alpha,b=tmp$b)
# Article: D=0.14586 pvalue=0.16
# LN
tmp <- lnormtfit(data,b=30)
ks.test(data,"plnormt",mu=tmp$mu,sigma=tmp$sigma,b=tmp$b)
# Article: D=0.07939 pvalue>>0.15
ks.test()
6 messages · franck allaire, Brian Ripley, Roger Koenker +1 more
You appear to be applying the KS test after estimating parameters. The distribution theory is for an iid sample from a known continuous distribution (and does not otherwise depend on the distribution). Since your H_0 is not pre-specified, that distribution theory is not correct. (Some corrections have been worked out for say ML fitting of exponential and normal distributions -- by Michael Stephens as I recall.) Also, your `truncated LogNormal' does not appear to be truncated, rather to be shifted. That's the same thing for an exponential (for a positive shift), but not for any other distribution.
On Thu, 28 Aug 2003, franck allaire wrote:
Dear All I am trying to replicate a numerical application (not computed on R) from an article. Using, ks.test() I computed the exact D value shown in the article but the p-values I obtain are quite different from the one shown in the article.
And what is the H_0 and H_1 used in the article?
The tests are performed on a sample of 37 values (please see "[0] DATA"
below) for truncated Exponential, Pareto and truncated LogNormal (please see
"[1] FUNCTIONS" below). You can find below the commands I use to compute the
tests ("[2] COMMANDS") and the p-values presented in the article.
Can anyone explain the difference between these two set of p-values? Maybe
the explanation is linked to my second question: Can anyone confirm me that
p-values calculated in the Kolmogrov-Smirnov g-o-f are independent from the
family of the distribution assumed in H0?
Anderson-Darling test is also performed in the article. would anyone have
functions to calculate pvalues for exponential, lognormal, weibull, etc?
Many thanks for your help,
Franck Allaire
R 1.6.2
###############
# [0] DATA # truncation point is 30 i.e. all values are above 30
###############
39.7
582
439.9
47.1
83.9
41.2
893.1
192
106.2
216.7
1243.4
52.8
351.6
36.2
431.5
57.3
1602.1
822.2
260.1
58.7
6299.9
814.9
137.2
203.8
1263.5
53.7
1313
118.4
167.8
70.1
503.7
64.8
529.9
87.8
2465.4
317.9
2753.9
###############
# [1] FUNCTIONS
###############
#EXP
exptfit_function(x,b, plot.it = F, lty = 1){
if (mode(x) != "numeric")
stop("need numeric data")
x <- x[!is.na(x)]
x <- sort(x)
lambda <- length(x)/sum(x-b)
if(plot.it) { lines(x, dexpt(x, lambda = lambda, b = b), lty =
lty,col="red")}
result <- list(lambda = lambda, b = b)
class(result) <- "expt"
result}
dexpt<-function(x,lambda,b) dexp(x-b,rate=lambda)
pexpt<-function(x,lambda,b) pexp(x-b,rate=lambda)
qexpt<-function(p,lambda,b) qexp(p,rate=lambda)+b
rexpt<-function(n,lambda,b) rexp(n,rate=lambda)+b
#PARETO
paretofit<-function(x,b, plot.it = F, lty = 1){
x <- sort(x)
alpha <- length(x)/sum(log(x/b))
if(plot.it) { lines(x, dpareto(x, alpha = alpha, b = b), lty =
lty,col="red")}
result <- list(alpha = alpha, b = b)
class(result) <- "pareto"
result}
dpareto<-function(x, alpha,b) ifelse(x<b,0,alpha*(b^alpha)/(x^(alpha+1)))
ppareto<-function(x, alpha,b) ifelse(x<b,0,1-(b/x)^alpha)
qpareto<-function(p, alpha,b) b*exp(-log(1-p)/alpha)
rpareto<-function(n, alpha,b) qpareto(runif(n),alpha=alpha,b=b)
#LN
lnormtfit_function(x,b, plot.it = F, lty = 1){
if (mode(x) != "numeric")
stop("need numeric data")
x <- x[!is.na(x)]
x <- sort(x)
y <- log(x-b)
n <- length(y)
mu <- mean(y)
sigma <- sd(y)
if(plot.it) { lines(x, dlnormt(x, meanlog=mu,sdlog=sigma,b=b), lty
= lty,col="red")}
result <- list(mu=mu,sigma=sigma,b=b)
class(result) <- "lnormt"
result}
dlnormt<-function(x,mu,sigma,b) dlnorm(x-b,meanlog=mu,sdlog=sigma)
plnormt<-function(x,mu,sigma,b) plnorm(x-b,meanlog=mu,sdlog=sigma)
qlnormt<-function(p,mu,sigma,b) qlnorm(p,meanlog=mu,sdlog=sigma)+b
rlnormt<-function(n,mu,sigma,b) rlnorm(n,meanlog=mu,sdlog=sigma)+b
###############
# [2] COMMANDS
###############
# EXP
tmp <- exptfit(data,b=30)
ks.test(data,"pexpt",lambda=tmp$lambda,b=tmp$b)
# Article: D= 0.2599 pvalue<<0.005
# PARETO
tmp <- paretofit(data,b=30)
ks.test(data, "ppareto", alpha=tmp$alpha,b=tmp$b)
# Article: D=0.14586 pvalue=0.16
# LN
tmp <- lnormtfit(data,b=30)
ks.test(data,"plnormt",mu=tmp$mu,sigma=tmp$sigma,b=tmp$b)
# Article: D=0.07939 pvalue>>0.15
______________________________________________ R-help at stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
On Thu, 28 Aug 2003, Prof Brian Ripley wrote:
You appear to be applying the KS test after estimating parameters. The distribution theory is for an iid sample from a known continuous distribution (and does not otherwise depend on the distribution). Since your H_0 is not pre-specified, that distribution theory is not correct. (Some corrections have been worked out for say ML fitting of exponential and normal distributions -- by Michael Stephens as I recall.)
Just to amplify this comment a bit, I'm a little worried that the current documentation of of ks.test may make it appear that estimated parameters are ok, or that somehow the p-values computed are "corrected" in some way for their existence -- which I very much doubt. The standard reference on this sort of thing was Durbin's (1973) SIAM monograph. There is a very nice approach due to Khmaladze (1981) based on the Doob-Meyer decomposition - this is the closest thing that I'm aware of for handling KS type tests with estimated parameters in a general context. url: www.econ.uiuc.edu/~roger/my.html Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820
1 day later
On 28 Aug 2003 at 8:06, Roger Koenker wrote:
But is it worth it to modify Kolmogorov-Smirnof fot estimated parameters? It has very low power anyhow. If the null hypothesis is "exponential distributio" (which is a scale family), what about using the quantile transformation twice new <- qnorm(pexp(old)) to transform from exponential to normal distribution and the applying shapiro.test ? Kjetil Halvorsen
On Thu, 28 Aug 2003, Prof Brian Ripley wrote:
You appear to be applying the KS test after estimating parameters. The distribution theory is for an iid sample from a known continuous distribution (and does not otherwise depend on the distribution). Since your H_0 is not pre-specified, that distribution theory is not correct. (Some corrections have been worked out for say ML fitting of exponential and normal distributions -- by Michael Stephens as I recall.)
Just to amplify this comment a bit, I'm a little worried that the current documentation of of ks.test may make it appear that estimated parameters are ok, or that somehow the p-values computed are "corrected" in some way for their existence -- which I very much doubt. The standard reference on this sort of thing was Durbin's (1973) SIAM monograph. There is a very nice approach due to Khmaladze (1981) based on the Doob-Meyer decomposition - this is the closest thing that I'm aware of for handling KS type tests with estimated parameters in a general context. url: www.econ.uiuc.edu/~roger/my.html Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820
______________________________________________ R-help at stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
On Fri, 29 Aug 2003, kjetil brinchmann halvorsen wrote:
But is it worth it to modify Kolmogorov-Smirnof fot estimated parameters? It has very low power anyhow. If the null hypothesis is "exponential distributio" (which is a scale family), what about using the quantile transformation twice new <- qnorm(pexp(old))
And if old is exponential, but not standard exponential? url: www.econ.uiuc.edu/~roger/my.html Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820
On 29 Aug 2003 at 20:19, Roger Koenker wrote:
On Fri, 29 Aug 2003, kjetil brinchmann halvorsen wrote:
But is it worth it to modify Kolmogorov-Smirnof fot estimated parameters? It has very low power anyhow. If the null hypothesis is "exponential distributio" (which is a scale family), what about using the quantile transformation twice new <- qnorm(pexp(old))
And if old is exponential, but not standard exponential?
But the shapiro-wilk test is scale invariant, and the exponential family is a scale family, so we can standardize old first without destroying the validity of the shapiro-wilk test? Kjetil Halvorsen
url: www.econ.uiuc.edu/~roger/my.html Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820