Skip to content

ks.test()

6 messages · franck allaire, Brian Ripley, Roger Koenker +1 more

#
Dear All

I am trying to replicate a numerical application (not computed on R) from an 
article. Using, ks.test() I computed the exact D value shown in the article 
but the p-values I obtain are quite different from the one shown in the 
article.

The tests are performed on a sample of 37 values (please see "[0] DATA" 
below) for truncated Exponential, Pareto and truncated LogNormal (please see 
"[1] FUNCTIONS" below). You can find below the commands I use to compute the 
tests ("[2] COMMANDS") and the p-values presented in the article.

Can anyone explain the difference between these two set of p-values? Maybe 
the explanation is linked to my second question: Can anyone confirm me that 
p-values calculated in the Kolmogrov-Smirnov g-o-f are independent from the 
family of the distribution assumed in H0?

Anderson-Darling test is also performed in the article. would anyone have 
functions to calculate pvalues for exponential, lognormal, weibull, etc?

Many thanks for your help,

Franck Allaire
R 1.6.2

###############
# [0] DATA # truncation point is 30 i.e. all values are above 30
###############
39.7
582
439.9
47.1
83.9
41.2
893.1
192
106.2
216.7
1243.4
52.8
351.6
36.2
431.5
57.3
1602.1
822.2
260.1
58.7
6299.9
814.9
137.2
203.8
1263.5
53.7
1313
118.4
167.8
70.1
503.7
64.8
529.9
87.8
2465.4
317.9
2753.9

###############
# [1] FUNCTIONS
###############

#EXP
exptfit_function(x,b, plot.it = F, lty = 1){

          if (mode(x) != "numeric")
                stop("need numeric data")
          x <- x[!is.na(x)]
          x <- sort(x)
          lambda <- length(x)/sum(x-b)
          if(plot.it) { lines(x, dexpt(x, lambda = lambda, b = b), lty = 
lty,col="red")}
          result <- list(lambda = lambda, b = b)
          class(result) <- "expt"
          result}

dexpt<-function(x,lambda,b) dexp(x-b,rate=lambda)
pexpt<-function(x,lambda,b) pexp(x-b,rate=lambda)
qexpt<-function(p,lambda,b) qexp(p,rate=lambda)+b
rexpt<-function(n,lambda,b) rexp(n,rate=lambda)+b

#PARETO
paretofit<-function(x,b, plot.it = F, lty = 1){

          x <- sort(x)
          alpha <- length(x)/sum(log(x/b))
          if(plot.it) { lines(x, dpareto(x, alpha = alpha, b = b), lty = 
lty,col="red")}
          result <- list(alpha = alpha, b = b)
          class(result) <- "pareto"
          result}

dpareto<-function(x, alpha,b) ifelse(x<b,0,alpha*(b^alpha)/(x^(alpha+1)))
ppareto<-function(x, alpha,b) ifelse(x<b,0,1-(b/x)^alpha)
qpareto<-function(p, alpha,b) b*exp(-log(1-p)/alpha)
rpareto<-function(n, alpha,b) qpareto(runif(n),alpha=alpha,b=b)

#LN
lnormtfit_function(x,b, plot.it = F, lty = 1){

          if (mode(x) != "numeric")
                stop("need numeric data")
          x <- x[!is.na(x)]
          x <- sort(x)
          y <- log(x-b)
          n <- length(y)
          mu <- mean(y)
          sigma <- sd(y)
          if(plot.it) { lines(x, dlnormt(x, meanlog=mu,sdlog=sigma,b=b), lty 
= lty,col="red")}
          result <- list(mu=mu,sigma=sigma,b=b)
          class(result) <- "lnormt"
          result}

dlnormt<-function(x,mu,sigma,b) dlnorm(x-b,meanlog=mu,sdlog=sigma)
plnormt<-function(x,mu,sigma,b) plnorm(x-b,meanlog=mu,sdlog=sigma)
qlnormt<-function(p,mu,sigma,b) qlnorm(p,meanlog=mu,sdlog=sigma)+b
rlnormt<-function(n,mu,sigma,b) rlnorm(n,meanlog=mu,sdlog=sigma)+b

###############
# [2] COMMANDS
###############

# EXP
tmp <- exptfit(data,b=30)
ks.test(data,"pexpt",lambda=tmp$lambda,b=tmp$b)
# Article: D= 0.2599 pvalue<<0.005

# PARETO
tmp <- paretofit(data,b=30)
ks.test(data, "ppareto", alpha=tmp$alpha,b=tmp$b)
# Article: D=0.14586 pvalue=0.16

# LN
tmp <- lnormtfit(data,b=30)
ks.test(data,"plnormt",mu=tmp$mu,sigma=tmp$sigma,b=tmp$b)
# Article: D=0.07939 pvalue>>0.15
#
You appear to be applying the KS test after estimating parameters.  The 
distribution theory is for an iid sample from a known continuous 
distribution (and does not otherwise depend on the distribution).  Since 
your H_0 is not pre-specified, that distribution theory is not correct.
(Some corrections have been worked out for say ML fitting of exponential 
and normal distributions -- by Michael Stephens as I recall.)

Also, your `truncated LogNormal' does not appear to be truncated, rather
to be shifted.  That's the same thing for an exponential (for a positive
shift), but not for any other distribution.
On Thu, 28 Aug 2003, franck allaire wrote:

            
And what is the H_0 and H_1 used in the article?

  
    
#
On Thu, 28 Aug 2003, Prof Brian Ripley wrote:

            
Just to amplify this comment a bit, I'm a little worried that the
current documentation of of ks.test may make it appear that estimated
parameters are ok, or that somehow the p-values computed are
"corrected" in some way for their existence -- which I very much
doubt.  The standard reference on this sort of thing was Durbin's (1973)
SIAM monograph.  There is a very nice approach due to Khmaladze (1981)
based on the Doob-Meyer decomposition - this is the closest thing
that I'm aware of for handling KS type tests with estimated parameters
in a general context.

url:	www.econ.uiuc.edu/~roger/my.html	Roger Koenker
email	rkoenker at uiuc.edu			Department of Economics
vox: 	217-333-4558				University of Illinois
fax:   	217-244-6678				Champaign, IL 61820
1 day later
#
On 28 Aug 2003 at 8:06, Roger Koenker wrote:
But is it worth it to modify Kolmogorov-Smirnof fot estimated 
parameters? It has very low power anyhow. If the null hypothesis is 
"exponential distributio" (which is a scale family), what about using 
the quantile transformation twice

new <- qnorm(pexp(old))

to transform from exponential to normal distribution and the applying
shapiro.test
?

Kjetil Halvorsen
#
On Fri, 29 Aug 2003, kjetil brinchmann halvorsen wrote:

            
And if old is exponential, but not standard exponential?

url:	www.econ.uiuc.edu/~roger/my.html	Roger Koenker
email	rkoenker at uiuc.edu			Department of Economics
vox: 	217-333-4558				University of Illinois
fax:   	217-244-6678				Champaign, IL 61820
#
On 29 Aug 2003 at 20:19, Roger Koenker wrote:

            
But the shapiro-wilk test is scale invariant, and the exponential 
family is a scale family, so we can standardize old first without 
destroying the validity of the shapiro-wilk test?

Kjetil Halvorsen