P-values Kolmogorov–Smirnov test

Boo G.

Thu, Sep 5, 2019 11:42 AM

Thanks for your reply, Rui. 

I don?t think that I can use directly the ks.test because I have a weighted sample (see  m_2 <-m_1[(sample(nrow(m_1), size=i, prob=p_1, replace=F)),]) and I want to account for that. That?s why I am trying to compute everything manually. 

Also, if you look at the results of the ks.test in your simulation, you will notice that the p-value always implies that the sample is always (even with same size = 1) drawn form the same distribution. This looks suspicious to me.

What are your thoughts?

On 5 Sep 2019, at 20:29, Rui Barradas <ruipbarradas at sapo.pt> wrote:

Hello,

I don't have the algorithms at hand but the KS statistic calculation is more complicated than your max/abs difference.

Anyway, why not use ks.test? it's not that difficult:


set.seed(1234)
#reference distribution
d_1 <- sort(rpois(1000, 500))
p_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, p_1)

#data frame to store the values of the simulation
d_stat <- data.frame(1:1000, NA, NA)
names(d_stat) <- c("sample_size", "ks_distance", "p_value")

#simulation
for (i in 1:1000) {
 #sample from the reference distribution
 m_2 <-m_1[(sample(nrow(m_1), size=i, prob=p_1, replace=F)),]
 d_2 <- m_2$d_1

 ht <- ks.test(d_1, d_2)
 #kolmogorov-smirnov distance
 d_stat[i, 2] <- ht$statistic
 d_stat[i, 3] <- ht$p.value
}

hist(d_stat[, 2])
hist(d_stat[, 3])


Note that d_2 is not sorted, but the results are equal in the sense of function identical(), meaning they are *exactly* the same. Why shouldn't they?

Hope this helps,

Rui Barradas


?s 17:06 de 05/09/19, Boo G. escreveu:

Hello,
I am trying to perform a Kolmogorov?Smirnov test to assess the difference between a distribution and samples drawn proportionally to size of different sizes. I managed to compute the Kolmogorov?Smirnov distance but I am lost with the p-value. I have looked into the ks.test function unsuccessfully. Can anyone help me with computing p-values for a two-tailed test?
Below a simplified version of my code.
Thanks in advance.
Gianluca
library(spatstat)
#reference distribution
d_1 <- sort(rpois(1000, 500))
p_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, p_1)
#data frame to store the values of the siumation
d_stat <- data.frame(1:1000, NA, NA)
names(d_stat) <- c("sample_size", "ks_distance", "p_value")
#simulation
for (i in 1:1000) {
  #sample from the reference distribution
  m_2 <-m_1[(sample(nrow(m_1), size=i, prob=p_1, replace=F)),]
  m_2 <-m_2[order(m_2$d_1),]
  d_2 <- m_2$d_1
  p_2 <- m_2$p_1
  #weighted ecdf for the reference distribution and the sample
  f_d_1 <- ewcdf(d_1, normalise=F)
  f_d_2 <- ewcdf(d_2, 1/p_2, normalise=F, adjust=1/length(d_2))
  #kolmogorov-smirnov distance
  d_stat[i,2] <- max(abs(f_d_1(d_2) - f_d_2(d_2)))
}
	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&amp;data=01%7C01%7Cgianluca.boo%40soton.ac.uk%7C0c709068527c41e062dd08d7322f0d72%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&amp;sdata=y9jfixyNiroKwKZEJj0owuCcWoeFQKZdaG9WLe2xHQ8%3D&amp;reserved=0
PLEASE do read the posting guide https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.R-project.org%2Fposting-guide.html&amp;data=01%7C01%7Cgianluca.boo%40soton.ac.uk%7C0c709068527c41e062dd08d7322f0d72%7C4a5378f929f44d3ebe89669d03ada9d8%7C0&amp;sdata=7n0doy4P1S1TpApX1zpUborAnUnxuOxYtn%2FQ%2BtVztGM%3D&amp;reserved=0
and provide commented, minimal, self-contained, reproducible code.

P-values Kolmogorov–Smirnov test

Thread (8 messages)