KS test from ctest package

4 messages · Brian Ripley, David Middleton, Kurt Hornik

#
This question is mainly aimed at Kurt Hornik as author of the ctest package,
but I'm cc'ing it to r-help as I suspect there will be other valuable
opinions out there.

I have been attempting 2-sample Kolmogorov-Smirnov tests using the ks.test
function from the ctest package (ctest v.0.9-15, R v.0.63.3 win32).  I am
comparing fish length-frequency distributions.  My main reference for the KS
test at present is Sokal & Rohlf, Biometry (2nd edn), pages 440-445.

The individuals in my samples are measured to the nearest 0.5cm, and so in
most samples there are several identical length values.  It appears that the
KS test statistic D is being overestimated (and the p-value therefore
underestimated).  I think this is best illustrated by a trivial (but extreme)
example:

	> library(ctest)
	> x <- y <- rep(1,10)
	> ks.test(x,y)

	         Two-sample Kolmogorov-Smirnov test 

	data:  x and y 
	D = 1, p-value = 9.08e-005 
	alternative hypothesis: two.sided 

Obviously, when two identical vectors are compared, the test statistic D
should be zero and the resulting p-value should be 1.

If D is calculated using the first method outlined by Sokal & Rohlf (the
maximum absolute difference between relative cumulative frequencies) then D
is indeed 0.  The method used in the ctest code is presented by Sokal & Rohlf
as an alternate (NB not approximate) computation scheme and attributed to
Gideon & Mueller (1978).  The pertinent code is the line:

        z <- ifelse(order(c(x, y)) <= n.x, 1/n.x, -1/n.y)

If the two vectors in the example above had been identical, but with no
repeated values, the result of order(c(x, y)) would have been along the
lines of

 [1]  1 11  2 12  3 13  4 14  5 15  6 16  7 17  8 18  9 19 10 20

(the essential point being that items in the result come alternately from
x and y).  D is calculated as max(abs(cumsum(z))), with the result that the
minimum D for identical vectors is min(1/n.x, 1/n.y).  (It therefore appears
to me that this computational method should be considered an approximate
rather than an alternative method.)
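This is easy to verify at the prompt.  A minimal sketch (variable names as in
the ctest code; the data are mine) showing that even with two identical
vectors of distinct values the cumulative sum oscillates between 1/n and 0,
so max(abs(cumsum(z))) can never fall below 1/n:

```r
# Two identical samples of 10 distinct values each.
x <- y <- 1:10
n.x <- length(x); n.y <- length(y)
# The ctest computation: +1/n.x for each element drawn from x,
# -1/n.y for each element drawn from y, in pooled sorted order.
z <- ifelse(order(c(x, y)) <= n.x, 1/n.x, -1/n.y)
max(abs(cumsum(z)))   # 0.1, i.e. 1/n.x, even though the samples are identical
```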

In the case of vectors with replicated values the problem is worse, because
values from one vector are grouped together in the vector returned by order.
For the example above, order(c(x, y)) returns:

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

I don't think this can be considered a bug, but it is certainly a problem
for the method used in computing D.  Has anyone coded alternative KS test
computation methods in R/S?  It's obviously not hard, but could be slow unless
done elegantly!
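For what it's worth, here is a minimal sketch (function name ks.D is my own;
this is not the ctest code) of the first Sokal & Rohlf method: evaluate both
empirical CDFs at every distinct pooled value and take the largest absolute
difference.  Ties then no longer inflate D:

```r
# Maximum absolute difference between the two empirical CDFs,
# computed once per distinct value in the pooled sample.
ks.D <- function(x, y) {
  v  <- sort(unique(c(x, y)))                 # distinct measurement values
  Fx <- sapply(v, function(t) mean(x <= t))   # ecdf of x at each value
  Fy <- sapply(v, function(t) mean(y <= t))   # ecdf of y at each value
  max(abs(Fx - Fy))
}
ks.D(rep(1, 10), rep(1, 10))   # 0, as expected for identical samples
```

The sapply over distinct values is O(n^2) in the worst case, so a cleverer
merge-based version would be needed for speed, but it illustrates the idea.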

Thanks

David Middleton



-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
On Thu, 8 Apr 1999, David Middleton wrote:

If the data are discretized, the KS test does not have the standard
(distribution-free) distribution.  `Distribution-free' here means independent
samples from a continuous distribution.  So the KS test is not, IMHO,
appropriate for your problem.  My view is that the function should warn you
off, and not give a p-value if it finds ties.  It might be good to construct
the exact statistic, though.
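[The warn-on-ties check suggested here only needs a duplicate test on the
pooled sample.  A hypothetical sketch (function name mine, not the eventual
ctest code):]

```r
# A tie anywhere in the pooled sample invalidates the standard
# distribution-free null distribution of D, so warn rather than
# report a p-value.
has.ties <- function(x, y) any(duplicated(c(x, y)))
has.ties(rep(1, 10), rep(1, 10))   # TRUE
```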
#
Brian

Many thanks for the rapid response.  Here are the inevitable follow-up
questions!

I am aware that the KS test assumes samples from a continuous distribution.
Fish length obviously is a continuous variable, though it is apparent that
measuring to the nearest 0.5cm (or 1cm in some cases) does introduce a
certain discretization.  In the case I'm considering lengths to the nearest
0.5cm are the highest precision available.  I wonder, therefore, whether there
are guidelines regarding the precision required before a continuous variable
yields continuous measurements?  Possibly some criterion based on the ratio
of precision to range?  In this case the fact that there are repeated values
for the length measurements suggests there has been inherent creation of size
classes.

Sokal and Rohlf do give an approximate KS 2-sample test for large sample
sizes.  Again D is the maximum absolute difference between cumulative
relative frequencies but the difference is only calculated once per
measurement class, rather than for each individual measurement.  Their example
has sample sizes of 400-500.  Is there any published guidance for the sample
size that is considered "large enough"?
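For reference, the large-sample p-value in question comes from Smirnov's
asymptotic series.  A hedged sketch (function name ks.p.approx is mine; the
series is truncated rather than summed to infinity):

```r
# Asymptotic two-sample KS p-value: P(D > d) ~ 2 * sum over k of
# (-1)^(k-1) * exp(-2 * k^2 * n * d^2), with n the effective sample size.
ks.p.approx <- function(D, n.x, n.y) {
  n <- n.x * n.y / (n.x + n.y)   # effective sample size
  k <- 1:100                     # truncate the infinite series
  2 * sum((-1)^(k - 1) * exp(-2 * k^2 * n * D^2))
}
ks.p.approx(1, 10, 10)   # 9.08e-05, matching the ks.test output above
```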

I hope that these questions are not too general for the R list - unfortunately
my access to statistical publications is somewhat limited at present.  I do
note with satisfaction that it will be relatively easy to code the approximate
test in R.

Thanks

David Middleton, dajm at deeq.demon.co.uk
Falkland Islands Fisheries Department


1 day later
#
The new version of ctest tries to be more intelligent about ties.  It
gives a warning and hopefully also the right differences between the two
ecdfs.

Thanks,
-k