ks.test - continuous vs discrete
I frequently want to test for differences between animal size frequency distributions. The obvious test (I think) to use is the Kolmogorov-Smirnov two sample test (provided in R as the function ks.test in package ctest).
"Obvious" depends on the problem you want to test. KS tests the hypothesis
    H_0: F(z) = G(z) for all z   vs.   H_1: F(z) != G(z) for at least one z
and ks.test assumes that both F and G are continuous distributions. However, if you want to test
    H_0: F(z) = G(z)   vs.   H_1: F(z) = G(z - delta), delta != 0
as "test for differences" indicates, the Wilcoxon rank sum test is "obvious". Or, more generally, if your hypothesis is "exchangeability", a permutation test can be used.
The KS test is for continuous variables and this obviously includes length, weight etc. However, limitations in measuring (e.g. length to the nearest cm/mm, weight to the nearest g/mg etc.) have the obvious effect of "discretising" real data.
Or maybe the underlying distribution is discrete? Anyway: ks.test and wilcox.test in ctest assume data from continuous distributions, and the normal approximation is used if ties occur. For the Wilcoxon and permutation tests, the conditional distribution (that is, conditional on the ties) can be computed using the exactRankTests package.
The ks.test function checks for the presence of ties noting in the help page that "continuous distributions do not generate them". Given the problem of "measuring to the nearest..." noted above I frequently find that my data has ties and ks.test generates a warning. I was interested to note that the example of a two-sample KS test given in Sokal & Rohlf's "Biometry" (I have the 2nd edition where the example is on p.441) has exactly the same problem:
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)
For your example:
R> library(exactRankTests)
R> wilcox.exact(B, A)
Exact Wilcoxon rank sum test
data: B and A
W = 36.5, p-value = 0.02039
alternative hypothesis: true mu is not equal to 0
R> perm.test(B, A)
2-sample Permutation Test
data: B and A
T = 1118, p-value = 0.01864
alternative hypothesis: true mu is not equal to 0
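For readers without exactRankTests, the two statistics above can be reproduced with base R only. This is a minimal sketch, not the package's algorithm: W is computed from midranks, and the reported T = 1118 coincides with sum(B). The p-value here is a Monte Carlo approximation to a permutation p-value using distance from the permutation mean as the two-sided criterion; the seed, the number of resamples and that two-sided convention are my assumptions, so it need not agree with the exact figure digit for digit.

```r
A <- c(104, 109, 112, 114, 116, 118, 118, 117, 121, 123, 125, 126, 126, 128, 128, 128)
B <- c(100, 105, 107, 107, 108, 111, 116, 120, 121, 123)

# Wilcoxon rank sum statistic for B, using midranks for tied values
# (rank() averages tied ranks by default).
n_b <- length(B)
r   <- rank(c(B, A))
W   <- sum(r[seq_len(n_b)]) - n_b * (n_b + 1) / 2

# Permutation statistic: the sum of the observations labelled "B".
t_obs  <- sum(B)
pooled <- c(A, B)

# Monte Carlo permutation distribution: reassign the "B" label at random.
# Ties need no special handling because we condition on the observed values.
set.seed(1)        # assumption: arbitrary seed, for reproducibility only
n_perm <- 10000    # assumption: number of random relabellings
t_perm <- replicate(n_perm, sum(sample(pooled, n_b)))

# Two-sided p-value: how often a relabelled sum is at least as far from
# its permutation mean as the observed sum is.
mu   <- n_b * mean(pooled)
p_mc <- (sum(abs(t_perm - mu) >= abs(t_obs - mu)) + 1) / (n_perm + 1)

print(c(W = W, T = t_obs, p = p_mc))
```

Increasing n_perm reduces the Monte Carlo error; with samples this small, full enumeration (as the package does) is of course preferable.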
Torsten
ks.test(A,B)
Two-sample Kolmogorov-Smirnov test
data: A and B
D = 0.475, p-value = 0.1244
alternative hypothesis: two.sided
Warning message:
cannot compute correct p-values with ties in: ks.test(A, B)
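The permutation idea mentioned in the reply can also be applied to the KS statistic itself, which avoids the continuity assumption behind the warning. The following is only a sketch under the exchangeability hypothesis: ks_stat is a hypothetical helper of mine (not part of ctest), and the seed and number of resamples are arbitrary assumptions.

```r
A <- c(104, 109, 112, 114, 116, 118, 118, 117, 121, 123, 125, 126, 126, 128, 128, 128)
B <- c(100, 105, 107, 107, 108, 111, 116, 120, 121, 123)

# Two-sample KS statistic: the largest absolute gap between the two
# empirical distribution functions, evaluated at every observed value.
ks_stat <- function(x, y) {
  z <- sort(unique(c(x, y)))
  max(abs(ecdf(x)(z) - ecdf(y)(z)))
}

d_obs <- ks_stat(A, B)   # same D as reported by ks.test(A, B)

# Monte Carlo permutation distribution of D, conditional on the pooled
# (tied) values: shuffle the group labels and recompute D each time.
set.seed(1)       # assumption: arbitrary seed
n_perm <- 5000    # assumption: number of random relabellings
pooled <- c(A, B)
n_a    <- length(A)
d_perm <- replicate(n_perm, {
  idx <- sample(length(pooled), n_a)
  ks_stat(pooled[idx], pooled[-idx])
})

# Small tolerance guards against floating-point noise when comparing D values.
p_mc <- (sum(d_perm >= d_obs - 1e-12) + 1) / (n_perm + 1)

print(c(D = d_obs, p = p_mc))
```

Because the reference distribution is built from the observed values, ties are accounted for automatically; the price is that the p-value carries Monte Carlo error unless all relabellings are enumerated.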
In their chapter 2, "Data in Biology", Sokal & Rohlf note "any given reading
of a continuous variable ... is therefore an approximation to the exact
reading, which is in practice unknowable. However, for the purposes of
computation these approximations are usually sufficient..."
I am interested to know whether this can be made more exact. Are there
methods to test that data are measured at an appropriate scale so as to be
regarded as sufficiently continuous for a KS test, or is common sense choice
of measurement precision widely regarded as sufficient?
Any comments/references would be appreciated!
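One informal way to probe the question above is a sensitivity check rather than a formal test: assume each value was rounded to the nearest unit, add uniform noise within that rounding interval, and see how much the KS p-value moves. If it barely changes, the rounding is probably harmless for this comparison. The half-width of 0.5, the seed and the number of replicates are all assumptions of this sketch.

```r
A <- c(104, 109, 112, 114, 116, 118, 118, 117, 121, 123, 125, 126, 126, 128, 128, 128)
B <- c(100, 105, 107, 107, 108, 111, 116, 120, 121, 123)

set.seed(1)          # assumption: arbitrary seed
half_width <- 0.5    # assumption: data recorded to the nearest whole unit
n_rep <- 1000        # assumption: number of jittered data sets

# Each replicate "un-rounds" the data by drawing a plausible true value
# uniformly within the rounding interval; this breaks the ties, so
# ks.test runs without the ties warning.
p_vals <- replicate(n_rep, {
  a <- A + runif(length(A), -half_width, half_width)
  b <- B + runif(length(B), -half_width, half_width)
  ks.test(a, b)$p.value
})

# Spread of p-values across the plausible un-rounded data sets.
print(summary(p_vals))
```

A wide spread would suggest that the measurement precision is too coarse for the KS p-value to be taken at face value.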
David Middleton
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._