Bootstrap or Wilcoxon's test?

The Wilcoxon rank sum test is not "plain and simple a test [of] equality of
distributions". If it were, it would be able to detect differences in
variance when the locations are similar. For that purpose it is, in
point of fact, useless. Compare these two simple situations w.r.t. the
WRS:

 > x <- rnorm(100)  # mean=0, sd=1
 > y <- rnorm(100, mean=0, sd=4)
 > wilcox.test(x,y)

	Wilcoxon rank sum test with continuity correction

data:  x and y
W = 4518, p-value = 0.2394
alternative hypothesis: true location shift is not equal to 0

 > y <- rnorm(100, mean=.2, sd=0)  # sd=0, so every y is exactly 0.2
 > wilcox.test(x,y)

	Wilcoxon rank sum test with continuity correction

data:  x and y
W = 3900, p-value = 0.004079
alternative hypothesis: true location shift is not equal to 0

It is a test of the equality of location (and the median is a readily  
understood non-parametric measure of location). The test is derived  
under the *assumption* that the samples are drawn from the *same*  
distribution differing only by a shift. If the distributions were not  
of the same family, the test would be invalidated. The wilcox.test  
help page is informative, saying "the null hypothesis is that the  
distributions of x and y differ by a location shift of mu". The
pseudomedian (one-sample case) or the difference in location
(two-sample case) is optionally estimated when conf.int is set to TRUE.
I also suggest looking at the formula for the statistic. It is available
with getAnywhere(wilcox.test.default).
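
For anyone who would rather see the statistic than read the source, here is a minimal sketch on fresh simulated data (the seed and variable names are my own choices, not taken from the session above). In the two-sample case the reported W is the rank sum of x in the pooled sample minus n_x(n_x+1)/2, which, when there are no ties, is simply the number of (x, y) pairs with x > y:

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- rnorm(100)                    # mean=0, sd=1
y <- rnorm(100, mean=0.2, sd=1)    # same shape, shifted by 0.2

nx <- length(x)
r  <- rank(c(x, y))                # ranks in the pooled sample

# W via ranks: rank sum of x minus its smallest possible value
W_ranks <- sum(r[seq_len(nx)]) - nx * (nx + 1) / 2

# W via pair counting (continuous data, so no ties to handle)
W_pairs <- sum(outer(x, y, ">"))

# W as reported by wilcox.test
W_test <- unname(wilcox.test(x, y)$statistic)

c(by_ranks = W_ranks, by_pairs = W_pairs, wilcox.test = W_test)

# With conf.int=TRUE the two-sample call also returns a location-shift estimate
wilcox.test(x, y, conf.int = TRUE)$estimate
```

Nothing in that statistic looks at spread: it only counts how often an x exceeds a y, which is why a pure change in variance around a common centre leaves it close to its null expectation.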

If one wants a test for "equality of distribution", one could turn to  
a more general test (with loss of power but with at least some  
potential for detecting differences in dispersion) such as the  
Kolmogorov-Smirnov or Kuiper tests. With x and y as above:

 > ks.test(x,y)

	Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.61, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : cannot compute correct p-values with ties
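
A single pair of samples proves little, so here is a rough simulation sketch of the point about dispersion. The number of replicates, the seed, and the helper name one_run are arbitrary choices of mine. It repeatedly draws two samples with the same mean but sd=1 versus sd=4 and records how often each test rejects at the 5% level:

```r
set.seed(42)    # arbitrary seed
nsim <- 200     # arbitrary number of replicates

# One replicate: equal locations, unequal spreads (sd=1 versus sd=4)
one_run <- function() {
  x <- rnorm(100)
  y <- rnorm(100, mean=0, sd=4)
  c(wilcoxon = wilcox.test(x, y)$p.value,
    ks       = ks.test(x, y)$p.value)
}

pvals <- replicate(nsim, one_run())    # 2 x nsim matrix of p-values

# Proportion of replicates in which each test rejects at the 5% level
reject_rate <- rowMeans(pvals < 0.05)
reject_rate
```

Under these assumptions the KS rejection rate should be far higher than the Wilcoxon one, which should stay near (though not exactly at) the nominal level.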

Returning to the OP's question, rather than worrying about normality
in samples, the greater threat to validity in regression methods is
unequal variances across groups or across the range of continuous
predictors.
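
As one possible way of checking for that in a two-group setting, here is a small sketch. The simulated data, the variable names resp and grp, and the choice of the Fligner-Killeen test are my own assumptions, not something the OP described. It compares the residual spread by group after a simple lm fit and then runs a rank-based test of homogeneity of variances:

```r
set.seed(7)                                   # arbitrary seed
grp  <- gl(2, 100, labels = c("a", "b"))      # two groups of 100
resp <- c(rnorm(100, sd=1), rnorm(100, sd=4)) # same mean, different spread

fit <- lm(resp ~ grp)

# Residual standard deviation within each group
res_sd <- tapply(resid(fit), grp, sd)
res_sd

# Fligner-Killeen test of equal variances across groups
fk <- fligner.test(resp ~ grp)
fk$p.value
```

A plot of resid(fit) against fitted(fit) serves the same purpose when the predictor is continuous.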