Skip to content

Comparing spatial distributions - permutation test implementation

3 messages · Jo Frabetti, Marcelino de la Cruz

#
Hello everyone,

I am looking at the joint spatial distribution of 2 kinds of organisms  
(estimated on a grid of points) and want to test for significant  
association or dissociation.

My first question is: do you know a nice technique to do that,  
considering that I have a limited number of points (36) but that they  
are repeated (4 times)? I did GLMs to test for correlations between  
the two (hence forgetting about the spatial aspect of it) and was  
previously pointed to the SADIE software. Would there be anything  
explicitly spatial and available in R please?

Then, Syrjala's test[1] seems appropriate and tests for differences in  
distribution. It computes a Cram?r-von Mises-type statistic and tests  
its significance with a permutation test.
I implemented the test in R and posted the code on these mailing  
lists[2]. Some people checked it and confirmed that the statistic  
gives correct results but my estimation of the p-value does not match  
the one predicted with the orignal software from Syrjala. I don't know  
what I am doing wrong. The permutation test is described by Syrjala as:

         (...) Under the null hypothesis,
         at a given sampling location (x_k, y_k), either density ob-
         servation y_i(x_k, y_k), i = 1, 2, is equally likely for each
         population. Thus, for a given data set, the distribution
         of the test statistic can be constructed by calculating
         the value of the test statistic for all 2^k pairwise per-
         mutations of the data set. (...) The level of signif-
         icance of a specific realization of the test statistic T is
         determined from its position in the ordered set of test
         statistic values from all 2^k permutations. (...)

My understanding is that, for each permutation I should choose a  
random number of points (between 1 and k), swap the values for species  
1 and species 2 at those points, and recompute the test on the new  
data. But this does not work :/ . Here is my code and associated data  
from Syrjala (for which I have reference values). Any advice would be  
very welcome (in particular if there is a way to leverage boot() for  
this).
NB: computing the 1000 permutations can be a bit lengthy, but  
fortunately, by using plyr, you get a nice progress bar to look at!

syrjala.stat <- function(x, y=NULL, var1=NULL, var2=NULL)
#
#	Compute Syrjala statistic
#	x, y			coordinates
#	var1, var2	value of 2 parameters both measured at (x,y) points
#	NB: x can also be a data.frame/matrix containing x,y,var1,var2 as  
columns
#
{
	# Input checks
	if (!is.null(ncol(x))) {
		if (ncol(x) == 4) {
			names(x) = c("x","y","var1","var2")
			dat = x
		} else {
			stop("Wrong number of columns in argument x")
		}
	} else {
		dat = data.frame(x, y, var1, var2)
	}

	# Normalize abundances
	dat$var1 = dat$var1/sum(dat$var1)
	dat$var2 = dat$var2/sum(dat$var2)

	# For each point (each line of dat)
	# compute the squared difference in gammas from each origin
	meanSqDiff = apply(dat, 1, function(d, coord, variab) {
		north = (coord$x>=d[1])
		east = (coord$y>=d[2])
		south = (coord$x<=d[1])
		west = (coord$y<=d[2])
		return( mean( c(
			(diff(sapply(variab[(north & east),], sum)))^2,
			(diff(sapply(variab[(south & east),], sum)))^2,
			(diff(sapply(variab[(south & west),], sum)))^2,
			(diff(sapply(variab[(north & west),], sum)))^2
			) )
		)
	}, dat[,c("x","y")], dat[,c("var1","var2")])

	# Compute the statistic (i.e. sum of mean squared differences)
	return(sum(meanSqDiff))
}


# Get data online : http://dl.getdropbox.com/u/1047321/syrjala_data_cod.csv
system("curl http://dl.getdropbox.com/u/1047321/syrjala_data_cod.csv >  
syrjala_data_cod.csv")

dataCod = read.csv(file = "syrjala_data_cod.csv", header = TRUE)

# Normalize abundances
dataCod$var1 = dataCod$var1/sum(dataCod$var1)
dataCod$var2 = dataCod$var2/sum(dataCod$var2)

# Number of permutations
nperm = 1000

# Create nperm-1 replicates of the data (one is the original  
observation)
d = rep(list(dataCod), nperm-1)

# Compute number of observations before to avoid doing that for every  
replicate
n = nrow(dataCod)

require(plyr)
# Permute some observations and compute the syrjala stat for each  
permutation
psis = ldply(d, .fun=function(x, n){
	# choose indices of observations to swap
	idx = sample(1:n, runif(1, min=1, max=n))
	# swap observations
	x[idx, 3:4] = x[idx, 4:3]
	# compute syrjala stat
	return(syrjala.stat(x))
}, n, .progress="text")
}

# Compute the syrjala stat for the observations
psi = syrjala.stat(dataCod)

# Estimate the pvalue
pvalue = (sum(psis>=psi)+1)/nperm

psi
pvalue
# Should be:
# statistic	= 0.224
# p-value	= 0.1900

Thank you very much in advance. Sincerely,

JiHO
---
http://jo.irisson.free.fr/

[1] A statistical test for a difference between the spatial  
distributions of two populations. Syrjala SE. Ecology. 1996;77(1):75?80.
http://dl.getdropbox.com/u/1047321/Syrjala1996.pdf

[2] https://stat.ethz.ch/pipermail/r-sig-geo/2008-February/ 
thread.html#3137
#
Hi,

Jose M. Blanco-Moreno an myself have implemented 
Syrjala's test in the ecespa package. As a matter 
of fact, Jose M. has implemented a Frotran 
version of Syrjala's original QBasic function. 
Using your data the results are very close to 
your reported # statistic= 0.224 and # p-value       = 0.1900.


Cheers,

Marcelino



# with Fortran code
 > syrjala0(dataCod[,1:2], dataCod$var1, dataCod$var2,R=F,nsim=1000)
Cramer-von Misses test for the difference between
the spatial distributions of  dataCod$var1 and dataCod$var2
based on 1000 permutations.

    psi:     0.223504
    p-value: 0.2167832

Kolmogorov-Smirnov test for the difference between
the spatial distributions of  dataCod$var1 and dataCod$var2
based on 1000 permutations.

    psi:     0.061354
    p-value: 0.2867133


### With R-code (a bit lengthy, so I use only 100 permutations)
 > syrjala0(dataCod[,1:2], dataCod$var1, dataCod$var2,R=T,nsim=100)
Syrjala test for the difference between the spatial distributions of
  dataCod$var1 and  dataCod$var2 , based on 100 simulations

    psi:      0.223504
    p-value:  0.2079208
At 17:34 20/05/2009, JiHO wrote:
________________________________

Marcelino de la Cruz Rot

Departamento de  Biolog?a Vegetal
E.U.T.I. Agr?cola
Universidad Polit?cnica de Madrid
28040-Madrid
Tel.: 91 336 54 35
Fax: 91 336 56 56
marcelino.delacruz at upm.es
#
On 2009-May-21 , at 05:40 , Marcelino de la Cruz wrote:

            
Thanks a lot. Having the fortran implementation is very nice!
I still don't find the values computed with the quickbasic code but  
the values reported in the article for this data set are different  
from those computed with QB and actually closer to yours.

Looking at your code I still don't get what I am doing wrong though.  
It seems we both use sample to get a few of the columns and then swap  
them. Well, I'll use your test anyway.

JiHO
---
http://jo.irisson.free.fr/