resampling for correlation and testing

On Wed, Mar 28, 2012 at 3:53 PM, Benton, Paul
<hpaul.benton08 at imperial.ac.uk> wrote:
Hello all R-er,

I'm trying to run a resampling method on some data. The current method I have takes 2+ days or a lot of memory . I was wondering if anyone has a better suggestion.

Currently I take a matrix and get the correlation matrix from it. This will be called rho.A. Each element in this will be tested against the distribution from the resampled correlation B matrix.

Some example code:

A<-matrix(rnorm(100), ncol=10)
B<-matrix(rnorm(100), ncol=10)

rho.A<-cor(A)

{
idx<-sample(1:10, 10)
idx
# [1] 8 4 5 7 1 9 2 10 6  3

rho.B<-cor(B[,idx])
} ## repeat this x time (currently 500)

## in essence we then have the following :
rho.arrayB<-array(runif((10*10)*500), dim=c(10,10,500))
Err... no we don't. sample(10,10) ; sample(10,10) ... only permutes
the columns, so the 500 cor(B) have exactly the same values in
different off diag positions. Using runif they are unique
My apologies for trying to make some example data. I should have done exactly what I was doing, which would have been harder to read. Since runif doesn't give exactly the same thing I'll give my functions that I'm using.

## note rperm was originally submitted to the stackexchange.com
## http://stats.stackexchange.com/questions/24300/how-to-resample-in-r-without-repeating-permutations
rperm <- function(m, size=2, replaceTF=TRUE) { # Obtain m unique permutations of 1:size
    # Function to obtain a new permutation.
    newperm <- function() {
        count <- 0                # Protects against infinite loops
        repeat {
            # Generate a permutation and check against previous ones.
            p <- sample(1:size, replace=replaceTF)
            hash.p <- paste(p, collapse="") # make a name for the list
            if (is.null(cache[[hash.p]])) break 
            ## stop if we have not already made this seq - repeat to make another new seq

            # Prepare to try again.
            count <- count+1
            if (count > 1000) {   # 1000 is arbitrary; adjust to taste
                p <- NA           # NA indicates a new permutation wasn't found
                hash.p <- ""
                break
            }
        }
        cache[[hash.p]] <<- TRUE  # Update the list of permutations found
        p                         # Return this (new) permutation
    }

    # Obtain m unique permutations.
    cache <- list()
    replicate(m, newperm())  
} 

library(parallel)
cl <- makeCluster(detectCores())
bootComb<-t(rperm(times, ncol(mat.B))) 
pValues.mat<-matrix(NA, nrow=nrow(mat.B), ncol=nrow(mat.B))
	for(i in 1:nrow(mat.B)){
		cat(round(i/nrow(mat.B),2)*100, "% \r")
		rho<-parApply(cl, bootComb, 1, function(idx, xmat, ymat){
			rho<-apply(ymat[,idx], 1, function(ymat, xmat){
				## parallel here due to memory sizes
				rho<-cor(xmat, ymat)
			}, xmat[idx]) ##gives back a vector of rho for n vs m
		}, mat.B[i, ], mat.B) ## returns a matrix of rho's for each combination

		rho.all<- cbind(rho, rho.A[i,])
		pValues.mat[i, ] <- parApply(cl, rho.all, 1, function(rhoVec){
			tl <- length(rhoVec) ## test index
			bl <- tl-1 ## dist index
			wilcox.test(rhoVec[1:bl], rhoVec[tl])$p.value
		})
	}
	stopCluster(cl)
## Then test if rho.A[1,1] come from the distribution of rho.B[1,1]
pvalueMat[1,1]<-wilcox.test(rho.array[1,1,] , rho.A[1,1])$p.value

From what I know cor(A)[ i , i ] = cor(B)[ j , j ] = 1   for any
choice of A,B,i and j
No, cor(a)[i, j] != cor(b)[i , j] 

If your concern is because they are coming from the same distribution then again this is example data. Even then I would imagine that the correlation would be different for small n. Either way resampling the columns will give different  correlation values. Please try the code yourself.
I don't think Wilcox intended his test to be used in this way?.
Probably true ?. care to suggest a different statistical test ?
I would start with fixing these issues first so you don't wait 2 days
for a vector of NaN's
Actually I didn't get a vector or NaN's I got a matrix of pvalues. Which has proved useful. Now I want to make the function faster and was looking for a bit of help.
Cheers

However, my array size would be 2300 x 2300 x 500 which R won't let me even make as an empty structure. Any suggestion are more than welcomed !!

Cheers,

Paul

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

resampling for correlation and testing

Thread (4 messages)