resampling for correlation and testing
On Mar 29, 2012, at 1:41 AM, ilai wrote:
On Wed, Mar 28, 2012 at 3:53 PM, Benton, Paul <hpaul.benton08 at imperial.ac.uk> wrote:
Hello all R-er,
I'm trying to run a resampling method on some data. The current method I have takes 2+ days or a lot of memory . I was wondering if anyone has a better suggestion.
Currently I take a matrix and get the correlation matrix from it. This will be called rho.A. Each element in this will be tested against the distribution from the resampled correlation B matrix.
Some example code:
A<-matrix(rnorm(100), ncol=10)
B<-matrix(rnorm(100), ncol=10)
rho.A<-cor(A)
{
idx<-sample(1:10, 10)
idx
# [1] 8 4 5 7 1 9 2 10 6 3
rho.B<-cor(B[,idx])
} ## repeat this x time (currently 500)
## in essence we then have the following :
rho.arrayB<-array(runif((10*10)*500), dim=c(10,10,500))
Err... no we don't. sample(10,10) ; sample(10,10) ... only permutes the columns, so the 500 cor(B) have exactly the same values in different off diag positions. Using runif they are unique
My apologies for trying to make some example data. I should have done exactly what I was doing, which would have been harder to read. Since runif doesn't give exactly the same thing I'll give my functions that I'm using. ## note rperm was originally submitted to the stackexchange.com ## http://stats.stackexchange.com/questions/24300/how-to-resample-in-r-without-repeating-permutations rperm <- function(m, size=2, replaceTF=TRUE) { # Obtain m unique permutations of 1:size # Function to obtain a new permutation. newperm <- function() { count <- 0 # Protects against infinite loops repeat { # Generate a permutation and check against previous ones. p <- sample(1:size, replace=replaceTF) hash.p <- paste(p, collapse="") # make a name for the list if (is.null(cache[[hash.p]])) break ## stop if we have not already made this seq - repeat to make another new seq # Prepare to try again. count <- count+1 if (count > 1000) { # 1000 is arbitrary; adjust to taste p <- NA # NA indicates a new permutation wasn't found hash.p <- "" break } } cache[[hash.p]] <<- TRUE # Update the list of permutations found p # Return this (new) permutation } # Obtain m unique permutations. cache <- list() replicate(m, newperm()) } library(parallel) cl <- makeCluster(detectCores()) bootComb<-t(rperm(times, ncol(mat.B))) pValues.mat<-matrix(NA, nrow=nrow(mat.B), ncol=nrow(mat.B)) for(i in 1:nrow(mat.B)){ cat(round(i/nrow(mat.B),2)*100, "% \r") rho<-parApply(cl, bootComb, 1, function(idx, xmat, ymat){ rho<-apply(ymat[,idx], 1, function(ymat, xmat){ ## parallel here due to memory sizes rho<-cor(xmat, ymat) }, xmat[idx]) ##gives back a vector of rho for n vs m }, mat.B[i, ], mat.B) ## returns a matrix of rho's for each combination rho.all<- cbind(rho, rho.A[i,]) pValues.mat[i, ] <- parApply(cl, rho.all, 1, function(rhoVec){ tl <- length(rhoVec) ## test index bl <- tl-1 ## dist index wilcox.test(rhoVec[1:bl], rhoVec[tl])$p.value }) } stopCluster(cl)
## Then test if rho.A[1,1] come from the distribution of rho.B[1,1] pvalueMat[1,1]<-wilcox.test(rho.array[1,1,] , rho.A[1,1])$p.value
From what I know cor(A)[ i , i ] = cor(B)[ j , j ] = 1 for any choice of A,B,i and j
No, cor(a)[i, j] != cor(b)[i , j] If your concern is because they are coming from the same distribution then again this is example data. Even then I would imagine that the correlation would be different for small n. Either way resampling the columns will give different correlation values. Please try the code yourself.
I don't think Wilcox intended his test to be used in this way?.
Probably true ?. care to suggest a different statistical test ?
I would start with fixing these issues first so you don't wait 2 days for a vector of NaN's
Actually I didn't get a vector or NaN's I got a matrix of pvalues. Which has proved useful. Now I want to make the function faster and was looking for a bit of help.
Cheers
However, my array size would be 2300 x 2300 x 500 which R won't let me even make as an empty structure. Any suggestion are more than welcomed !! Cheers, Paul
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.