
Prevent calculation when only NA

7 messages · jeff6868, Jim Lemon, Rui Barradas

#
Hi everybody,

I have a small question about R.
I'm computing correlation matrices between my files. Each file contains
4 columns of data.
These data files also contain missing data. Sometimes one of the 4 columns
in a file contains only missing data (NA). As I'm computing correlations
between the same columns of each file, I get a correlation matrix with a
row and a column containing only NAs, like this:

        file1  file2  file3
file1     1     NA     0.8
file2    NA      1     NA
file3    0.8    NA      1

For file2, I have no correlation coefficient.
My function looks for the highest correlation coefficient for each
file, but I get an error message because of this.
My question is: how can I tell the function not to do any calculation if
it sees only NAs for the file it is working on? The aim of this function is
to automate this calculation for 300 files.
I tried adding na.rm=TRUE, but it still wants to do the calculation for
the file containing only NAs (error: 0 (non-NA) cases).
Could you tell me what I should add to my function? Thanks a lot!

get.max.cor <- function(station, mat){
    mat[row(mat) == col(mat)] <- -Inf
    which( mat[station, ] == max(mat[station, ], na.rm=TRUE) )
}





--
View this message in context: http://r.789695.n4.nabble.com/Prevent-calculation-when-only-NA-tp4630716.html
Sent from the R help mailing list archive at Nabble.com.
#
On 05/21/2012 05:59 PM, jeff6868 wrote:
Hi Jeff,
Can you use:

if(any(!is.na(mat))) {
...
}

Jim
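
One way this kind of guard could look inside get.max.cor is sketched below. It is an assumption, not code from the thread: the check is applied to the station's own row rather than to the whole matrix (the diagonal is never NA, so a whole-matrix check would always pass), and the names get.max.cor.guarded and cors are made up for the illustration.

```r
# Sketch only: get.max.cor as posted, plus a guard on the station's own row.
# get.max.cor.guarded and cors are hypothetical names used for illustration.
get.max.cor.guarded <- function(station, mat) {
  mat[row(mat) == col(mat)] <- -Inf   # exclude the self-correlation on the diagonal
  r <- mat[station, ]
  if (!any(is.finite(r))) {
    return(integer(0))                # only NA besides the diagonal: nothing to compare
  }
  which(r == max(r, na.rm = TRUE))
}

# The correlation matrix from the first message, as a numeric matrix.
cors <- matrix(c(1,   NA, 0.8,
                 NA,  1,  NA,
                 0.8, NA, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("file", 1:3), paste0("file", 1:3)))

get.max.cor.guarded("file1", cors)  # 3, the position of file3
get.max.cor.guarded("file2", cors)  # integer(0), so the caller can skip file2
```

Returning an empty index vector is a design choice: the calling code can test length() before going any further.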
#
Hi Jim,

Thanks for your answer.
I tried your suggestion. The idea seems good, but I still get my
error.
Actually, the error occurs in the next function, which uses the
get.max.cor function I showed you before.
I also tried these 2 functions with data containing no missing values, and
they work well.
But I think that the next function does the calculation by column (it
seems to read each column).
Do you think it's possible to add something to get.max.cor
that stops the calculation for a file if there are only NAs in the
correlation matrix for this file, instead of removing the NAs?
For example: if there are only NAs for file2, don't try to do any calculation
with file2 and go on to file3 (and so on)?
I think that this is the problem, because even if I remove the NAs, it still
wants to do a calculation. But as there are no numeric values, it gives an
error.


#
Hello,

Maybe the function could return a special value, such as zero.
Since a column with that number doesn't exist, the code executed afterward
would simply move on to the second greatest correlation.
The function would then become

get.max.cor <- function(station, mat){
    mat[row(mat) == col(mat)] <- -Inf
    if(sum(is.na(mat[station, ])) == ncol(mat) - 1)
        0
    else
        which( mat[station, ] == max(mat[station, ], na.rm=TRUE) )
}

df1 <- read.table(text="
          file1 file2 file3
file1    1       NA    0.8
file2    NA     1     NA  
file3   0.8     NA     1
", header=TRUE)

get.max.cor("file2", df1)
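
As a rough illustration of how calling code might use the zero as a "skip this file" marker, here is a sketch under assumptions: the correlation matrix is a plain numeric matrix called cors, and the names best and usable are invented for the example. The real downstream function is not shown in the thread, so this only covers the skipping step.

```r
# The function from this message, repeated so the sketch runs on its own.
get.max.cor <- function(station, mat) {
  mat[row(mat) == col(mat)] <- -Inf
  if (sum(is.na(mat[station, ])) == ncol(mat) - 1)
    0
  else
    which(mat[station, ] == max(mat[station, ], na.rm = TRUE))
}

# Hypothetical input: the example correlation matrix as a numeric matrix.
cors <- matrix(c(1,   NA, 0.8,
                 NA,  1,  NA,
                 0.8, NA, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("file", 1:3), paste0("file", 1:3)))

# Best-correlated column for every station; 0 marks a station with only NAs.
best <- lapply(rownames(cors), get.max.cor, mat = cors)
names(best) <- rownames(cors)

# Keep only the stations for which a real column index came back.
usable <- Filter(function(idx) all(idx != 0), best)
names(usable)  # "file1" "file3"
```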


Hope this helps,

Rui Barradas

#
Hello Rui,

Thanks for your answer too.
I tried your suggestion too, but when the function returns 0 for this file,
the next step still wants to do a calculation with it. As it looks for the best
correlation and then the 2nd best correlation, returning only 0 seems to be a
problem, at least for the 2nd best correlation.
Maybe the best way to solve the problem would be to add a line to
get.max.cor that deletes all the columns containing
only NAs in my correlation matrix?
For example, suppose my calculated correlation matrix is (imagine that the
numeric values are correlation coefficients):

x <- data.frame(a = 1:10, b = c(1:5,NA,7:9, NA), c = 21:30, d = NA)

Maybe it is possible in my function to delete only the columns containing 100%
NA, in order to get a matrix like this:

 x <- data.frame(a = 1:10, b = c(1:5,NA,7:9, NA), c = 21:30)

and to keep the other columns even if they contain some NAs (the calculation is
still possible as there are numeric coefficients in the column).
The point is that it cannot look for the best or the second best correlation
coefficient in a column if it contains only NA.
I think that a correlation matrix like this would allow the calculation in
the next function and the rest of my script.
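
For the data frame x above, dropping only the columns that are entirely NA could be sketched like this. The names all.na and x.clean are made up for the illustration; columns with some NAs, such as b, are kept.

```r
x <- data.frame(a = 1:10, b = c(1:5, NA, 7:9, NA), c = 21:30, d = NA)

# A column is dropped only if every one of its values is NA.
all.na <- colSums(is.na(x)) == nrow(x)

# drop = FALSE keeps a data frame even if a single column survives.
x.clean <- x[, !all.na, drop = FALSE]

names(x.clean)  # "a" "b" "c"
```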

#
Try this.

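check.na <- function(mat){
    nas <- NULL
    # a station is unusable when everything except its diagonal entry is NA
    for(st in seq.int(ncol(mat)))
        if(sum(is.na(mat[, st])) == nrow(mat) - 1) nas <- c(nas, st)
    if(length(nas)){
        # remove rows and columns together; drop = FALSE keeps two dimensions
        mat <- mat[-nas, -nas, drop = FALSE]
    }
    mat
}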
	
check.na(df1)
      file1 file3
file1   1.0   0.8
file3   0.8   1.0

Note that you must remove both the columns and the rows, since it's a
correlation matrix.
That's also why the test uses 'nrow(mat) - 1' here (and 'ncol(mat) - 1' in
get.max.cor): the diagonal value is never NA.

Rui Barradas

#
I tried your function. It works great, thanks. I then used diag() so that
the whole diagonal of my matrix has the value "1". But it still doesn't
work... it's crazy.
Deleting the columns and rows (and so some files) containing only NAs from
the correlation matrix doesn't work when I apply the function, because I'm
working on a list of files.
Once files are deleted from the correlation matrix, the function cannot be
applied to the list.files (the dimensions differ if I delete some files
from the correlation matrix). And as I don't know before the calculation
which files are going to produce these NA columns and rows, I have to do it
another way.
I think I should first select, for my list (and for the correlation), only
the files which contain at least, for example, 1000 numeric values in a
certain array in order to calculate my correlations. But I'll post that in
another topic.
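
One possible way around the dimension mismatch is to build a single logical mask and apply it to both the correlation matrix and the file vector, so the two stay the same length. This is only a sketch under assumptions: the file names are hypothetical stand-ins for the output of list.files(), and the matrix is assumed to be in the same order as the files.

```r
# Hypothetical file names standing in for the result of list.files().
files <- c("file1.txt", "file2.txt", "file3.txt")

# Assumed: the correlation matrix has one row and column per file, same order.
cors <- matrix(c(1,   NA, 0.8,
                 NA,  1,  NA,
                 0.8, NA, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(files, files))

# Ignore the diagonal, then keep stations with at least one real coefficient.
off.diag <- cors
diag(off.diag) <- NA
keep <- rowSums(!is.na(off.diag)) > 0

# Apply the same mask to the matrix and to the file vector.
cors.kept  <- cors[keep, keep, drop = FALSE]
files.kept <- files[keep]

files.kept  # "file1.txt" "file3.txt"
```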

