Hi everybody,
I have a small question about R.
I'm computing correlation matrices between my files. Each file contains
4 columns of data.
The files also contain missing data. Sometimes one of the 4 columns in a
file contains only missing values (NA). Since I compute correlations between
the same column of each file, I get a correlation matrix with a row and a
column containing only NAs, like this:
      file1 file2 file3
file1   1.0    NA   0.8
file2    NA   1.0    NA
file3   0.8    NA   1.0
For file2, I have no correlation coefficient.
My function looks for the highest correlation coefficient for each
file, but I get an error message because of this.
My question is: how can I tell the function not to do any calculation if
it sees only NAs for the file it is working on? The aim is to automate
this calculation for 300 files.
I tried adding na.rm=TRUE, but it still wants to do the calculation for
the file containing only NAs (error: 0 (non-NA) cases).
Could you tell me what I should add to my function? Thanks a lot!
get.max.cor <- function(station, mat){
    mat[row(mat) == col(mat)] <- -Inf
    which( mat[station, ] == max(mat[station, ], na.rm=TRUE) )
}
--
View this message in context: http://r.789695.n4.nabble.com/Prevent-calculation-when-only-NA-tp4630716.html
Sent from the R help mailing list archive at Nabble.com.
Prevent calculation when only NA
7 messages · jeff6868, Jim Lemon, Rui Barradas
On 05/21/2012 05:59 PM, jeff6868 wrote:
[...]
Hi Jeff,
Can you use:
if(any(!is.na(mat))) {
...
}
Jim
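One way to read this suggestion is to apply the check to the row of the file being processed rather than to the whole matrix. The following is only a sketch under two assumptions that are not in the thread: the function returns NA when a file has no usable coefficient, and the diagonal is set to NA instead of -Inf so that the self-correlation does not count as a usable value. The matrix m is a stand-in for the example in the question.

```r
# Sketch: check the station's row before looking for the maximum.
get.max.cor <- function(station, mat) {
  mat <- as.matrix(mat)
  diag(mat) <- NA                      # ignore the self-correlation
  r <- mat[station, ]
  if (any(!is.na(r))) {
    which(r == max(r, na.rm = TRUE))   # position of the best match
  } else {
    NA_integer_                        # assumed "nothing to compare" value
  }
}

# Stand-in for the correlation matrix shown in the question
files <- c("file1", "file2", "file3")
m <- matrix(c(1.0,  NA, 0.8,
               NA, 1.0,  NA,
              0.8,  NA, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(files, files))

get.max.cor("file1", m)   # position of file3
get.max.cor("file2", m)   # NA, no calculation attempted
```

Whatever calls get.max.cor then has to test the result with is.na() before using it as an index.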
Hi Jim,
Thanks for your answer. I tried your suggestion. The idea seems good, but I still get the error. Actually, the error occurs in the next function, which uses the get.max.cor function I showed before. I also tried these 2 functions with data containing no missing values, and they work well. But I think the next function does the calculation column by column (it seems to read each column).
Do you think it's possible to add something to get.max.cor that skips the calculation for a file if there are only NAs in the correlation matrix for that file, instead of removing the NAs? For example: if there are only NAs for file2, don't try to do any calculation with file2 and go on to file3 (and so on)?
I think this is the problem, because even if I remove the NAs, it still wants to do a calculation. But as there are no numeric values, it gives an error.
Hello,
Maybe the function could return a special value, such as zero.
Since a column with that number doesn't exist, the code executed afterward
would simply move on to the second greatest correlation.
The function would then become
get.max.cor <- function(station, mat){
    mat[row(mat) == col(mat)] <- -Inf
    if(sum(is.na(mat[station, ])) == ncol(mat) - 1)
        0
    else
        which( mat[station, ] == max(mat[station, ], na.rm=TRUE) )
}
df1 <- read.table(text="
file1 file2 file3
file1 1 NA 0.8
file2 NA 1 NA
file3 0.8 NA 1
", header=TRUE)
get.max.cor("file2", df1)
Hope this helps,
Rui Barradas
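To see how the special value behaves across all files, the function above can be applied to every row name in turn. This is only an illustrative sketch: the use of sapply and the name res are assumptions, and if two files tie for the highest coefficient, sapply may return a list rather than a vector.

```r
# The function and example data from the message above
get.max.cor <- function(station, mat){
    mat[row(mat) == col(mat)] <- -Inf
    if(sum(is.na(mat[station, ])) == ncol(mat) - 1)
        0
    else
        which( mat[station, ] == max(mat[station, ], na.rm=TRUE) )
}

df1 <- read.table(text="
file1 file2 file3
file1 1 NA 0.8
file2 NA 1 NA
file3 0.8 NA 1
", header=TRUE)

# One result per file; 0 marks a file with no usable coefficient
res <- sapply(rownames(df1), get.max.cor, mat = df1)

# Files to skip in later steps
skipped <- names(res)[res == 0]
```

Later steps would need to test for the 0 explicitly, since indexing with 0 selects nothing rather than raising an error.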
Hello Rui,
Thanks for your answer too. I tried your suggestion, but when the function returns 0 for this file, the next step still wants to do a calculation with it. As it looks for the best correlation and then the 2nd best correlation, returning only 0 seems to be a problem, at least for the 2nd best correlation.
Maybe the best way to solve the problem would be to add a line to get.max.cor that deletes all the columns containing only NAs in my correlation matrix? For example, if my calculated correlation matrix is (imagine that the numeric values are correlation coefficients):
x <- data.frame(a = 1:10, b = c(1:5,NA,7:9, NA), c = 21:30, d = NA)
maybe it is possible in my function to delete only the columns containing 100% NA, in order to get a matrix like this:
x <- data.frame(a = 1:10, b = c(1:5,NA,7:9, NA), c = 21:30)
and to keep the other columns even if they contain some NAs (the calculation is still possible, as there are numeric coefficients in the column).
The function cannot look for the best or the second best correlation coefficient in a column that contains only NA. I think a correlation matrix like this would allow the calculation in the next function and the rest of my script.
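For the data frame x described above, dropping only the columns that are 100% NA can be sketched as follows. The names keep and x2 are illustrative, and this handles columns only; it does not yet treat x as a symmetric correlation matrix.

```r
x <- data.frame(a = 1:10, b = c(1:5, NA, 7:9, NA), c = 21:30, d = NA)

# A column is kept if it has at least one non-NA value
keep <- colSums(!is.na(x)) > 0

# drop = FALSE keeps a data frame even if a single column survives
x2 <- x[, keep, drop = FALSE]

names(x2)   # "a" "b" "c"
```

Column b is kept because it still holds numeric values alongside its two NAs.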
Try this.
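check.na <- function(mat){
    nas <- NULL
    # collect the columns in which everything except the diagonal is NA
    for(st in seq.int(ncol(mat)))
        if(sum(is.na(mat[, st])) == nrow(mat) - 1) nas <- c(nas, st)
    if(length(nas)){
        # drop = FALSE keeps the 2-d shape even if only one file remains
        mat <- mat[, -nas, drop = FALSE]
        mat <- mat[-nas, , drop = FALSE]
    }
    mat
}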
check.na(df1)
file1 file3
file1 1.0 0.8
file3 0.8 1.0
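Note that you must remove both the columns and the rows, since it's a
correlation matrix.
That's also why the test compares against 'nrow(mat) - 1' (the same as
'ncol(mat) - 1' for a square matrix): the diagonal value is not NA, so a
column with no usable coefficient has exactly that many NAs.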
Rui Barradas
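The same idea can be written without a loop. The sketch below is a variant, not the function from the message above: it ignores the diagonal entirely, so it also catches a file whose diagonal entry is itself NA. The name drop.all.na and the matrix m are assumptions made for the example.

```r
# Sketch: remove every file whose off-diagonal coefficients are all NA.
drop.all.na <- function(mat) {
  mat <- as.matrix(mat)
  off <- mat
  diag(off) <- NA                      # look at off-diagonal entries only
  bad <- rowSums(!is.na(off)) == 0     # TRUE for files with nothing usable
  mat[!bad, !bad, drop = FALSE]        # remove matching rows and columns
}

files <- c("file1", "file2", "file3")
m <- matrix(c(1.0,  NA, 0.8,
               NA, 1.0,  NA,
              0.8,  NA, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(files, files))

res <- drop.all.na(m)
rownames(res)   # "file1" "file3"
```

Because the row and column names survive, the result can still be matched back to the original files by name.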
I tried your function. It works great, thanks. I then used diag() to set the whole diagonal of my matrix to "1". But it still doesn't work... it's crazy.
After deleting the columns and rows (and so some files) containing only NAs in the correlation matrix, applying the function fails, because I'm working on a list of files. Once files are deleted from the correlation matrix, the function cannot be applied to the list of files (the dimensions differ if I delete some files from the correlation matrix). And as I don't know before the calculation which files are going to produce these NA columns and rows, I have to do it another way.
I think I should first select, for my list (and for the correlation), only the files which contain at least, for example, 1000 numeric values in a certain column, and then calculate my correlations. But I'll post that in another topic.
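The pre-selection described here could look roughly like the sketch below. Everything in it is hypothetical: the list files, the column name temp, and the threshold min.n (set to 3 so the toy data works, where the message mentions 1000). It also assumes all files have the same number of rows.

```r
# Hypothetical stand-in for the list of data frames read from the files
files <- list(
  file1 = data.frame(temp = c(1.2, 2.3,  NA, 4.1)),
  file2 = data.frame(temp = c( NA,  NA,  NA,  NA)),
  file3 = data.frame(temp = c(0.9, 2.0, 3.5, 4.4))
)

min.n <- 3   # minimum number of non-NA values required (assumed threshold)

# Keep only the files with enough numeric values in the column of interest
enough <- sapply(files, function(d) sum(!is.na(d$temp)) >= min.n)
usable <- files[enough]

# The list and the correlation matrix now cover the same files
mat <- cor(sapply(usable, function(d) d$temp),
           use = "pairwise.complete.obs")

names(usable)   # "file1" "file3"
```

Filtering the list before calling cor() keeps the list and the matrix the same size, so a function applied to one can index into the other.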