error in NORM lib
try MICE package :) regards, Roel de Jong
(Ted Harding) wrote:
Folks, Leo G??rther and I have been privately discussing the problems with imputation using NORM which he originally described on 9 November. Essentially, he observed that many of the imputed missing values were totally absurd, being well out of any range comatible with the observed values of the variables. After following a few false trails, we have discovered the reason. People interested in using NORM (and CAT and MIX and maybe PAN) may well be interested in this reason! The dataset, which can be downloaded from his URL http://www.anicca-vijja.de/lg/dframe.Rdata consists of a matrix with 74 columns and 200 rows. There are 553 missing values out of the 14800 (less than 4%), and the distributions of the observed values of the variables are well-behaved. So this should not be a problematic dataset. 61 of the 74 columns have missing values (NAs) in them, and this is the reason why NORM fails. Specifically, the first few lines of the code of the function prelim.norm() are as follows: if (is.vector(x)) x <- matrix(x, length(x), 1) n <- nrow(x) p <- ncol(x) storage.mode(x) <- "double" r <- 1 * is.na(x) nmis <- as.integer(apply(r, 2, sum)) names(nmis) <- dimnames(x)[[2]] mdp <- as.integer((r %*% (2^((1:ncol(x)) - 1))) + 1) and, as can be seen from the last line, if there are missing values in a column with index > 31 then (r %*% (2^((1:ncol(x)) - 1))) + 1 >= 2^31 and then applying as.integer() to this value returns NA since as.integer only works for numbers no greater than .Machine$integer.max, normally 2^31 - 1. (Is the situation different for R on say 64-bit machines?) The value of mdp[i] is a "packed" binary encoding of the column positions of any NAs in row i: if bit j-1 (counting from 0) in the binary representation of mdp[i] is 1, then there is an NA in column j of row i. The vector mdp is used at various places in the NORM routines, and the effect on the imputations of having NAs in it, when the functioning of the routines depends on unpacking the encoding, is catastrophic. (Experiment had shown, indeed, that imputing with a subset of fewer than 32 columns always gave acceptable results). The upshot of this is that NORM cannot be used for multiple imputations if there are more than 31 columns in the data which have NAs in them. You could have more than 31 columns of data -- indeed Leo's 74 would have worked then -- if the columns are re-ordered so that all the columns with NAs are at the left, provided there are fewer than 32 with NAs. Unfortunately Leo has 61. There is in principle no necessity to represent NA positions in this way, but that is how Shafer did it and it was carried over into R. An alternative method would simply be to have a 0/1 matrix of NA indicators, but the code for the NORM functions would have to be picked through to replace the unpacking of mdp -- and this includes FORTRAN routines (Oh dear, echoes of the "open source and R" discussion)! So removing this limitation would not be trivial. I have not noticed mention of the limitation in the documentation of the NORM functions. Exactly the same construction of mdp, and therefore exactly the same problem, occurs in prelim.cat in CAT, for which I'm joint maintainer with Fernando Tusell, so we had better try to look into that! Any lessons we learn will be broadcast, so should be useful for NORM as well. And, for good measure, in MIX it occurs twice over in prelim.mix: once in constructing mdpz for the continuous variables, and once in mdpw for the categorical variables. This is perhaps less likely in practice to cause the problem in MIX, since it would arise only if either there were more than 31 columns of continuous variables with NAs, or more than 31 of categorical variables; so MIXers can spread their bets. Again, I have not noticed that the limitation is mentioned in the documentation of MIX; and I'm pretty sure it is not in the documentation of CAT! Any suggestions or guidance from people who are familiar with NORM and MIX will be most welcome. I should add that I have not looked into PAN, but would not be surprised if it were there as well. I've written this explanation in consultation with Leo G??rtler, and he has proposed that I should publish it to the R List; but please consider that it is a joint effort. Best wishes to all, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 15-Nov-05 Time: 00:45:46 ------------------------------ XFMail ------------------------------
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html