Adding NA values in random positions in a dataframe
An essentially identical approach that may be a tad clearer -- but
requires additional space -- first creates a logical vector for the
locations of the NA's in the unlisted data.frame. Further NA positions
are randomly added and then the augmented vector is used as a logical
matrix to index where the NA's should go in the data frame:
df <- data.frame(a = c(1:3, NA, 4:6),
                 b = c(letters[1:6], NA),
                 c = c(1, NA, runif(5)))
nr <- nrow(df); nc <- ncol(df)
p <- .3 ## desired total proportion of NA's
ina <- is.na(unlist(df)) ## logical vector, TRUE corresponds to NA positions
n2 <- floor(p*nr*nc) - sum(ina) ## number of new NA's
ina[sample(which(!ina), n2)] <- TRUE ## sample only from positions that are not already NA
df[matrix(ina, nrow = nr, ncol = nc)] <- NA ## using matrix indexing
df
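
For reuse, the same idea can be wrapped in a function. What follows is only a minimal sketch: the name add_random_na is hypothetical (it does not come from this thread), it assumes 0 <= p <= 1, and it uses is.na(df) directly as the logical matrix instead of unlisting and rebuilding it.

```r
## Hypothetical helper, not from the thread: raise the share of NA cells
## in a data frame to (at most) proportion p, keeping existing NA's in place.
add_random_na <- function(df, p) {
  ina <- is.na(df)                              # logical matrix, same shape as df
  n_new <- floor(p * length(ina)) - sum(ina)    # number of NA's still to add
  if (n_new > 0) {
    free <- which(!ina)                         # positions not yet NA
    ## index into `free` via sample.int(); sample(free, n) would treat a
    ## length-one `free` as 1:free
    ina[free[sample.int(length(free), n_new)]] <- TRUE
  }
  df[ina] <- NA                                 # logical matrix indexing
  df
}

set.seed(1)
df <- data.frame(a = c(1:3, NA, 4:6),
                 b = c(letters[1:6], NA),
                 c = c(1, NA, runif(5)))
out <- add_random_na(df, 0.3)
sum(is.na(out))               # 6, i.e. floor(0.3 * 21)
all(is.na(out)[is.na(df)])    # TRUE: the original NA's are still NA
```

If the data already holds at least the requested proportion of NA's, the sketch returns the data frame unchanged rather than failing.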
Cheers,
Bert
On Fri, Nov 29, 2013 at 10:09 AM, arun <smartpink111 at yahoo.com> wrote:
Hi,
I used that because 10% of the values in the data were already NA. You are right. Sorry, ?match() is unnecessary. I was trying another solution with match() which didn't work out and forgot to check whether it was adequate or not.

set.seed(49)
dat1[!is.na(dat1)][sample(seq(dat1[!is.na(dat1)]),length(dat1[!is.na(dat1)])*(0.20))] <- NA

A.K.

Thanks for the reply. I don't get the 0.20 multiplied by the length of the non-NA values; where did you take it from? Furthermore, why do we have to use the function match? Wouldn't it be enough to use the sample function?

On Thursday, November 28, 2013 12:57 PM, arun <smartpink111 at yahoo.com> wrote:

Hi,
One way would be:

set.seed(42)
dat1 <- as.data.frame(matrix(sample(c(1:5,NA),50,replace=TRUE,prob=c(10,15,15,20,30,10)),ncol=5))
set.seed(49)
dat1[!is.na(dat1)][ match( sample(seq(dat1[!is.na(dat1)]),length(dat1[!is.na(dat1)])*(0.20)),seq(dat1[!is.na(dat1)]))] <- NA
length(dat1[is.na(dat1)])/length(unlist(dat1))
#[1] 0.28

A.K.

Hello,
I'm quite new at R so I don't know which is the most efficient way to execute a function that I could write easily in other languages. This is my problem: I have a dataframe with a certain number of NA values (approximately 10%). I want to add other NA values in random positions of the dataframe until reaching an overall proportion of NA values of 30% (clearly the positions with NA values don't have to change). I tried looking at iterative functions in R such as apply or sapply, but I can't actually figure out how to use them in this case. Thank you.
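
On the 0.20 question in the quoted thread: taking 20% of the non-NA cells is not the same as adding 20 percentage points overall, which would explain the quoted 0.28 rather than 0.30. A small arithmetic sketch, assuming 50 cells of which exactly 5 are already NA (an assumption chosen to match the 10% and 0.28 figures above; the actual count in dat1 is not shown in the thread):

```r
## Assumed counts, not taken from the thread's data
n_total  <- 50                 # cells in a 10 x 5 data frame
n_na     <- 5                  # assumed existing NA's (10%)
n_non_na <- n_total - n_na     # 45

## Quoted approach: blank out 20% of the non-NA cells
n_added_quoted <- round(0.20 * n_non_na)           # 9
prop_quoted <- (n_na + n_added_quoted) / n_total   # 0.28

## Counting against all cells instead reaches the 30% target
n_added_target <- round(0.30 * n_total) - n_na     # 10
prop_target <- (n_na + n_added_target) / n_total   # 0.3
```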
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374