Dear R experts,
I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am looking for a package that would help me impute the missing values using either the mean if numerical or the mode if character/factor.
I could perhaps use replacement like this:
df$var[is.na(df$var)] <- mean(df$var, na.rm = TRUE)
and go through all the many different variables of the dataset using the mean or mode for each, but I was wondering if there is a faster way, or if a package exists to automate this (by using the mode if a variable is a factor or character, or the mean if it is numeric)?
I have tried the package "dprep" because I wanted to use the function "ce.mimp", but unfortunately it is not available anymore.
Thank you for your help,
-francy
Mean or mode imputation for missing values
4 messages · Weidong Gu, francesca casalino, PIKAL Petr
In your case, it may not be sensible to simply fill missing values with the mean or mode, as multiple imputation has become the norm these days. For your specific question, na.roughfix in the randomForest package would do the work.
Weidong Gu
(in reply to the message from francesca casalino <francy.casalino at gmail.com> of Tue, Oct 11, 2011 at 8:11 AM)
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Yes, thank you Gu.
I am just trying to do this as a rough step and will try other
imputation methods which are more appropriate later.
I am just learning R, and was trying to write the for loop and
if-statement by hand, but something is going wrong.
This is what I have so far:
*****fake array:
age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)
for (var in 1:ncol(df_test)) {
  if (class(df_test$var)=="numeric") {
    df_test$var[is.na(df_test$var)] <- mean(df_test$var, na.rm = TRUE)
  } else if (class(df_test$var)=="character") {
    Mode(df_test$var[is.na(df_test$var)], na.rm = TRUE)
  }
}
Where 'Mode' is the function:
function (x, na.rm)
{
  xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  if (length(xmode) > 1)
    xmode <- ">1 mode"
  return(xmode)
}
It seems as if it is just ignoring the statements, though, without giving
any error. Does anybody have any idea what is going on?
Thank you very much for all the great help!
-f
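A small sketch of why the loop above appears to do nothing, assuming a cut-down version of the same fake data (the names res_dollar and res_bracket are made up for this illustration). The $ operator does not evaluate the loop variable; it looks for a column literally named "var", finds none, and returns NULL without any error, so neither class() test can be TRUE.

```r
# Cut-down version of the fake data from the message above
df_test <- data.frame(age = c(5, 8, 10, 12, NA))
var <- 1

# `$` takes the name literally: this asks for a column called "var"
res_dollar <- df_test$var
is.null(res_dollar)   # no such column, so the result is NULL
class(res_dollar)     # the class of NULL is "NULL", never "numeric" or "character"

# `[[` evaluates its argument, so this selects column number 1
res_bracket <- df_test[[var]]
class(res_bracket)
```

Because class(NULL) matches neither "numeric" nor "character", both branches are skipped on every iteration, which is consistent with the statements being silently ignored.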
Hi
for (var in 1:ncol(df_test)) {
if (class(df_test$var)=="numeric") {
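var goes from 1 to 3, but df_test$var does not use the value of var: $ looks
for a column literally named "var", which does not exist, so it returns NULL
and neither branch is entered. That is not what you intend.
You should use the [] selection operator instead. Note, however, that your call
to Mode does not assign its result to anything: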
for (var in 1:ncol(df_test)) {
  if (class(df_test[,var])=="numeric") {
    df_test[is.na(df_test[,var]), var] <- mean(df_test[,var], na.rm = TRUE)
  } else if (class(df_test[,var])=="character") {
    Mode(df_test[is.na(df_test[,var]),var], na.rm = TRUE)
  }
}
Warning message:
In max(xtab) : no non-missing arguments to max; returning -Inf
You can use debug(Mode) to see what is going on. I have no time to
inspect it and do not see any obvious flaw.
Regards
Petr
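A possible explanation for the warning, offered as a sketch rather than a definitive answer: Mode is being handed only the entries selected by is.na(), so table() has nothing to count and max() is called on an empty table. The version below is one way to put the pieces together. It passes the whole column to Mode, drops the NAs inside it, and assigns the result back. The helper impute_simple and the tie rule (take the first most frequent value instead of returning ">1 mode") are assumptions made for this illustration, not part of the original code.

```r
# Most frequent non-missing value of x; on ties, the first one encountered.
# (Assumption: this tie rule replaces the ">1 mode" marker of the original.)
Mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: mean for numeric columns, mode for factor/character.
impute_simple <- function(df) {
  for (j in seq_along(df)) {
    miss <- is.na(df[[j]])
    if (!any(miss)) next
    if (is.numeric(df[[j]])) {
      df[[j]][miss] <- mean(df[[j]], na.rm = TRUE)
    } else if (is.factor(df[[j]]) || is.character(df[[j]])) {
      df[[j]][miss] <- Mode(df[[j]])
    }
  }
  df
}

# Same fake data as in the thread
df_test <- data.frame(
  age = c(5, 8, 10, 12, NA),
  a   = factor(c("aa", "bb", NA, "cc", "cc")),
  b   = c("banana", "apple", "pear", "grape", NA),
  stringsAsFactors = FALSE
)
df_imputed <- impute_simple(df_test)
df_imputed
```

Using is.numeric(), is.factor() and is.character() instead of comparing class() to a string also covers integer columns and ordered factors, whose class() is not a single "numeric" or "factor" string. Column b has no repeated value, so the tie rule simply picks "banana"; whether that is acceptable depends on the data.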