Hi Dan,
Hi All,
many thanks for your help.
Please find enclosed my little function for your use:
-- cut --
#-------------------------------------------------------------------------------
# Module : t_count_na.R
# Author : Georg Maubach
# Date : 2016-05-24
# Update : 2016-05-25
# Description : Count NA's
# Source System : R 3.2.2 (64 Bit)
# Target System : R 3.2.2 (64 Bit)
# License : CC-BY-SA-NC
#--------1---------2---------3---------4---------5---------6---------7---------8
test <- FALSE
t_count_na <- function(dataset,
variables = "all") {
# Counts the number of NA within given set of veriables
#
# Args:
# dataset : Object with dimnames, e.g. data frame, data table.
# variables: Character vector with variable names.
#
# Operation:
# Adds the variable "na_count" to the given dataset containing the
count of
# NA's within the given variables
#
# Returns:
# Original dataset with variable "na_count" added.
#
# Error handling:
# None.
#
# Credits:
#
http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame
#
http://r.789695.n4.nabble.com/Creating-variables-on-the-fly-td4720034.html
version <- "2016-05-25"
if (identical(variables, "all")) {
variable_list <- names(dataset)
} else {
variable_list <- variables
}
dataset[["na_count"]] <- apply(dataset[,variable_list],
1,
function(x) sum(is.na(x)))
return(dataset)
}
#-------------------------------------------------------------------------------
test <- function(do_test = FALSE) {
cat("\n", "\n", "Test function t_count_na()", "\n", "\n")
# Example dataset
gene <-
c("ENSG00000208234","ENSG00000199674","ENSG00000221622","ENSG00000207604",
"ENSG00000207431","ENSG00000221312","ENSG00134940305","ENSG00394039490",
"ENSG09943004048")
hsap <- c(0,0,0, 0, 0, 0, 1,1, 1)
mmul <- c(NA,2 ,3, NA, 2, 1 , NA,2, NA)
mmus <- c(NA,2 ,NA, NA, NA, 2 , NA,3, 1)
rnor <- c(NA,2 ,NA, 1 , NA, 3 , NA,NA, 2)
cfam <- c(NA,2,NA, 2, 1, 2, 2,NA, NA)
ds_example <- data.frame(gene, hsap, mmul, mmus, rnor, cfam)
ds_example$gene <- as.character(ds_example$gene)
cat("\n", "\n", "Example dataset before function call", "\n", "\n")
print(ds_example)
cat("\n", "\n", "Function call", "\n", "\n")
ds_example <- t_count_na(dataset = ds_example,
variables = c("mmul", "mmus"))
cat("\n", "\n", "Example dataset after function call", "\n", "\n")
print(ds_example)
}
test(do_test = test)
# EOF .
-- cut --
Kind regards
Georg Maubach
Von: "Nordlund, Dan (DSHS/RDA)" <NordlDJ at dshs.wa.gov>
An: "r-help at r-project.org" <r-help at r-project.org>,
Datum: 24.05.2016 21:41
Betreff: Re: [R] Creating a data frame from scratch
Gesendet von: "R-help" <r-help-bounces at r-project.org>
I would probably write the function something like this:
t_count_na <- function(dataset,
variables = "all") {
if (identical(variables, "all")) {
variable_list <- names(dataset)
} else {
variable_list <- variables
}
apply(dataset[,variable_list], 1, function(x) sum(is.na(x)))
}
Hope this is helpful,
Dan
Daniel Nordlund, PhD
Research and Data Analysis Division
Services & Enterprise Support Administration
Washington State Department of Social and Health Services
-----Original Message-----
From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of
G.Maubach at gmx.de
Sent: Tuesday, May 24, 2016 11:55 AM
To: r-help at r-project.org
Subject: [R] Creating a data frame from scratch
Hi All,
I need to create a data frame from scratch and fill variables created on
the fly
with values. What I have so far:
-- schnipp --
# Example dataset
gene <-
c("ENSG00000208234","ENSG00000199674","ENSG00000221622","ENSG00000
207604",
"ENSG00000207431","ENSG00000221312","ENSG00134940305","ENSG0039403
9490",
"ENSG09943004048")
hsap <- c(0,0,0, 0, 0, 0, 1,1, 1)
mmul <- c(NA,2 ,3, NA, 2, 1 , NA,2, NA)
mmus <- c(NA,2 ,NA, NA, NA, 2 , NA,3, 1) rnor <- c(NA,2 ,NA, 1 , NA, 3 ,
NA,NA, 2) cfam <- c(NA,2,NA, 2, 1, 2, 2,NA, NA)
ds_example <- data.frame(gene, hsap, mmul, mmus, rnor, cfam)
ds_example$gene <- as.character(ds_example$gene)
t_count_na <- function(dataset,
variables = "all")
# credit: http://stackoverflow.com/questions/4862178/remove-rows-with-
nas-in-data-frame
{
ds_na <- data.frame()
# if variables = "all" create character vector of variable names
if (variables == "all") {
variable_list <- dimnames(dataset)[[ 2 ]]
}
# if a character vector with variable names is given
# to run the function on a defined set of selected variables
else {
variable_list <- variables
}
for (var in variable_list) {
new_name <- paste0("na_", var)
ds_na[[ new_name ]] <- as.data.frame(is.na(dataset[[ var ]]))
}
ds_na[[ "na_count" ]] <- rowSums(ds_na)
return(ds_na)
}
test <- t_count_na(dataset = ds_example, variables = c("mmul", "mmus"))
-- schnipp --
gives:
Error in `[[<-.data.frame`(`*tmp*`, new_name, value =
list(`is.na(dataset[[var]])` = c(TRUE, :
replacement has 9 rows, data has 0 In addition: Warning message:
In if (variables == "all") { :
the condition has length > 1 and only the first element will be used
My goal is to create a dataset from scratch on the fly which has the
same
amount of variables as the dataset ds_example plus a single variable
storing
the amount of NA's in a row for the given variables. This is the basis
for a
decious which cases to keep and which to drop.
I do not want to alter the base dataset like ds_example in the first
place nor
do I want to make a copy of the existing dataset due to memory
allocation.
The function shall also work with big data, e. g. datasets with more
than 1 GB
memory consumption.
I also do not want the newly created variables to be stored in the
original
data frame. They shall be separate.
A former similar solution worked: