Message-ID: <2BF20995-6E6C-4C38-9EB2-DDE1ABA4E57E@me.com>
Date: 2012-11-20T22:03:55Z
From: Brian Feeny
Subject: Removing columns that are na or constant

I have a dataset that has many columns which are NA or constant, and so I remove them like so:


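same <- sapply(dataset, function(.col) {
  # TRUE if the column is all NA or every value equals the first one;
  # isTRUE() turns the NA produced by a partly-NA column into FALSE
  all(is.na(.col)) || isTRUE(all(.col[1L] == .col))
})
dataset <- dataset[!same]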

This works great (I found it in the r-users list archive).

However, when I then sample my data like so:

testSize <- floor(nrow(x) * 10/100)
test <- sample(seq_len(nrow(x)), testSize)

train_data <- x[-test, ]
test_data <- x[test, -1]
test_class <- x[test, 1]

It is now possible for test_data or train_data to contain constant columns, even though none of those columns was constant in the full dataset.
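
To make the effect concrete, here is a minimal sketch on made-up data. The data frame `toy`, its column names, and the helper `is_same` are hypothetical and exist only for this illustration. The helper is a constant-column check like the one above, with `isTRUE()` added as an assumption so that a partly-NA column gives FALSE rather than NA. A fixed row index stands in for `sample()` so the outcome does not depend on the random draw.

```r
# Hypothetical toy data: `rare` differs from 0 in exactly one row.
toy <- data.frame(
  class = factor(rep(c("a", "b"), 5)),
  v1    = 1:10,
  rare  = c(1, rep(0, 9))
)

# Hypothetical helper: TRUE if a column is all NA or all one value.
is_same <- function(.col) {
  all(is.na(.col)) || isTRUE(all(.col[1L] == .col))
}

# On the full data no column is flagged.
full_same <- sapply(toy, is_same)

# Suppose the one row where `rare` differs lands in the test set.
test <- 1L
train_data <- toy[-test, ]

# In the remaining training rows `rare` is constant.
train_same <- sapply(train_data, is_same)
```

Here `full_same` is FALSE for every column, while `train_same` flags `rare`, because the only row that made it vary was moved out of the training rows.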

So my solution is to just re-run the lines that remove all constants. That is not a problem, but is this normal? Is this how I should
be handling this in R? Many models I am attempting to use (SVM, lda, etc.) don't like it if a column has all the same value.
As a beginner, this is how I am handling it in R, but I am looking for someone to sanity-check that what I am doing is sound.
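
For reference, one way the re-run could look is sketched below. It makes several assumptions that are not part of my actual code: `x` is made-up data, the class is in column 1 (as in the sampling code above), the check is computed on the training rows only, and the same column mask is then applied to the test rows so both pieces keep identical columns. `set.seed()` is there only to make the sketch repeatable.

```r
set.seed(1)  # only so the sketch is repeatable

# Hypothetical data; column 1 is the class, as in the sampling code above.
x <- data.frame(
  class = factor(rep(c("a", "b"), 10)),
  v1    = rnorm(20),
  rare  = c(1, rep(0, 19))
)

testSize <- floor(nrow(x) * 10/100)
test <- sample(seq_len(nrow(x)), testSize)

train_data <- x[-test, ]

# Re-run the constant check on the training rows only.
same <- sapply(train_data, function(.col) {
  all(is.na(.col)) || isTRUE(all(.col[1L] == .col))
})
same[1L] <- FALSE  # assumption: never drop the class column
keep <- !same

# Apply the same column mask to both pieces.
train_data <- train_data[keep]
test_data  <- x[test, keep, drop = FALSE][-1]  # predictors only
test_class <- x[test, 1]
```

With this layout the predictor columns of `train_data` and `test_data` always match, whichever rows `sample()` happens to pick. Whether a column that is constant only in the test rows also needs removing depends on the model, which is part of what I am asking.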

Brian