Reply: Re: selecting columns from a data frame or data table by type, ie, numeric, integer
<G.Maubach at weinwolf.de>
on Wed, 4 May 2016 08:30:50 +0200 writes:
Hi All,
Hi Carl,
I am not sure if this is useful to you, but I followed your conversation
and thought of you when I read this:
for (i in 1:ncol(dataset)) {
if(class(dataset) == "character|numeric|factor|or whatsoever") {
dataset[, i] <- as.factor(dataset[, i])
}
}
Ouch -- so many problems in such a short piece of R code !!!
Source: Zumel, Nina / Mount, John: Practical Data Science with R, Manning Publications: Shelter Island, 2014, Chapter 2: Loading data into R, p. 25
Sorry, but after reading the above, I'd strongly recommend getting
better books about R...
{{maybe do not take those containing "data science" ;-)}}
Compared to the nice and efficient solution of Bill Dunlap,
the above is really bad-bad-bad in at least four ways :
0) The way you write it above, you cannot use it,
<string> == "variant1|variant2|..."
is pseudocode and does not really work
1) Note the missing "[, i]" in the 2nd line: It should be
if(class(dataset[, i]) ...
2) A for loop changing each column at a time is really slow for
largish data sets
3) [last but not at all least!]
Please ... many of you readers, do learn:
Using checks such as
if ( class(x) == "numeric" )
are (almost) always wrong by design !!!
Instead you really should (almost) always use
if(inherits(x, "numeric"))
Why? Because classes in R (S3 or S4) can *extend* other classes.
Example: Many of you know that after fm <- glm(...)
class(fm) is c("glm", "lm") and so
> if(class(fm) == "lm")
+ "yes"
Warning message:
In if (class(fm) == "lm") "yes" :
the condition has length > 1 and only the first element will be used
Similarly, in your case
y <- 1:10
class(y) <- c("myNumber", "numeric")
when that 'y' is a column in your data frame,
the test for if(class(dataset[,i]) == "numeric") will *not*
work but actually produce the above warning.
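A minimal sketch of the S3 point above, assuming the built-in `cars` data set for the `glm()` fit (the model formula is only an illustration, not part of the original discussion). It shows that comparing `class(x)` with `==` yields one value per class, while `inherits()` always gives a single TRUE or FALSE:

```r
# S3 object whose class vector has length 2, as in the text above.
y <- 1:10
class(y) <- c("myNumber", "numeric")

# '==' compares element-wise, so the result has one entry per class.
cmp <- class(y) == "numeric"
print(cmp)

# inherits() returns a single logical, so it is safe inside if().
if (inherits(y, "numeric")) {
  message("y extends 'numeric'")
}

# The same holds for a fitted glm, whose class is c("glm", "lm").
# 'cars' and the formula are illustrative assumptions.
fm <- glm(dist ~ speed, data = cars)
print(inherits(fm, "lm"))
```

Because `inherits()` never returns more than one value, the `if()` condition cannot have length greater than one.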
However, one could also have had
Num <- setClass("Num", contains="numeric")
N <- Num(1:10)
> Num <- setClass("Num", contains="numeric")
> N <- Num(1:10)
> N
An object of class "Num"
[1] 1 2 3 4 5 6 7 8 9 10
> if(class(N) == "numeric") "yes" else "no"
[1] "no"
>
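For the S4 case just shown, a short sketch of checks that do respect class extension. It reuses the `Num` class from above; `is()` comes from the methods package, which is loaded explicitly here in case the code is run non-interactively:

```r
library(methods)  # setClass() and is() live in the methods package

Num <- setClass("Num", contains = "numeric")
N <- Num(1:10)

# The class attribute is only "Num", so a string comparison says "no".
same_string <- class(N) == "numeric"

# Both of these follow the 'contains' relationship instead.
via_is       <- is(N, "numeric")
via_inherits <- inherits(N, "numeric")

print(c(same_string = same_string, is = via_is, inherits = via_inherits))
```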
I hope that many of the readers --- including *MANY* authors of
R packages !! --- have understood the above and will fix their R
code -- and even more their books where applicable !!
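As one possible loop-free rewrite of the quoted book snippet, here is a sketch under assumptions: the example `dataset` and its column names are made up for illustration, and the intent is taken to be "convert character columns to factors". It is not the book's or Bill Dunlap's code.

```r
# Hypothetical example data; only 'dataset' mirrors the name in the quoted loop.
dataset <- data.frame(id    = 1:3,
                      name  = c("a", "b", "a"),
                      score = c(1.5, 2.5, 3.5),
                      stringsAsFactors = FALSE)

# One TRUE/FALSE per column; vapply() guarantees the result type.
is_chr <- vapply(dataset, is.character, FUN.VALUE = NA)

# Convert only the selected columns, without an explicit for loop.
dataset[is_chr] <- lapply(dataset[is_chr], as.factor)

str(dataset)
```

The type test uses a predicate function (`is.character`) rather than a comparison against `class()`, so it does not run into the length-greater-than-one problem described above.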
Martin Maechler,
ETH Zurich & R Core Team
This way you can select variables of a certain class only and do
transformations. I found that this approach is not applicable if used with
statistical functions like head(). Transformations worked fine for me.
I found reading the above given source worthwhile.
Kind regards
Georg
PS: I am not related to the above given authors. I am just a reader
reporting on - at least to me - a valuable resource.
From: Carl Sutton via R-help <r-help at r-project.org>
To: William Dunlap <wdunlap at tibco.com>,
Cc: "r-help at r-project.org" <r-help at r-project.org>
Date: 29.04.2016 22:08
Subject: Re: [R] selecting columns from a data frame or data table
by type, ie, numeric, integer
Sent by: "R-help" <r-help-bounces at r-project.org>
Thank you Bill Dunlap. So simple I never tried that approach. Tried
dozens of others though, read manuals till I was getting headaches, and of
course the answer was simple when one is competent. Learning, it's a
struggle, but slowly getting there.
Thanks again
Carl Sutton CPA
On Friday, April 29, 2016 10:50 AM, William Dunlap <wdunlap at tibco.com>
wrote:
> dt1[ vapply(dt1, FUN=is.numeric, FUN.VALUE=NA) ]
    a   c
1   1 1.1
2   2 1.0
...
10 10 0.2
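A self-contained sketch of the same idea, assuming a plain data.frame built from the test data in the original question (the names `df1`, `keep`, and `num_only` are illustrative):

```r
# Test data from the question, stored in a data.frame (an assumption;
# the question itself used data.table).
a <- 1:10
b <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
c <- seq(1.1, 0.2, length.out = 10)
df1 <- data.frame(a, b, c, stringsAsFactors = FALSE)

# One logical per column; is.numeric() is TRUE for integer and double alike.
keep <- vapply(df1, FUN = is.numeric, FUN.VALUE = NA)

# Indexing with a single logical vector selects columns of a data.frame.
num_only <- df1[keep]
print(names(num_only))
```

With a data.table, a single logical index selects rows rather than columns, so there the column names can be passed instead, for example `dt1[, names(dt1)[keep], with = FALSE]`.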
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Fri, Apr 29, 2016 at 9:19 AM, Carl Sutton via R-help
<r-help at r-project.org> wrote:
Good morning RGuru's
I have a data frame of 575 columns. I want to extract only those columns
that are numeric(double) or integer to do some machine learning with. I
have searched the web for a couple of days (off and on) and have not found
anything that shows how to do this. Lots of ways to extract rows, but
not columns. I have attempted to use "(x == y)" indices extraction method
but that threw error that == was for atomic vectors and lists, and I was
doing this on a data frame.
My test code is below
# a technique to get column classes
library(data.table)
a <- 1:10
b <- c("a","b","c","d","e","f","g","h","i","j")
c <- seq(1.1, .2, length = 10)
dt1 <- data.table(a,b,c)
str(dt1)
col.classes <- sapply(dt1, class)
head(col.classes)
dt2 <- subset(dt1, typeof = "double" | "numeric")
str(dt2)
dt2 # not subset
dt2 <- dt1[, list(typeof = "double")]
str(dt2)
class_data <- dt1[,sapply(dt1,is.integer) | sapply(dt1, is.numeric)]
class_data
sum(class_data)
typeof(class_data)
names(class_data)
str(class_data)
Any help is appreciated
Carl Sutton CPA