Identifying and Removing NA Columns and factor Columns with more than x Levels

If d is your data frame

i1 <- sapply(d, function(x) is.factor(x) && length(levels(x)) > 31)
## a vector of length ncol(d) that is TRUE only for factor columns
## with >31 levels

i2 <- sapply(d, function(x) any(is.na(x)))
## You can figure it out.
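
A minimal sketch of how the two logical vectors above might be used, both to keep only the flagged columns and to drop them. The data frame contents and the names max_levels, only_wide_factors, only_na_cols and cleaned are made up for illustration and are not from the thread; drop = FALSE is added so the result stays a data frame even when a single column is selected.

```r
# Hypothetical stand-in for d: one column with an NA, one factor with many levels.
d <- data.frame(
  id    = 1:40,
  score = c(NA, seq_len(39)),               # contains one NA
  code  = factor(sprintf("c%02d", 1:40)),   # 40 levels
  group = factor(rep(c("a", "b"), 20))      # 2 levels
)

max_levels <- 31  # threshold used in the thread

# TRUE only for factor columns with more than max_levels levels
i1 <- sapply(d, function(x) is.factor(x) && length(levels(x)) > max_levels)
# TRUE for columns that contain at least one NA
i2 <- sapply(d, function(x) any(is.na(x)))

only_wide_factors <- d[, i1, drop = FALSE]          # keep flagged factor columns
only_na_cols      <- d[, i2, drop = FALSE]          # keep columns with NAs
cleaned           <- d[, !(i1 | i2), drop = FALSE]  # drop both kinds

names(cleaned)
```

Combining the two conditions with | before negating removes both kinds of column in one step; indexing with !i1 or !i2 alone handles each case separately.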

-- Bert
Hi,

How do you subset a dataframe so that you only have columns:

1.       that contain one or more NAs?

2.       that contain factors with greater than or equal to 32 levels?

How do you remove from a dataframe columns**

3.       with one or more NA's?

4.       that contain factors with greater than or equal to 32 levels?

** I know how to remove columns at a basic level but I am trying to figure out a more efficient way of performing these particular tasks (my data set has 60 columns).
For NAs I essentially used summary(mtcars), manually made a note of where NAs appeared, and then used:
mtcars1<-mtcars1[,!(names(mtcars1)%in% c("hp","wt","vs"))]
I did something similar for factors with more than x levels, only I used str(mtcars) to help me identify them.
BTW I know mtcars doesn't have any of these issues. I just used it as a quick reference.

Dan


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
Hi,
For the first part of each of the two questions, do this:
dat1 <- data.frame(
  Temp  = c(5,10,9,15,NA,14,25,21,24,23,21,24,35,35,36,34,32,33),
  Temp2 = c(5,10,9,15,15,14,25,21,24,23,21,24,35,35,36,34,32,33),
  Month = rep(c("January","February","March","April","May","June"), each=3),
  Roof  = as.factor(rep(1:6, times=3)))

# columns that contain at least one NA
dat1[,colMeans(is.na(dat1))!=0]
# columns with no NAs
dat1[,colMeans(is.na(dat1))==0]
#or
dat1[,complete.cases(t(dat1))]

#Second part of the two questions (4 levels here; in your case, it is 32).
# columns with 4 or more levels
dat1[unlist(lapply(dat1,function(x) length(levels(x))>=4))]
#or,
dat1[sapply(dat1,function(x) length(levels(x))>=4)]

#and columns with fewer than 4 levels
dat1[sapply(dat1,function(x) length(levels(x))<4)]
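
Two details of the code above may be worth a small sketch. First, since R 4.0.0 data.frame() no longer converts strings to factors by default, so a column such as Month has no levels unless stringsAsFactors = TRUE is requested. Second, indexing with [, cond] returns a bare vector when only one column matches, unless drop = FALSE is given. The data below is a shortened, made-up variant of dat1, and the names dat, n_levels, has_na, with_na and few_lvl are illustrative only.

```r
# Shortened variant of dat1; stringsAsFactors = TRUE is an explicit choice
# so that Month is a factor regardless of the R version.
dat <- data.frame(
  Temp  = c(5, NA, 9, 15, 14, 25),
  Temp2 = c(5, 10, 9, 15, 14, 25),
  Month = c("January", "February", "March", "April", "May", "June"),
  Roof  = factor(rep(1:3, times = 2)),
  stringsAsFactors = TRUE
)

n_levels <- sapply(dat, nlevels)        # 0 for columns that are not factors
has_na   <- colSums(is.na(dat)) > 0     # TRUE where a column holds any NA

n_levels
has_na

# drop = FALSE keeps a one-column result as a data frame
with_na <- dat[, has_na, drop = FALSE]
few_lvl <- dat[, n_levels < 4, drop = FALSE]
```

Using nlevels() is equivalent to length(levels(x)) and reads a little more directly; note that, like the code above, the "fewer than 4 levels" filter also keeps non-factor columns because they report zero levels.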

I guess you wanted these as separate solutions?
A.K.

----- Original Message -----
From: "Lopez, Dan" <lopez235 at llnl.gov>
To: "R help (r-help at r-project.org)" <r-help at r-project.org>
Cc: 
Sent: Thursday, August 30, 2012 11:38 AM
Subject: [R] Identifying and Removing NA Columns and factor Columns with more than x Levels

Hi Bert,

Thanks! These worked perfectly.

mydata<-mydata[,!(i1)]

Dan

-----Original Message-----
From: Bert Gunter [mailto:gunter.berton at gene.com] 
Sent: Thursday, August 30, 2012 8:54 AM
To: Lopez, Dan
Cc: R help (r-help at r-project.org)
Subject: Re: [R] Identifying and Removing NA Columns and factor Columns with more than x Levels

Hello everyone,

a hopefully easy to solve problem from an R novice...

I am trying to calculate a number of correlation matrices that should finally be combined in a three-dimensional array.
Here is my code, with a built-in R dataset as an example.

-----------------------------------

## Creating an array of correlation matrices from a rolling-window application

TS <- EuStockMarkets
# Load internal dataset

n <- 30
# Choose size of rolling time window

T <- c(1:nrow(TS))
# Define number of steps

X <- array(data = NA, dim = c(ncol(TS), ncol(TS), nrow(TS)))
# Create data array

for (t in T[1:(length(T)-n)]){
 X[t] = cor(TS[t:(t+n), 1:ncol(TS)], use = "pairwise.complete.obs")
}
# Calculate correlation matrices

---------------------------------

Unfortunately, I only get a warning that the dimensions do not fit... Where is the mistake?

THANKS A LOT!

Nico
Hello,

You create a 3d array X and then index it as if it were 1d.
Correction:

TS <- EuStockMarkets

[...etc...]

for (t in T[1:(length(T)-n)]){
  X[ , , t] <- cor(TS[t:(t+n), 1:ncol(TS)], use = "pairwise.complete.obs")
}
# Calculate correlation matrices

Also, 't' and 'T' are not good names: the first is R's matrix transpose 
function and the second is another name for TRUE.
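
Putting the correction and the naming advice together, a complete version might look like the sketch below. The names prices, width, starts and cors are substitutes chosen here, not from the thread. As an assumption that differs from the original, the third dimension is sized to the number of windows rather than nrow(TS), so no all-NA slices are left at the end.

```r
prices <- EuStockMarkets                  # built-in dataset (datasets package)
width  <- 30                              # plays the role of n in the question
starts <- seq_len(nrow(prices) - width)   # window start rows

# One square correlation matrix per window start
cors <- array(NA_real_,
              dim = c(ncol(prices), ncol(prices), length(starts)),
              dimnames = list(colnames(prices), colnames(prices), NULL))

for (i in starts) {
  # rows i..(i + width) give width + 1 observations, as in the original loop
  cors[, , i] <- cor(prices[i:(i + width), ], use = "pairwise.complete.obs")
}

dim(cors)
```

Each slice cors[, , i] is then the correlation matrix for the window starting at row i. If a window of exactly width observations is wanted, the row range would be i:(i + width - 1) instead.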

Hope this helps,

Rui Barradas

On 31-08-2012 17:24, Max Frisch wrote: