Having trouble converting a dataframe of character vectors to factors

Please re-read ?sapply and pay particular attention to the "simplify" argument.

The following should help explain the issues:
> z <- data.frame(a=letters[1:3],b=letters[4:6],stringsAsFactors=FALSE)
> sapply(z,class)
          a           b
"character" "character"
> z1 <- sapply(z,as.factor)
> sapply(z1,class)
          a           b           c           d           e           f
"character" "character" "character" "character" "character" "character"
> z2 <- sapply(z,factor, simplify = FALSE)
> sapply(z2,class)
       a        b
"factor" "factor"
> z3 <- lapply(z,factor)
> sapply(z3,class)
       a        b
"factor" "factor"
> z3
$a
[1] a b c
Levels: a b c

$b
[1] d e f
Levels: d e f

## Note that both z2 and z3 are lists, and would have to be converted back to data frames.
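
The simplification issue can be checked directly. The following is only a minimal sketch that reuses the toy data frame z from the example; the names m and z_fac are introduced here purely for illustration. It shows that sapply() collapses the result to a character matrix, and that the list returned by lapply() can be turned back into a data frame with as.data.frame():

```r
# Toy data frame of character columns, as in the example above
z <- data.frame(a = letters[1:3], b = letters[4:6], stringsAsFactors = FALSE)

# sapply() simplifies the equal-length results to a matrix;
# a matrix cannot hold factors, so the values come back as character
m <- sapply(z, as.factor)
is.matrix(m)
is.character(m)

# lapply() keeps a list of factors; as.data.frame() rebuilds a data frame
z_fac <- as.data.frame(lapply(z, factor))
sapply(z_fac, class)
```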

-- Bert
R Experts,

I have a dataframe made up of character vectors--these are results from survey questions. I need to convert them to factors.

I tried the following which did not work:
scs2<-sapply(scs2,as.factor)
also this didn't work:
scs2<-sapply(scs2,function(x) as.factor(x))

After doing either of the above I end up with
str(scs2)
chr [1:10, 1:10] "very important" "very important" "very important" "very important" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:10] "Q1_1" "Q1_2" "Q1_3" "Q1_4" ...

class(scs2)
"matrix"

But when I do it one at a time it works:
scs2$Q1_1<-as.factor(scs2$Q1_1)
scs2$Q1_2<- as.factor(scs2$Q1_2)

What am I doing wrong?  How do I accomplish this with sapply or similar function?

Data for reproducibility:

scs2<-structure(list(Q1_1 = c("very important", "very important", "very important",
"very important", "very important", "very important", "very important",
"somewhat important", "important", "very important"), Q1_2 = c("important",
"somewhat important", "very important", "important", "important",
"very important", "somewhat important", "somewhat important",
"very important", "very important"), Q1_3 = c("very important",
"important", "very important", "very important", "important",
"very important", "very important", "somewhat important", "not important",
"important"), Q1_4 = c("very important", "important", "very important",
"very important", "important", "important", "important", "very important",
"somewhat important", "important"), Q1_5 = c("very important",
"not important", "important", "very important", "not important",
"important", "somewhat important", "important", "somewhat important",
"not important"), Q1_6 = c("very important", "not important",
"important", "very important", "somewhat important", "very important",
"very important", "very important", "important", "important"),
    Q1_7 = c("very important", "somewhat important", "important",
    "somewhat important", "important", "important", "very important",
    "very important", "somewhat important", "not important"),
    Q2 = c("Somewhat", "Very Much", "Somewhat", "Very Much",
    "Very Much", "Very Much", "Very Much", "Very Much", "Very Much",
    "Very Much"), Q3 = c("yes", "yes", "yes", "yes", "yes", "yes",
    "yes", "yes", "yes", "yes"), Q4 = c("None", "None", "None",
    "None", "Confirmed Field of Study", "Confirmed Field of Study",
    "Confirmed Field of Study", "None", "None", "None")), .Names = c("Q1_1",
"Q1_2", "Q1_3", "Q1_4", "Q1_5", "Q1_6", "Q1_7", "Q2", "Q3", "Q4"
), row.names = c(78L, 46L, 80L, 196L, 188L, 197L, 39L, 195L,
172L, 110L), class = "data.frame")

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
Hi Bert,

Thanks for drawing my attention to the "simplify" argument and for the examples. I understand now.

Thanks.
Dan

-----Original Message-----
From: Bert Gunter [mailto:gunter.berton at gene.com] 
Sent: Wednesday, February 20, 2013 4:25 PM
To: Lopez, Dan
Cc: R help (r-help at r-project.org)
Subject: Re: [R] Having trouble converting a dataframe of character vectors to factors

> scs2<-data.frame(lapply(scs2, factor))
Calling data.frame() on the output of lapply() can result in changing column names
and will drop attributes that the input data.frame may have had.  I prefer to modify
the original data.frame instead of making a new one from scratch to avoid these problems.

Also, calling factor() on a factor will drop any unused levels, which you may not want
to do.  Calling as.factor will not.

Compare the following three methods

  f1 <- function (dataFrame) {
      dataFrame[] <- lapply(dataFrame, factor)
      dataFrame
  }
  f2 <- function (dataFrame) {
      dataFrame[] <- lapply(dataFrame, as.factor)
      dataFrame
  }
  f3 <- function (dataFrame) {
      data.frame(lapply(dataFrame, factor))
  }

on the following data.frame
  x <- data.frame(stringsAsFactors=FALSE, check.names=FALSE,
               "No/Yes" = factor(c("Yes","Yes","Yes"), levels=c("No","Yes")),
               "Size" = ordered(c("Small","Large","Medium"), levels=c("Small","Medium","Large")),
               "Name" = c("Adam","Bill","Chuck"))
  attr(x, "Date") <- as.POSIXlt("2013-02-21")

  > str(x)
  'data.frame':   3 obs. of  3 variables:
   $ No/Yes: Factor w/ 2 levels "No","Yes": 2 2 2
   $ Size  : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
   $ Name  : chr  "Adam" "Bill" "Chuck"
   - attr(*, "Date")= POSIXlt, format: "2013-02-21"

  > str(f1(x)) # drops unused levels
  'data.frame':   3 obs. of  3 variables:
   $ No/Yes: Factor w/ 1 level "Yes": 1 1 1
   $ Size  : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
   $ Name  : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
   - attr(*, "Date")= POSIXlt, format: "2013-02-21"
  > str(f2(x))
  'data.frame':   3 obs. of  3 variables:
   $ No/Yes: Factor w/ 2 levels "No","Yes": 2 2 2
   $ Size  : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
   $ Name  : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
   - attr(*, "Date")= POSIXlt, format: "2013-02-21"
  > str(f3(x)) # mangles column names, drops unused levels, drops Date attribute
  'data.frame':   3 obs. of  3 variables:
   $ No.Yes: Factor w/ 1 level "Yes": 1 1 1
   $ Size  : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
   $ Name  : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
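
When only some columns are character and the rest should be left alone, the in-place approach of f2 can be restricted to the character columns. The following is a minimal sketch under that assumption; the data frame df and the index is_chr are hypothetical names that do not appear elsewhere in this thread:

```r
# Hypothetical example data (not from the thread):
# two character columns and one numeric column
df <- data.frame(answer = c("yes", "no", "yes"),
                 rating = c("important", "very important", "important"),
                 score  = c(1.5, 2.0, 3.5),
                 stringsAsFactors = FALSE)

# Logical index marking which columns are character vectors
is_chr <- vapply(df, is.character, logical(1))

# Replace only those columns, modifying the original data frame in place
df[is_chr] <- lapply(df[is_chr], as.factor)

str(df)
```

The numeric column is untouched, and because the original data frame is modified rather than rebuilt, the column names are not passed through data.frame() again.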

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Mark Lamias
Sent: Wednesday, February 20, 2013 6:51 PM
To: Daniel Lopez; R help (r-help at r-project.org)
Subject: Re: [R] Having trouble converting a dataframe of character vectors to factors

How about this?

scs2<-data.frame(lapply(scs2, factor))

________________________________
 From: "Lopez, Dan" <lopez235 at llnl.gov>
To: "R help (r-help at r-project.org)" <r-help at r-project.org>
Sent: Wednesday, February 20, 2013 7:09 PM
Subject: [R] Having trouble converting a dataframe of character vectors to factors
