Having trouble converting a dataframe of character vectors to factors

6 messages · Bert Gunter, Mark Lamias, Lopez, Dan +1 more

#
Please re-read ?sapply and pay particular attention to the "simplify" argument.

The following should help explain the issues:
a           b
"character" "character"
a           b           c           d           e           f
"character" "character" "character" "character" "character" "character"
a        b
"factor" "factor"
a        b
"factor" "factor"
$a
[1] a b c
Levels: a b c

$b
[1] d e f
Levels: d e f

## Note that both z2 and z3 are lists, and would have to be converted
back to data frames.
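
The commands that produced the output above did not survive in this archive copy. A minimal sketch of calls that behave the same way is given here; the data frame z and its contents are assumptions chosen to match the printed levels, not necessarily the original code.

```r
# Assumed example data: two character columns (names and values are hypothetical).
z <- data.frame(a = c("a", "b", "c"), b = c("d", "e", "f"),
                stringsAsFactors = FALSE)

sapply(z, class)     # both columns are "character"

# sapply() simplifies its result to a matrix by default; the factors are
# collapsed into a character matrix, so the factor class is lost.
z1 <- sapply(z, factor)
class(z1)
sapply(z1, class)    # one entry per matrix element, all "character"

# lapply() never simplifies: the result is a list of factors.
z2 <- lapply(z, factor)
sapply(z2, class)

# sapply() with simplify = FALSE also returns a list of factors.
z3 <- sapply(z, factor, simplify = FALSE)
sapply(z3, class)

# Assigning into z[] keeps the data.frame structure while replacing the columns.
z[] <- lapply(z, factor)
str(z)
```

The last step is one way to get back to a data frame without calling data.frame() on the list.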

-- Bert
On Wed, Feb 20, 2013 at 4:09 PM, Lopez, Dan <lopez235 at llnl.gov> wrote:

  
    
#
Hi Bert,

Thanks for drawing my attention to the "simplify" argument and for the examples. I understand now.

Thanks.
Dan


-----Original Message-----
From: Bert Gunter [mailto:gunter.berton at gene.com] 
Sent: Wednesday, February 20, 2013 4:25 PM
To: Lopez, Dan
Cc: R help (r-help at r-project.org)
Subject: Re: [R] Having trouble converting a dataframe of character vectors to factors

#
Calling data.frame() on the output of lapply() can change column names
and drops any attributes that the input data.frame may have had.  I prefer to modify
the original data.frame in place instead of making a new one from scratch, which avoids these problems.

Also, calling factor() on a factor drops any unused levels, which you may not want
to do.  Calling as.factor() does not: it returns an existing factor unchanged.
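
The difference is easiest to see on a single vector. A small sketch, using a made-up factor f that has one unused level:

```r
# Hypothetical factor: "No" is a declared level but never occurs in the data.
f <- factor(c("Yes", "Yes", "Yes"), levels = c("No", "Yes"))

levels(factor(f))      # "Yes"      : the unused level is dropped
levels(as.factor(f))   # "No" "Yes" : f is already a factor and is returned as is
```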

Compare the following three methods:

  f1 <- function (dataFrame) {
      dataFrame[] <- lapply(dataFrame, factor)
      dataFrame
  }
  f2 <- function (dataFrame) {
      dataFrame[] <- lapply(dataFrame, as.factor)
      dataFrame
  }
  f3 <- function (dataFrame) {
      data.frame(lapply(dataFrame, factor))
  }

on the following data.frame:
  x <- data.frame(stringsAsFactors=FALSE, check.names=FALSE,
               "No/Yes" = factor(c("Yes","Yes","Yes"), levels=c("No","Yes")),
               "Size" = ordered(c("Small","Large","Medium"), levels=c("Small","Medium","Large")),
               "Name" = c("Adam","Bill","Chuck"))
  attr(x, "Date") <- as.POSIXlt("2013-02-21")


  > str(x)
  'data.frame':   3 obs. of  3 variables:
   $ No/Yes: Factor w/ 2 levels "No","Yes": 2 2 2
   $ Size  : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
   $ Name  : chr  "Adam" "Bill" "Chuck"
   - attr(*, "Date")= POSIXlt, format: "2013-02-21"

  > str(f1(x)) # drops unused levels
  'data.frame':   3 obs. of  3 variables:
   $ No/Yes: Factor w/ 1 level "Yes": 1 1 1
   $ Size  : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
   $ Name  : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
   - attr(*, "Date")= POSIXlt, format: "2013-02-21"
  > str(f2(x))
  'data.frame':   3 obs. of  3 variables:
   $ No/Yes: Factor w/ 2 levels "No","Yes": 2 2 2
   $ Size  : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
   $ Name  : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
   - attr(*, "Date")= POSIXlt, format: "2013-02-21"
  > str(f3(x)) # mangles column names, drops unused levels, drops Date attribute
  'data.frame':   3 obs. of  3 variables:
   $ No.Yes: Factor w/ 1 level "Yes": 1 1 1
   $ Size  : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 3 2
   $ Name  : Factor w/ 3 levels "Adam","Bill",..: 1 2 3
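
If only the character columns need converting, the same in-place assignment can be restricted to them. This is a sketch rather than anything posted in the thread; the helper name charToFactor is made up, and the data frame is a cut-down version of x.

```r
# Hypothetical helper (not from the thread): convert only character columns,
# leaving existing factors, their levels, and the column names untouched.
charToFactor <- function(dataFrame) {
    isChar <- vapply(dataFrame, is.character, logical(1))
    dataFrame[isChar] <- lapply(dataFrame[isChar], factor)
    dataFrame
}

x <- data.frame(stringsAsFactors = FALSE, check.names = FALSE,
                "No/Yes" = factor(c("Yes", "Yes", "Yes"), levels = c("No", "Yes")),
                "Name" = c("Adam", "Bill", "Chuck"))
attr(x, "Date") <- as.POSIXlt("2013-02-21")

y <- charToFactor(x)
str(y)
```

Because the existing factor column is never passed through factor(), its unused level is kept, and because the original data.frame is modified rather than rebuilt, the column names and the Date attribute carry over.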

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com