Message-ID: <50560BD7.5060306@sapo.pt>
Date: 2012-09-16T17:26:47Z
From: Rui Barradas
Subject: multi-column factor
In-Reply-To: <87k3vt3ohh.fsf@gnu.org>

Hello,

The obvious simplification is to call union() only once; with 10M rows
that should save time.
I then asked myself whether unique() on the concatenated columns might be
faster still.
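
For plain character vectors, union(a, b) and unique(c(a, b)) give the same values in the same order, which is why swapping one for the other does not change the resulting levels. A quick check on the small example from the question (the vectors a and b below are just the two columns written out by hand):

```r
# The two columns from the original question, as plain character vectors
a <- c("a", "b", "c")
b <- c("b", "c", "d")

# Both expressions should produce the same level set, in the same order
lev_union  <- union(a, b)
lev_unique <- unique(c(a, b))
identical(lev_union, lev_unique)
```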


# f1: the approach from the question, union() is computed twice
f1 <- function(x){
     x[[1]] <- factor(x[[1]], levels = union(x[[1]], x[[2]]))
     x[[2]] <- factor(x[[2]], levels = union(x[[1]], x[[2]]))
     x
}

# f2: compute the shared level set once with union(), then reuse it
f2 <- function(x){
     levels <- union(x[[1]], x[[2]])
     x[[1]] <- factor(x[[1]], levels = levels)
     x[[2]] <- factor(x[[2]], levels = levels)
     x
}

# f3: same as f2, but build the level set with unique() on both columns
f3 <- function(x){
     levels <- unique(c(x[[1]], x[[2]]))
     x[[1]] <- factor(x[[1]], levels = levels)
     x[[2]] <- factor(x[[2]], levels = levels)
     x
}

set.seed(5467)
n <- 1e7
z <- data.frame(a = sample(letters[1:3], n, TRUE),
     b = sample(letters[2:4], n, TRUE),
     stringsAsFactors=FALSE)

t1 <- system.time(z1 <- f1(z))
t2 <- system.time(z2 <- f2(z))
t3 <- system.time(z3 <- f3(z))

identical(z1, z2) #[1] TRUE
identical(z1, z3) #[1] TRUE

rbind(t1, t2, t3)
    user.self sys.self elapsed user.child sys.child
t1      2.55     0.47    3.01         NA        NA
t2      1.57     0.29    1.87         NA        NA
t3      1.51     0.26    1.78         NA        NA
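
The f3 idea can be extended to any number of columns. What follows is only a sketch, assuming every selected column holds character data; shared_factor and its cols argument are hypothetical names, not something from the original question, and it has not been timed on the 10M-row data.

```r
# Hypothetical helper: convert several columns to factors that all
# share one level set, computed once.
shared_factor <- function(x, cols = names(x)){
     # distinct values across the chosen columns, in order of first appearance
     lev <- unique(unlist(lapply(x[cols], as.character), use.names = FALSE))
     # convert each chosen column using the same levels
     x[cols] <- lapply(x[cols], factor, levels = lev)
     x
}

# Small example from the question
z_small <- data.frame(a = c("a", "b", "c"),
     b = c("b", "c", "d"),
     stringsAsFactors = FALSE)
zf <- shared_factor(z_small)
str(zf)
```

For two columns this should give the same result as f3; the only difference is that the columns are picked by name rather than by position.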

Hope this helps,

Rui Barradas

On 16-09-2012 17:46, Sam Steingold wrote:
> I have a data frame with columns which draw on the same underlying
> universe, so I want them to be factors with the same level set:
>
> --8<---------------cut here---------------start------------->8---
>> z <- data.frame(a=c("a","b","c"),b=c("b","c","d"),stringsAsFactors=FALSE)
>> str(z)
> 'data.frame':	3 obs. of  2 variables:
>   $ a: chr  "a" "b" "c"
>   $ b: chr  "b" "c" "d"
>> z$a <- factor(z$a,levels=union(z$a,z$b))
>> z$b <- factor(z$b,levels=union(z$a,z$b))
>> str(z)
> 'data.frame':	3 obs. of  2 variables:
>   $ a: Factor w/ 4 levels "a","b","c","d": 1 2 3
>   $ b: Factor w/ 4 levels "a","b","c","d": 2 3 4
> --8<---------------cut here---------------end--------------->8---
> Is factor(z$a,levels=union(z$a,z$b)) the right way to handle this?
> Maybe there is a better way to extract levels than union()?
> (bear in mind that I have ~10M rows and ~1M levels, so performance is an
> issue).
>
> Thanks!
>