
multi-column factor

Hello,

The obvious simplification is to call union() only once. With 10M rows
that should save time.
I then wondered whether unique() on the concatenated columns would be
faster still.


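# f1: original approach, union() is computed twice
# (the second call sees x[[1]] already converted to a factor)
f1 <- function(x){
     x[[1]] <- factor(x[[1]], levels = union(x[[1]], x[[2]]))
     x[[2]] <- factor(x[[2]], levels = union(x[[1]], x[[2]]))
     x
}

# f2: compute the common levels once with union() and reuse them
f2 <- function(x){
     levels <- union(x[[1]], x[[2]])
     x[[1]] <- factor(x[[1]], levels = levels)
     x[[2]] <- factor(x[[2]], levels = levels)
     x
}

# f3: same as f2, but with unique() on the concatenated columns
f3 <- function(x){
     levels <- unique(c(x[[1]], x[[2]]))
     x[[1]] <- factor(x[[1]], levels = levels)
     x[[2]] <- factor(x[[2]], levels = levels)
     x
}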
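
f2 and f3 can only give the same result if union(a, b) and unique(c(a, b)) return the same values in the same order for character vectors. A quick way to convince yourself is a small check like the one below; the vectors a and b are made up for illustration and are not part of the benchmark data.

```r
# Illustrative vectors (assumed, not from the benchmark)
a <- c("a", "c", "b", "a")
b <- c("b", "d", "c")

lv_union  <- union(a, b)       # levels as computed in f2
lv_unique <- unique(c(a, b))   # levels as computed in f3

# Same values in the same order of first appearance
identical(lv_union, lv_unique)

# With shared levels, the integer codes mean the same thing in both columns
fa <- factor(a, levels = lv_unique)
fb <- factor(b, levels = lv_unique)
identical(levels(fa), levels(fb))
```

Both calls to identical() should return TRUE for plain, unnamed character vectors like these.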

set.seed(5467)
n <- 1e7
z <- data.frame(a = sample(letters[1:3], n, TRUE),
     b = sample(letters[2:4], n, TRUE),
     stringsAsFactors=FALSE)

t1 <- system.time(z1 <- f1(z))
t2 <- system.time(z2 <- f2(z))
t3 <- system.time(z3 <- f3(z))

identical(z1, z2) #[1] TRUE
identical(z1, z3) #[1] TRUE

rbind(t1, t2, t3)
   user.self sys.self elapsed user.child sys.child
t1      2.55     0.47    3.01         NA        NA
t2      1.57     0.29    1.87         NA        NA
t3      1.51     0.26    1.78         NA        NA
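
The same idea extends to more than two columns. The following is only a sketch: share_levels() and the small data frame d are hypothetical names introduced here, and it assumes the chosen columns are character (or can be converted with as.character()).

```r
# Hypothetical helper (not from the original post): give every column in
# `cols` the same factor levels, computed once over all of them.
share_levels <- function(x, cols = names(x)) {
    # Pool the values of all chosen columns and keep first occurrences
    lv <- unique(unlist(lapply(x[cols], as.character), use.names = FALSE))
    # Convert each chosen column to a factor with the common levels
    x[cols] <- lapply(x[cols], factor, levels = lv)
    x
}

# Small made-up example with three columns
d <- data.frame(a = c("a", "b", "a"),
                b = c("b", "c", "d"),
                c = c("e", "a", "c"),
                stringsAsFactors = FALSE)

d2 <- share_levels(d)
levels(d2$a)
levels(d2$c)
```

As in f3, the levels are computed once and come out in order of first appearance across the pooled columns, so every column ends up with an identical level set.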

Hope this helps,

Rui Barradas

On 16-09-2012 17:46, Sam Steingold wrote: