Regression stars
On Feb 12, 2013, at 11:05 AM, Brian Lee Yung Rowe wrote:
I thought that the default was the way it was for performance reasons. For large data.frames or repeated applications, using factors should be faster for non-trivial strings.
fs <- c('apple','peach','watermelon','spinach','persimmon','potato','kale')
n <- 1000000
a1 <- data.frame(f=sample(fs,n,replace=TRUE), x1=rnorm(n), x2=rnorm(n), stringsAsFactors=TRUE)
a2 <- data.frame(f=sample(fs,n,replace=TRUE), x1=rnorm(n), x2=rnorm(n), stringsAsFactors=FALSE)
fn <- function(i,x) x[x$f %in% c('kale','spinach'),]
system.time(z <- sapply(1:100, fn, a1))
   user  system elapsed
 19.614   4.037  24.649
system.time(z <- sapply(1:100, fn, a2))
   user  system elapsed
 19.726   7.715  36.761
Not really:
system.time(z <- sapply(1:100, fn, a1))
   user  system elapsed
 13.780   0.444  14.229
rm(z)
gc()
           used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   182113  9.8     407500   21.8    337655   18.1
Vcells  5789638 44.2  133982285 1022.3 163019778 1243.8
system.time(z <- sapply(1:100, fn, a2))
   user  system elapsed
 13.201   0.668  13.873

But your test is bogus, because %in% uses match(), which converts factors to character vectors anyway. So in your case you're just measuring noise in your system; character vectors are always faster in your example. The reason is that in R strings are hashed, so character vectors are technically very similar to factors, just with faster access (because they don't need to go through the integer indirection). On 32-bit, strings are in theory always faster than factors; on 64-bit they use double the size, so they may or may not be faster depending on how you hit the cache etc.

Anyway, in modern R versions you're much better off using character vectors than factors for any processing, so stringsAsFactors=FALSE is what I use exclusively.

Cheers,
Simon
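To illustrate the point about %in% and match(), here is a minimal sketch on a toy factor. The names via_factor, via_character, via_codes and wanted are made up for this example. The last variant, which compares integer codes directly, is not something proposed in the thread; it is only one way a benchmark that actually exercises the factor representation might be written.

```r
# Toy data: a small factor standing in for the column f in the benchmark.
f <- factor(c("kale", "apple", "spinach", "kale"))

# %in% is built on match(), and match() treats a factor as its character
# representation, so both calls below do essentially the same work.
via_factor    <- f %in% c("kale", "spinach")
via_character <- as.character(f) %in% c("kale", "spinach")

# Assumption for illustration: to make use of the integer codes, look the
# wanted values up in levels(f) once and then compare codes.
wanted    <- which(levels(f) %in% c("kale", "spinach"))
via_codes <- as.integer(f) %in% wanted

print(identical(via_factor, via_character))
print(identical(via_factor, via_codes))
```

Whether the code-based variant is any faster than plain character matching would have to be measured; the sketch only shows that all three select the same rows.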
On Feb 12, 2013, at 10:40 AM, Ben Bolker <bbolker at gmail.com> wrote:
Thanks, Uwe. Now let me go one step farther. Can you (or anyone) give a good argument **other than backward compatibility** for keeping the stringsAsFactors=TRUE argument on data.frame()? I appreciate your distinction between data.frame() and read.table()'s use of stringsAsFactors, and I can see that there is some point for quick-and-dirty interactive use in setting all non-numeric variables to factors (arguing that wanting non-numerics as factors is somewhat more common than wanting them as strings).

It might be nice to add an optional stringsAsFactors (and check.names) argument to transform(): I've had to write my own Transform() function to allow the defaults to be overridden, since transform() calls data.frame() with the defaults. (Setting the stringsAsFactors option globally would work, although not for check.names.)

Ben Bolker
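For readers wondering what such a wrapper might look like, here is a rough sketch. It is not Ben's actual Transform(), which was not posted; the argument names and the FALSE defaults are assumptions chosen for the example. It evaluates the new columns inside the data frame the way transform() does, and forwards stringsAsFactors and check.names to data.frame() when columns are added.

```r
# Hypothetical wrapper, not the function mentioned in the thread.
Transform <- function(df, ..., stringsAsFactors = FALSE, check.names = FALSE) {
  # Evaluate the named expressions with the columns of df in scope.
  new_cols <- eval(substitute(list(...)), df, parent.frame())
  existing <- names(new_cols) %in% names(df)

  # Columns that already exist are overwritten in place.
  if (any(existing)) {
    df[names(new_cols)[existing]] <- new_cols[existing]
  }
  # Genuinely new columns go through data.frame(), with the caller's
  # choices for stringsAsFactors and check.names passed along.
  if (any(!existing)) {
    df <- do.call(data.frame,
                  c(list(df), new_cols[!existing],
                    list(stringsAsFactors = stringsAsFactors,
                         check.names = check.names)))
  }
  df
}

d  <- data.frame(x = 1:2)
d1 <- Transform(d, label = c("a", "b"))
d2 <- Transform(d, label = c("a", "b"), stringsAsFactors = TRUE)
d3 <- Transform(d, x = x * 2)
print(class(d1$label))
print(class(d2$label))
```

One limitation of the sketch: a new column literally named stringsAsFactors or check.names would clash with the forwarded arguments.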
What I will likely do is make a few changes so that character vectors are automatically changed to factors in modelling functions, so that operating with stringsAsFactors=FALSE doesn't trigger silly warnings.

Duncan Murdoch
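A small sketch of what this means in practice, assuming a reasonably recent R: a character column used as a predictor in lm() is treated as a factor when the model is built. The data frame d and the column names y and g are invented for the example, and the sketch says nothing about which R version introduced the behaviour.

```r
# g is deliberately kept as a character vector.
d <- data.frame(y = c(1, 2, 3, 4),
                g = c("a", "b", "a", "b"),
                stringsAsFactors = FALSE)

# The character predictor is coded as a two-level factor, with "a" as
# the reference level and a single contrast column for "b".
fit <- lm(y ~ g, data = d)
print(names(coef(fit)))
```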
[apologies for snipping context: "gmane made me do it"]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel