Efficiency of factor objects

4 messages · Stavros Macrakis, Milan Bouchet-Valat, Jeff Ryan +1 more

Stavros Macrakis

Fri, Nov 4, 2011 4:19 PM #

R factors are the natural way to represent categorical variables -- and should be
efficient since they use small integers.  But in fact, for many (but
not all) operations, R factors are considerably slower than integers,
or even character strings.  This appears to be because whenever a
factor vector is subsetted, the entire levels vector is copied.  For
example:

user  system elapsed
   0.03    0.00    0.04

user  system elapsed
   0.04    0.00    0.04

user  system elapsed
   0.67    0.00    0.68
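
The commands that produced the timings above are not preserved in this copy of the message. A comparison of this kind could be sketched as follows; the vector sizes, the repetition count, and the helper name time_subset are assumptions for illustration, not the original benchmark.

```r
# Hypothetical benchmark: sizes and repetition counts are assumptions,
# not the values used in the original message.
set.seed(1)
n_levels <- 1e5                       # number of distinct values
n        <- 1e5                       # vector length
int_vec  <- sample.int(n_levels, n, replace = TRUE)
chr_vec  <- as.character(int_vec)
fac_vec  <- factor(chr_vec)           # factor with many levels

# Time many small subsetting operations on one vector.
time_subset <- function(x, reps = 1000) {
  system.time(for (k in seq_len(reps)) x[1:10])
}

print(time_subset(int_vec))   # integer vector
print(time_subset(chr_vec))   # character vector
print(time_subset(fac_vec))   # factor
```

Timings will vary with the machine and the R version, so the gap reported above may not reproduce exactly.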

Putting the levels vector in an environment speeds up subsetting:

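myfactor <- function(...) {
     f <- factor(...)
     g <- unclass(f)                 # integer codes without the factor class
     class(g) <- "myfactor"
     # Keep the levels in an environment, which is passed by reference
     attr(g,"mylevels") <- as.environment(list(levels=levels(f)))
     g }
`[.myfactor` <-
function (x, ...)
{
    y <- NextMethod("[")            # subset the underlying integer codes
    attributes(y) <- attributes(x)  # reattach class and environment reference
    y
}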

user  system elapsed
   0.05    0.00    0.04

Given R's value semantics, I believe this approach can be extended to
most of class factor's functionality without problems, copying the
environment if necessary.  Some quick tests seem to show that this is
no slower than ordinary factors even for very small numbers of levels.
 To do this, appropriate methods for this class (print, [<-, levels<-,
etc.) would have to be written. Perhaps some core R functions also
have to be changed?
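
As a rough illustration of what a few of those methods might look like, here is a minimal sketch. The constructor is a variation on the one above (it stores only the integer codes plus the environment), and the levels, levels<-, as.character and print methods are hypothetical, written only to show how the environment could be copied when the levels change so that value semantics are preserved.

```r
# Sketch only: a variation on the constructor above; all methods below
# are hypothetical illustrations, not an existing package.
myfactor <- function(...) {
  f <- factor(...)
  env <- new.env()
  env$levels <- levels(f)
  structure(as.integer(f), class = "myfactor", mylevels = env)
}

`[.myfactor` <- function(x, ...) {
  y <- NextMethod("[")
  attributes(y) <- attributes(x)   # shares the environment, no copy of levels
  y
}

# Read the levels from the environment.
levels.myfactor <- function(x) attr(x, "mylevels")$levels

# Store new levels in a fresh environment, so that other objects still
# pointing at the old environment are unaffected.
`levels<-.myfactor` <- function(x, value) {
  env <- new.env()
  env$levels <- as.character(value)
  attr(x, "mylevels") <- env
  x
}

as.character.myfactor <- function(x, ...) levels(x)[unclass(x)]

print.myfactor <- function(x, ...) {
  print(as.character(x), quote = FALSE)
  cat("Levels:", levels(x), "\n")
  invisible(x)
}

a <- myfactor(c("lo", "hi", "lo", "mid"))
b <- a[1:2]                            # b shares a's levels environment
shared <- identical(attr(a, "mylevels"), attr(b, "mylevels"))
levels(b) <- c("HIGH", "LOW", "MID")   # b now has its own environment
print(a)
print(b)
```

Replacing the environment rather than modifying it in place is one way to keep a unchanged when the levels of b are reassigned; an in-place update would be visible through every object sharing the environment.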

Am I missing some obvious flaw in this approach?  Has anyone already
implemented a factors package using this or some similar approach?

Thanks,

             -s

Milan Bouchet-Valat

Sat, Nov 5, 2011 9:30 AM #

On Friday, 4 November 2011 at 19:19 -0400, Stavros Macrakis wrote:

Is it so common for a factor to have so many levels? One can probably
argue that, in that case, using a numeric or character vector is
preferred - factors are no longer the "natural way" of representing this
kind of data.

Adding code to fix a completely theoretical problem is generally not a
good idea. I think you'd have to come up with a real use case to have any
hope of convincing the developers that a change is needed. There are probably many
more interesting areas where speedups can be gained than that.


Regards

Jeff Ryan

Sat, Nov 5, 2011 9:45 AM #

Or better still, extend R via the mechanisms in place.  Something akin
to a fast factor package.  Any change to R causes downstream issues in
(hundreds of?) millions of lines of deployed code.

It almost seems hard to fathom that a package for this doesn't already
exist. Have you searched CRAN?

Jeff

On Sat, Nov 5, 2011 at 11:30 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Jeffrey Ryan
jeffrey.ryan at lemnica.com

www.lemnica.com
www.esotericR.com

Patrick Burns

Sat, Nov 5, 2011 11:12 AM #

Perhaps 'data.table' would be a package
on CRAN that would be acceptable.

On 05/11/2011 16:45, Jeffrey Ryan wrote:

Patrick Burns
pburns at pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')