Opinion: Why I find factors convenient to use
Hello,

On 17-08-2012 20:27, Bert Gunter wrote:
... so it may be just the way object.size() counts in the two cases, right?
Or maybe the way character vectors and factors are coded. (64 bit Windows 7 or Ubuntu 12.04)

80k for the character vector seems to be 8 * 1e4 for pointers plus room for the strings themselves, and 40k for the factor seems more like 32 bit ints * 1e4 in consecutive memory locations. I confess to being too lazy to go check the sources, but if this is the case then it's another point in favor of factors: they are indeed more efficient memory-wise. And 64 bit OSs are becoming more and more common, processors aren't becoming worse.

There is also the statistical side of it. Factors are the natural way of coding nominal or categorical variables. The small/medium/large example is a good one. Or seasons: we like to see Fall or Autumn after Spring and Summer, not before. (btw, does anyone know why M/F?) And this has nothing to do with the usefulness of characters. I like persons' names to be names, in alphabetical order.

I've also made a simple check. Apparently, character vectors are kept as a vector of pointers and a vector of unique strings. If we change one of the strings, even to something smaller, occupying fewer bytes, object.size will report an increase in size. Try x[1] <- "a" and see the new size of x. It's bigger, and the number of pointers to strings is the same.

For 32 and 64 bit Windows 7 and for 64 bit Ubuntu 12.04, R was:

> R.version
[...]
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows

Rui Barradas
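A minimal sketch of the level-ordering point made above, for readers who want to try it. The vector `sizes` and the names `f_default` and `f_ordered` are made up for illustration and do not come from the thread. By default, factor() sorts the levels alphabetically; passing `levels` explicitly gives the order we actually want in tables and plots.

```r
# Hypothetical data, for illustration only
sizes <- c("small", "large", "medium", "small")

# Default: levels are the sorted unique values (alphabetical here)
f_default <- factor(sizes)
levels(f_default)

# Explicit level order: small < medium < large in displays
f_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_ordered)

# table() and sort() follow the level order, not the alphabet
table(f_ordered)
sort(f_ordered)
```

The same idea applies to the seasons example: list the levels in calendar order when creating the factor.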
-- Bert

On Fri, Aug 17, 2012 at 11:42 AM, Peter Langfelder <peter.langfelder at gmail.com> wrote:
On Fri, Aug 17, 2012 at 11:34 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello, No, factors may use less memory. System dependent?
I think it's a 32-bit vs. 64-bit distinction - I get Rui's results on 64-bit Windows and Linux installations, but Bert's result on a 32-bit Linux machine.

Peter
x <- sample(c("small", "medium", "large"), 1e4, replace = TRUE)
y <- factor(x)
object.size(x)
80184 bytes
object.size(y)
40576 bytes
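
The arithmetic discussed in this thread can be checked roughly as follows. This is only a sketch: the seed is arbitrary, the variable names (`ptr_bytes`, `int_bytes`, `x2`, and so on) are made up here, and the exact byte counts depend on the R version and platform, so none are asserted. It assumes, as guessed in the thread, that a character vector costs about one pointer per element and a factor about one 4-byte integer per element.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 1e4
x <- sample(c("small", "medium", "large"), n, replace = TRUE)
y <- factor(x)

ptr_bytes <- .Machine$sizeof.pointer  # 8 on 64-bit builds, 4 on 32-bit
int_bytes <- 4L                       # R integers are 32-bit

size_x <- as.numeric(object.size(x))
size_y <- as.numeric(object.size(y))

# What is left after the per-element cost: header plus the unique strings
size_x - n * ptr_bytes
# What is left for the factor: header, levels and class attributes
size_y - n * int_bytes

# Rui's check: replace one element by a new, shorter string
x2 <- x
x2[1] <- "a"
size_x2 <- as.numeric(object.size(x2))
size_x2 > size_x  # a fourth unique string has to be stored
```

On a 32-bit build `ptr_bytes` equals `int_bytes`, which would be consistent with the character vector and the factor coming out at about the same size there.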