Memory consumption, integer versus factor

2 messages · Ajay Shah, Duncan Murdoch

#
R is so smart! I found that when you switch a column from integer to
factor, the memory consumption goes down rather impressively.

Now I'd like to learn more. How does R do this? What does R do? How do
I learn more?

I got to thinking: if I were really smart, I'd see that a factor with 2
levels requires only 1 bit of storage per element, so I'd be able to cram
8 such elements into a byte. But this would come at the price of more
complex code, since reading and writing that object would require sub-byte
operations. Does R go this far? I think not, given the more modest
gains that I see. Does it go down to a byte? To a four-byte word
instead of 8 bytes of storage?

What are Ncells and Vcells, and what determines R's consumption of
memory for each kind?

If you're curious about this, here's a program that serves as a demo:

   x <- matrix(as.numeric(runif(1e6)>.5), nrow=100000)
   D <- data.frame(x)
   rm(x)

   # Take stock:
   gc()
   sum(gc()[,2])
   object.size(D)

   # Switch to factors --
   D[] <- lapply(D, factor)   # convert every column (X1..X10) to a factor

   # Take stock:
   gc()
   sum(gc()[,2])
   object.size(D)


Using this, I find that the cost of these 10 vectors goes down from 12
MB to 8 MB. This suggests savings, but not the dramatic impact of
recognising that a factor with 2 levels only requires 1 bit per element.
#
Ajay Narottam Shah wrote:
Most numeric variables are stored as 8-byte doubles.  Factors are stored
as 4-byte integers, plus a table giving the factor levels.
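
As a rough illustration of the point above, the following sketch compares the storage of one column held as a double, as an integer, and as a factor. The variable names (n, x_dbl, x_int, x_fac, sizes) are made up for this example, and the exact byte counts reported by object.size() may vary with the R version and platform, so treat the per-element figures as approximate.

```r
n <- 1e5

x_dbl <- as.numeric(runif(n) > 0.5)   # 0/1 values stored as doubles
x_int <- as.integer(x_dbl)            # the same values stored as integers
x_fac <- factor(x_dbl)                # integer codes plus a levels attribute

typeof(x_dbl)   # "double"
typeof(x_int)   # "integer"
typeof(x_fac)   # "integer"

# Total size in bytes of each representation
sizes <- c(double  = as.numeric(object.size(x_dbl)),
           integer = as.numeric(object.size(x_int)),
           factor  = as.numeric(object.size(x_fac)))
print(sizes)

# Approximate bytes per element: about 8 for double, about 4 for the others
print(sizes / n)
```

With ten columns of 1e5 elements each, going from about 8 bytes to about 4 bytes per element saves roughly 4 MB, which is in line with the drop from 12 MB to 8 MB reported in the demo.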
You will sometimes find what you want in the R Language Definition, for 
example here:

"Factors are currently implemented using an integer array to specify the 
actual levels and a second array of names that are mapped to the integers.
Rather unfortunately users often make use of the implementation in order to
make some calculations easier. This, however, is an implementation issue and
is not guaranteed to hold in all implementations of R."
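
A small sketch of what the quoted passage describes, as it behaves in current R: the integer codes and the levels of a factor can be inspected directly. The objects f and f2 are invented for this illustration. The second half shows the usual pitfall of treating the codes as if they were the data.

```r
f <- factor(c("no", "yes", "yes", "no", "yes"))

typeof(f)       # "integer"
levels(f)       # "no" "yes"
as.integer(f)   # 1 2 2 1 2  (the codes, which index into the levels)
unclass(f)      # the codes together with the "levels" attribute

# Pitfall: the codes are not the original values
f2 <- factor(c(10, 20, 10))
as.integer(f2)                 # 1 2 1
as.numeric(as.character(f2))   # 10 20 10
```

Going through levels() or as.character() relies only on the documented interface rather than on the integer codes.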

For more details, there are some implementation documents on 
developer.r-project.org, but in general the only sure way to find out 
how something is implemented is to look at the source code.

Usually it's a bad idea to rely on the implementation details, as the
last sentence quoted above says.  If it's not documented, it's subject
to change without warning.

For Ncells and Vcells, see the man pages ?gc and ?Memory, and the source
code.
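
As a starting point for reading ?gc, here is a minimal sketch of how the matrix returned by gc() can be inspected. According to ?gc, Ncells are cons cells (usually 28 bytes each on 32-bit systems and 56 bytes on 64-bit systems) and Vcells are vector cells (8 bytes each). The names mem, before, big and after are made up for this example, and the measured difference is only approximate because other allocations can happen in between.

```r
mem <- gc()
rownames(mem)   # "Ncells" "Vcells"
mem[, 1]        # cells currently in use, per kind
mem[, 2]        # the same usage expressed in Mb

# Allocating 1e6 doubles should raise the Vcells count by roughly 1e6,
# since each double occupies one 8-byte vector cell.
before <- gc()["Vcells", 1]
big    <- numeric(1e6)
after  <- gc()["Vcells", 1]
after - before
```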

Duncan Murdoch