On Friday, 8/13/2004 at 08:41 PM, Marc Schwartz wrote:
> Part of that decision may depend upon how big the dataset is and what is
> intended to be done with the IDs:
>
> object.size(1011001001001)
> object.size("1011001001001")
> object.size(factor("1011001001001"))
> [1] 244
>
> They will by default, as Andy indicates, be read and stored as doubles.
> They are too large for integers, at least on my system, where the maximum
> integer (.Machine$integer.max) is:
> [1] 2147483647
>
> Converting to a character might make sense, with only a minimal memory
> penalty. However, using a factor results in a notable memory penalty if
> the attributes of a factor are not needed.
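
A minimal sketch for reproducing those checks on one's own system (not part of the quoted post; the exact byte counts vary with the R version and platform):

id <- 1011001001001
typeof(id)                     # "double": read in as a double by default
id > .Machine$integer.max      # TRUE: larger than the 32-bit integer maximum
as.integer(id)                 # NA, with a coercion warning
object.size(id)                # size when stored as a double
object.size(as.character(id))  # size when stored as a character string
object.size(factor(id))        # size when stored as a factor (integer code + levels + class)
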
That depends on how long the vectors are. The memory overhead for factors
is per vector, with only 4 bytes used for each additional element (if the
level already appears). The memory overhead for character data is per
element -- there is no amortization for repeated values.
> object.size(factor("1011001001001"))
[1] 244
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
[1] 308
> # bytes per element in factor, for length 4:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
[1] 77
> # bytes per element in factor, for length 1000:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
[1] 4.292
> # bytes per element in character data, for length 1000:
> object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
[1] 20.028
So, for long vectors with relatively few distinct values, storage as
factors is far more memory efficient: the character data is stored only
once per level, and each element is stored as a 4-byte integer code.
(The above was run on Windows 2000.)
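
A follow-up sketch (illustrative IDs and counts, not from the original exchange) showing the same comparison at a larger scale; the reported sizes again depend on the R version and platform:

set.seed(1)
levs <- sprintf("10110010010%02d", 0:99)     # 100 distinct 13-digit IDs (made up)
x <- sample(levs, 100000, replace = TRUE)    # a long vector repeating few values
object.size(x)                               # size as a character vector
object.size(factor(x))                       # size as a factor: levels stored once,
                                             # elements held as integer codes
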
-- Tony Plate