
numerical accuracy, dumb question

8 messages · Liaw, Andy, Tony Plate, Dan Bolser +2 more

#
If I'm not mistaken, numerics are read in as doubles, so that shouldn't be a
problem.  However, I'd try using factor or character.

Andy
#
Part of that decision may depend upon how big the dataset is and what is
intended to be done with the ID's:
> object.size(1011001001001)
[1] 36
> object.size("1011001001001")
[1] 52
> object.size(factor("1011001001001"))
[1] 244


They will by default, as Andy indicates, be read and stored as doubles.
They are too large for integers, at least on my system:
> .Machine$integer.max
[1] 2147483647

Converting to a character might make sense, with only a minimal memory
penalty. However, using a factor results in a notable memory penalty, if
the attributes of a factor are not needed.

If any mathematical operations are to be performed with the ID's then
leaving them as doubles makes most sense.
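The integer-range point is easy to check at the console; a minimal sketch, using Dan's 13-digit ID as the example value:

```r
# Why 13-digit IDs cannot be stored as R integers: they exceed
# .Machine$integer.max, so coercion yields NA (with a warning).
.Machine$integer.max                          # 2147483647
suppressWarnings(as.integer(1011001001001))   # NA: outside integer range
# As doubles they are safe: integers are represented exactly up to 2^53.
1011001001001 < 2^53                          # TRUE
```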

Dan, more information on the numerical characteristics of your system
can be found by using:

.Machine

See ?.Machine and ?object.size for more information.

HTH,

Marc Schwartz
On Fri, 2004-08-13 at 21:02, Liaw, Andy wrote:
#
At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
That depends on how long the vectors are.  The memory overhead for factors 
is per vector, with only 4 bytes used for each additional element (if the 
level already appears).  The memory overhead for character data is per 
element -- there is no amortization for repeated values.

> object.size(factor("1011001001001"))
[1] 244
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
[1] 308
> # bytes per element in factor, for length 4:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
[1] 77
> # bytes per element in factor, for length 1000:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
[1] 4.292
> # bytes per element in character data, for length 1000:
> object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
[1] 20.028

So, for long vectors with relatively few different values, storage as 
factors is far more memory efficient (this is because the character data is 
stored only once per level, and each element is stored as a 4-byte 
integer).  (The above was done on Windows 2000).
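Tony's comparison can be rerun as a sketch on a current R build; the absolute byte counts will differ from the 2004 Windows figures, but the per-element advantage of factors for long, repetitive vectors persists:

```r
# Compare per-element storage for a long vector with few unique values.
ids <- rep(c("1011001001001", "111001001001",
             "001001001001", "011001001001"), 250)   # length 1000, 4 levels
object.size(factor(ids)) / length(ids)        # ~4 bytes/element + fixed overhead
object.size(as.character(ids)) / length(ids)  # at least a pointer per element
```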

-- Tony Plate
#
On Sat, 2004-08-14 at 08:42, Tony Plate wrote:
Good point, Tony. I was making the, perhaps incorrect, assumption that
the ID's were unique or relatively so. However, as it turns out, even
that assumption is relevant only to a certain extent with respect to how
much memory is required.

What is interesting (and presumably I need to do some more reading on
how R stores objects internally) is that the incremental amount of
memory is not consistent on a per-element basis for a given object,
though there is a pattern. It is also dependent upon the size of the new
elements to be added, as I note at the bottom.

This all of course presumes that object.size() is giving a reasonable
approximation of the amount of memory actually allocated to an object,
for which the notes in ?object.size raise at least some doubt. This is a
critical assumption for the data below, which is on FC2 on a P4.

For example:

> object.size("a")
[1] 44
> object.size(letters)
[1] 340

In the second case, as Tony has noted, the size of letters (a character
vector) is not 26 * 44.

Now note:
[1] 52
[1] 68
[1] 76
[1] 92

The incremental sizes are a sequence of 8 and 16.

Now for a factor:
[1] 236
[1] 244
[1] 268
[1] 276
[1] 300

The incremental sizes are a sequence of 8 and 24.


Using elements along the lines of Dan's:
> object.size(c("1000000000000"))
[1] 52
> object.size(c("1000000000000", "1000000000001"))
[1] 68
> object.size(c("1000000000000", "1000000000001", "1000000000002"))
[1] 92
> object.size(c("1000000000000", "1000000000001", "1000000000002",
                "1000000000003"))
[1] 108
> object.size(c("1000000000000", "1000000000001", "1000000000002",
                "1000000000003", "1000000000004"))
[1] 132

The sequence is 16 and 24.

For factors:
> object.size(factor(c("1000000000000")))
[1] 244
> object.size(factor(c("1000000000000", "1000000000001")))
[1] 260
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002")))
[1] 292
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002", "1000000000003")))
[1] 308
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002", "1000000000003",
                       "1000000000004")))
[1] 340

The sequence is 16 and 32.


So, the incremental size seems to alternate as elements are added. 

The behavior above would perhaps suggest that memory is allocated to
objects to enable pairs of elements to be added. When the second element
of the pair is added, only a minimal incremental amount of additional
memory (and presumably time) is required.

However, when I add a "third" element, there is additional memory
required to store that new element because the object needs to be
adjusted in a more fundamental way to handle this new element.

There also appears to be some memory allocation "adjustment" at play
here. Note:
[1] 244
[1] 236

In the second case, the amount of memory reported actually declines by 8
bytes. This suggests (to some extent consistent with my thoughts above)
that when the object is initially created, there is space for two new
elements and that space is allocated based upon the size of the first
element. When the second element is added, the space required is
adjusted based upon the actual size of the second element.
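The append-one-element-at-a-time experiment above can be sketched compactly; a hypothetical rerun, where the absolute sizes and increments depend on the platform and R version, so only the pattern is of interest:

```r
# Watch object.size grow as 13-digit IDs are added one at a time.
ids <- paste0("100000000000", 0:4)   # "1000000000000" ... "1000000000004"
sizes <- sapply(seq_along(ids),
                function(n) object.size(ids[seq_len(n)]))
diff(sizes)   # incremental bytes per added element (character vector)
diff(sapply(seq_along(ids),
            function(n) object.size(factor(ids[seq_len(n)]))))  # as factor
```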

Again, all of the above presumes that object.size() is reporting correct
information.

Thanks,

Marc
#
On Sat, 2004-08-14 at 12:01, Marc Schwartz wrote:

Arggh.

Negate that last comment. I had a typo in the second example. It should
be:
[1] 252

which of course results in an increase in memory.

Geez. Time for lunch.

Marc
#
On Sat, 14 Aug 2004, Marc Schwartz wrote:

Of course not.  Both are character vectors, so have the overhead of any R
object plus an allocation for pointers to the elements plus an amount for
each element of the vector (see the end).

These calculations differ on 32-bit and 64-bit machines.  For a 32-bit
machine storage is in units of either 28 bytes (Ncells) or 8 bytes
(Vcells) so single-letter characters are wasteful, viz
> object.size("a")
[1] 44

That is 1 Ncell and 2 Vcells, 1 for the string (7 bytes plus terminator)
and 1 for the pointer.

Whereas
> object.size(letters)
[1] 340

has 1 Ncell and 39 Vcells, 26 for the strings and 13 for the pointers 
(which fit two to a Vcell).
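The 32-bit accounting described above can be expressed as a small function; a sketch, assuming the constants given here (28-byte Ncells, 8-byte Vcells, pointers packed two per Vcell) and ignoring the string sharing discussed next:

```r
# Naive 32-bit size of a character vector, per the Ncell/Vcell accounting.
ncell <- 28                                                # bytes per Ncell
vcell <- 8                                                 # bytes per Vcell
str_vcells <- function(s) ceiling((nchar(s) + 1) / vcell)  # string + terminator
ptr_vcells <- function(n) ceiling(n / 2)                   # two pointers per Vcell
charvec_bytes <- function(x)
  ncell + vcell * (ptr_vcells(length(x)) + sum(sapply(x, str_vcells)))
charvec_bytes("a")      # 44, matching object.size("a") above
charvec_bytes(letters)  # 340, matching object.size(letters)
```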

Note that repeated character strings may share storage, so for example

> object.size(rep("a", 26))
[1] 340

is wrong (140, I think). And that makes comparisons with factors depend
on exactly how they were created; for a character vector there probably
is a lot of sharing.

I have a feeling that these calculations are off for character vectors, as 
each element is a CHARSXP and so may have an Ncell not accounted for by 
object.size.  (`May' because of potential sharing.)  Would anyone who is 
sure like to confirm or deny this?

It ought to be possible to improve the estimates for character vectors a 
bit as we can detect sharing amongst the elements.
#
On Sat, 2004-08-14 at 13:19, Prof Brian Ripley wrote:
Prof. Ripley,

Thanks for the clarifications. 

I'll need to spend some time reading through R-exts.pdf and
Rinternals.h.

Regards,

Marc