
numerical accuracy, dumb question

8 messages · Liaw, Andy, Tony Plate, Dan Bolser +2 more

#
If I'm not mistaken, numerics are read in as doubles, so that shouldn't be a
problem.  However, I'd try using factor or character.

Andy
#
Part of that decision may depend upon how big the dataset is and what is
intended to be done with the ID's:
> object.size(1011001001001)
[1] 36
> object.size("1011001001001")
[1] 52
> object.size(factor("1011001001001"))
[1] 244


They will by default, as Andy indicates, be read and stored as doubles.
They are too large for integers, at least on my system:
> .Machine$integer.max
[1] 2147483647

Converting to a character might make sense, with only a minimal memory
penalty. However, using a factor results in a notable memory penalty, if
the attributes of a factor are not needed.

If any mathematical operations are to be performed with the ID's then
leaving them as doubles makes most sense.
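The integer-range point is easy to check at the console; a minimal sketch, using Dan's 13-digit ID as the example value:

```r
# Why 13-digit IDs cannot be stored as R integers: they exceed
# .Machine$integer.max, so coercion yields NA (with a warning).
.Machine$integer.max                          # 2147483647
suppressWarnings(as.integer(1011001001001))   # NA: outside integer range
# As doubles they are safe: integers are represented exactly up to 2^53.
1011001001001 < 2^53                          # TRUE
```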

Dan, more information on the numerical characteristics of your system
can be found by using:

.Machine

See ?.Machine and ?object.size for more information.

HTH,

Marc Schwartz
On Fri, 2004-08-13 at 21:02, Liaw, Andy wrote:
#
At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
That depends on how long the vectors are.  The memory overhead for factors 
is per vector, with only 4 bytes used for each additional element (if the 
level already appears).  The memory overhead for character data is per 
element -- there is no amortization for repeated values.

> object.size(factor("1011001001001"))
[1] 244
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
[1] 308
> # bytes per element in factor, for length 4:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
[1] 77
> # bytes per element in factor, for length 1000:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
[1] 4.292
> # bytes per element in character data, for length 1000:
> object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
[1] 20.028

So, for long vectors with relatively few different values, storage as 
factors is far more memory efficient (this is because the character data is 
stored only once per level, and each element is stored as a 4-byte 
integer).  (The above was done on Windows 2000).
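Tony's comparison can be rerun as a sketch on a current R build; the absolute byte counts will differ from the 2004 Windows figures, but the per-element advantage of factors for long, repetitive vectors persists:

```r
# Compare per-element storage for a long vector with few unique values.
ids <- rep(c("1011001001001", "111001001001",
             "001001001001", "011001001001"), 250)   # length 1000, 4 levels
object.size(factor(ids)) / length(ids)        # ~4 bytes/element + fixed overhead
object.size(as.character(ids)) / length(ids)  # at least a pointer per element
```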

-- Tony Plate
#
On Sat, 2004-08-14 at 08:42, Tony Plate wrote:
Good point, Tony. I was making the, perhaps incorrect, assumption that
the ID's were unique or relatively so. However, as it turns out, even
that assumption is relevant only to a certain extent with respect to how
much memory is required.

What is interesting (and presumably I need to do some more reading on
how R stores objects internally) is that the incremental amount of
memory is not consistent on a per-element basis for a given object,
though there is a pattern. It is also dependent upon the size of the new
elements to be added, as I note at the bottom.

This all of course presumes that object.size() is giving a reasonable
approximation of the amount of memory actually allocated to an object,
for which the notes in ?object.size raise at least some doubt. This is a
critical assumption for the data below, which is on FC2 on a P4.

For example:

> object.size("a")
[1] 44
> object.size(letters)
[1] 340

In the second case, as Tony has noted, the size of letters (a character
vector) is not 26 * 44.

Now note:
[1] 52
[1] 68
[1] 76
[1] 92

The incremental sizes are a sequence of 8 and 16.

Now for a factor:
[1] 236
[1] 244
[1] 268
[1] 276
[1] 300

The incremental sizes are a sequence of 8 and 24.


Using elements along the lines of Dan's:
> object.size(c("1000000000000"))
[1] 52
> object.size(c("1000000000000", "1000000000001"))
[1] 68
> object.size(c("1000000000000", "1000000000001", "1000000000002"))
[1] 92
> object.size(c("1000000000000", "1000000000001", "1000000000002",
                "1000000000003"))
[1] 108
> object.size(c("1000000000000", "1000000000001", "1000000000002",
                "1000000000003", "1000000000004"))
[1] 132

The sequence is 16 and 24.

For factors:
> object.size(factor(c("1000000000000")))
[1] 244
> object.size(factor(c("1000000000000", "1000000000001")))
[1] 260
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002")))
[1] 292
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002", "1000000000003")))
[1] 308
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002", "1000000000003",
                       "1000000000004")))
[1] 340

The sequence is 16 and 32.


So, the incremental size seems to alternate as elements are added. 

The behavior above would perhaps suggest that memory is allocated to
objects to enable pairs of elements to be added. When the second element
of the pair is added, only a minimal incremental amount of additional
memory (and presumably time) is required.

However, when I add a "third" element, there is additional memory
required to store that new element because the object needs to be
adjusted in a more fundamental way to handle this new element.

There also appears to be some memory allocation "adjustment" at play
here. Note:
[1] 244
[1] 236

In the second case, the amount of memory reported actually declines by 8
bytes. This suggests (to some extent consistent with my thoughts above)
that when the object is initially created, there is space for two new
elements and that space is allocated based upon the size of the first
element. When the second element is added, the space required is
adjusted based upon the actual size of the second element.
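The append-one-element-at-a-time experiment above can be sketched compactly; a hypothetical rerun, where the absolute sizes and increments depend on the platform and R version, so only the pattern is of interest:

```r
# Watch object.size grow as 13-digit IDs are added one at a time.
ids <- paste0("100000000000", 0:4)   # "1000000000000" ... "1000000000004"
sizes <- sapply(seq_along(ids),
                function(n) object.size(ids[seq_len(n)]))
diff(sizes)   # incremental bytes per added element (character vector)
diff(sapply(seq_along(ids),
            function(n) object.size(factor(ids[seq_len(n)]))))  # as factor
```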

Again, all of the above presumes that object.size() is reporting correct
information.

Thanks,

Marc
#
On Sat, 2004-08-14 at 12:01, Marc Schwartz wrote:

Arggh.

Negate that last comment. I had a typo in the second example. It should
be:
[1] 252

which of course results in an increase in memory.

Geez. Time for lunch.

Marc
#
On Sat, 14 Aug 2004, Marc Schwartz wrote:

Of course not.  Both are character vectors, so have the overhead of any R
object plus an allocation for pointers to the elements plus an amount for
each element of the vector (see the end).

These calculations differ on 32-bit and 64-bit machines.  For a 32-bit
machine storage is in units of either 28 bytes (Ncells) or 8 bytes
(Vcells) so single-letter characters are wasteful, viz
> object.size("a")
[1] 44

That is 1 Ncell and 2 Vcells, 1 for the string (7 bytes plus terminator)
and 1 for the pointer.

Whereas
> object.size(letters)
[1] 340

has 1 Ncell and 39 Vcells, 26 for the strings and 13 for the pointers 
(which fit two to a Vcell).
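The 32-bit accounting described above can be expressed as a small function; a sketch, assuming the constants given here (28-byte Ncells, 8-byte Vcells, pointers packed two per Vcell) and ignoring the string sharing discussed next:

```r
# Naive 32-bit size of a character vector, per the Ncell/Vcell accounting.
ncell <- 28                                                # bytes per Ncell
vcell <- 8                                                 # bytes per Vcell
str_vcells <- function(s) ceiling((nchar(s) + 1) / vcell)  # string + terminator
ptr_vcells <- function(n) ceiling(n / 2)                   # two pointers per Vcell
charvec_bytes <- function(x)
  ncell + vcell * (ptr_vcells(length(x)) + sum(sapply(x, str_vcells)))
charvec_bytes("a")      # 44, matching object.size("a") above
charvec_bytes(letters)  # 340, matching object.size(letters)
```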

Note that repeated character strings may share storage, so for example

> object.size(rep("a", 26))
[1] 340

is wrong (140, I think). And that makes comparisons with factors depend
on exactly how they were created; for a character vector there probably
is a lot of sharing.

I have a feeling that these calculations are off for character vectors, as 
each element is a CHARSXP and so may have an Ncell not accounted for by 
object.size.  (`May' because of potential sharing.)  Would anyone who is 
sure like to confirm or deny this?

It ought to be possible to improve the estimates for character vectors a 
bit as we can detect sharing amongst the elements.
#
On Sat, 2004-08-14 at 13:19, Prof Brian Ripley wrote:
Prof. Ripley,

Thanks for the clarifications. 

I'll need to spend some time reading through R-exts.pdf and
Rinternals.h.

Regards,

Marc