bug in sum() on integer vector

10 messages · Duncan Murdoch, Hervé Pagès, Murray Stokely +1 more

#
Hi,

   x <- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))

This is correct:

   > sum(as.double(x))
   [1] 0

This is not:

   > sum(x)
   [1] 4996000

Returning NA (with a warning) would also be acceptable for the latter.
That would make it consistent with cumsum(x):

   > cumsum(x)[length(x)]
   [1] NA
   Warning message:
   Integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

Thanks!
H.

 > sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
#
On 09/12/2011 1:40 PM, Hervé Pagès wrote:
This is a 64 bit problem; in 32 bits things work out properly.   I'd 
guess in 64 bit arithmetic we or the run-time are doing something to 
simulate 32 bit arithmetic (since integers are 32 bits), but it looks as 
though we're not quite getting it right.

Duncan Murdoch
#
Hi Duncan,
On 11-12-09 11:39 AM, Duncan Murdoch wrote:
It doesn't work properly for me on Leopard (32-bit mode):

   > x <- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
   > sum(as.double(x))
   [1] 0
   > sum(x)
   [1] 4996000
   > sessionInfo()
   R version 2.14.0 RC (2011-10-27 r57452)
   Platform: i386-apple-darwin9.8.0/i386 (32-bit)

   locale:
   [1] C

   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base

It looks like the problem is that isum() (in src/main/summary.c)
uses a 'double' internally to do the sum, whereas rsum() and csum()
use a 'long double'.

Note that isum() seems to be assuming that NA_INTEGER and NA_LOGICAL
will always be the same (probably fine) and that TRUE values in the
input vector are always represented as a 1 (not so sure about this one).

A more fundamental question: is switching back and forth between
'int' and 'double' (or 'long double') the right thing to do for doing
"safe" arithmetic on integers?

Thanks!
H.

#
On 11-12-09 4:41 PM, Hervé Pagès wrote:
A double has 53 bits to store the mantissa, so any 32 bit integer can be 
stored exactly.
If you have enough terms in the sum that an intermediate value exceeds 
53 bits in length, then you'll get the wrong answer, because the 
intermediate sum can't be stored exactly.  That happens in your example. 
On the 32 bit platform I tested (Windows 32 bit), intermediate values 
are stored in registers with 64 bit precision, which is probably why 
Windows 32 bit gets it right, but various other platforms don't.

On your fundamental question:  I think the answer is that R is doing the 
right thing.  R doesn't think of an integer as a particular 
representation, it thinks of it as a number.  So if you ask for the sum 
of those numbers, R should return its best approximation to that sum, 
and it does.

A different approach would be to do the sum in 32 bit registers and 
detect 32 bit overflow in intermediate results.  But that's a very 
hardware-oriented approach, rather than a mathematical approach.

Duncan Murdoch
3 days later
#
Hi Duncan,
On 11-12-10 05:27 AM, Duncan Murdoch wrote:
It does, really? Seems like returning 0 would be a better approximation
;-) And with the argument that "R doesn't think of an integer as a
particular representation" then there is no reason why sum(x)
would get it wrong and sum(as.double(x)) would get it right. Also why
bother having an integer type in R?

Seriously, I completely disagree with your view (hopefully it's only
yours, and not an R "feature") that it's ok for integer arithmetic to
return an approximation. It should always return the correct value or
fail. This is one of the reasons why programmers use integers and not
floating point numbers (memory usage being another one). Integers are
used for indexing elements in an array or for shifting pointers at the
C-level. The idea that integer arithmetic can be approximate is scary.

Cheers,
H.

#
FYI, the new int64 package on CRAN gets this right, but is of course
somewhat slower since it is not doing hardware 64-bit arithmetic.

 > x <- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
 > library(int64)
 > sum(as.int64(x))
 [1] 0

             - Murray

2011/12/9 Hervé Pagès <hpages at fhcrc.org>:
#
[See at end]
On 13-Dec-11 23:41:12, Hervé Pagès wrote:
The approximation is inevitable, even for integers, if the
integers are large enough.

The number of particles in the Universe is believed to be
around 10^80 (estimates vary from 10^72 to 10^87). If we
could use each particle as a binary element in storage
(e.g. using positive and negative spin as 1 or 0) then
the largest integer that could be stored in the entire
Universe would be about 2^(10^80). A huge number, of course,
but it is the limit of what is possible.

Now, to do arithmetic with integers, you need to store two
or more integers, thus at least halving the power of 2.
Then you need a computer to do the computation, and you
won't have room for that in the Universe unless you cut
down on the largest integer.

So, in real life (whatever Mathematics may say), you can't
expect arbitrary integer arithmetic.

Now, computer programs for numerical computation can broadly
be divided into two types.

In one, "arbitrary precision" is available: you can tell
the program how many decimal digits you want it to work to.
An example of this is 'bc':

  http://en.wikipedia.org/wiki/Bc_programming_language

You can set as many decimal digits as you like, *provided*
they fall within the storage capacity of your computer, for
which an upper bound is the storage capacity of the Universe
(see above). For integers and results which surpass the
decimal places you have set, the result will be an approximation.
Inevitably.

In the other type, the program is written so as to embody
integers to a fixed maximum number of decimal (or binary)
digits. An example of this is R (and most other numerical
programs). This may be 32 bits or 64 bits. Any result of a
computation which involves more than this number of bits
is inevitably an approximation.

Provided the user is aware of this, there is no need for
your "It should always return the correct value or fail."
It will return the correct value if the integers are not
too large; otherwise it will return the best approximation
that it can cope with in the fixed finite storage space
for which it has been programmed.

There is an implicit element of the arbitrary in this. You
can install 32-bit R on a 64-bit-capable machine, or a
64-bit version. You could re-program R so that it can
work to, say, 128 bits or 256 bits even on a 32-bit machine
(using techniques like those that underlie 'bc'), but
that would be an arbitrary choice. However, the essential
point is that some choice is unavoidable, since if you push
it too far the Universe will run out of particles -- and the
computer industry will run out of transistors long before
you hit the Universe limit!

So you just have to accept the limits. Provided you are aware
of the approximations which may set in at some point, you can
cope with the consequences, so long as you take account of
some concept of "adequacy" in the inevitable approximations.
Simply to "fail" is far too unsophisticated a result!

Hoping this is useful,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Dec-11                                       Time: 00:52:49
------------------------------ XFMail ------------------------------
#
Hi Ted,
On 11-12-13 04:52 PM, (Ted Harding) wrote:
[...]
AFAICT, with bc and other tools doing arithmetic on arbitrary large
integers, operations like +, -, *, ^ etc either give the exact answer
or they fail. That's the beauty of those tools. Otherwise you could
call them "pointless" or "broken".

Arbitrary-precision vs fixed-precision is slightly off topic, though.
In particular I didn't suggest that doing sum() on an integer vector
should use arbitrary large integers internally to do the computation.

Cheers,
H.

#
On 11-12-13 6:41 PM, Hervé Pagès wrote:
I think you need to be more specific in your design, because the function

`+` <- function(x,y) stop("fail")

meets your specs.  I think the following requirements are all desirable, 
but I don't think they can all be met at the same time:

1.  Return the exactly correct answer in all (or even nearly all) of the 
cases where the current code returns the correct answer.

2.  Do it as quickly (or nearly as quickly) as the current code.

3.  Do it without requiring much more memory than the current code does.

4.  Never return an incorrect answer.

If you think I'm wrong, show me.

Duncan Murdoch

1 day later
#
Hi Duncan,
On 11-12-14 03:57 AM, Duncan Murdoch wrote:
Interesting. At least this program is *correct*, strictly speaking.
Which doesn't mean that it is useful. This one too is correct:

   `+` <- function(x,y) {
            warning("cannot sum 'x' and 'y' - returning NA")
            NA
          }

But this one is *not* correct:

   `+` <- function(x,y) {set.seed(4); return(some_random_number())}

because sometimes it returns the wrong answer ;-) (which is more or
less what sum() does on an integer vector at the moment).
I agree that you can't have it all but IMO correctness should have a
higher priority than speed or memory usage considerations. Quoting
my previous boss (someone you know): "There are many fast and memory
efficient ways to compute the wrong result". Just before people in
the meeting room were about to start passionate discussions about the
respective merits of those fast, efficient, but unfortunately incorrect
ways.

Cheers,
H.