Skip to content

bug in sum() on integer vector

8 messages · Peter Dalgaard, John C Nash, Uwe Ligges +2 more

#
Following this thread, I wondered why nobody tried cumsum to see where the integer
overflow occurs. On the shorter xx vector in the little script below I get a message:

Warning message:
Integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'
But sum() does not give such a warning, which I believe is the point of contention. Since
cumsum() does manage to give such a warning, and show where the overflow occurs, should
sum() not be able to do so? For the record, I don't class the non-zero answer as an error
in itself. I regard the failure to warn as the issue.

For info, on my Ubnuntu Lucid 10.04 system that has 4 GB of RAM but no swap, the last line
of the script to do the int64 sum chugs for about 2 minutes then gives "Killed" and
returns to the terminal prompt. It also seems to render some other applications unstable
(I had Thunderbird running to read R-devel, and this started to behave strangely after the
crash, and I had to reboot.) I'm copying Romain as package maintainer, and I'll be happy
to try to work off-list to figure out how to avoid the "Killed" result. (On a 16GB
machine, I got the 0 answer.)

Best,

John Nash

Here's the system info and small script.
## sumerr.R  20111214
library(int64)
x <- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
xx <- c(rep(1800000003L, 1000), -rep(1200000002L, 1500))
sum(x)
sum(as.double(x))
sum(xx)
sum(as.double(xx))
cumsum(xx)
cumsum(as.int64(xx))

tmp<-readline("Now try the VERY SLOW int64")
sum(as.int64(x))
#
On Dec 14, 2011, at 16:19 , John C Nash wrote:

            
It (sum) does warn if you take the two "halves" separately. The issue is that the overflow is detected at the end of the summation, when the result is to be saved to an integer (which of course happens for all intermediate sums in cumsum)
[1] NA
Warning message:
In sum(x[1:1e+07]) : Integer overflow - use sum(as.numeric(.))
[1] NA
Warning message:
In sum(x[10000001:1.5e+07]) : Integer overflow - use sum(as.numeric(.))
[1] 4996000

There's a pretty easy fix, essentially to move

    if(s > INT_MAX || s < R_INT_MIN){
        warningcall(call, _("Integer overflow - use sum(as.numeric(.))"));
        *value = NA_INTEGER;
    }

inside the summation loop. Obviously, there's a speed penalty from two FP comparisons per element, but I wouldn't know whether it matters in practice for anyone.
#
On 14.12.2011 17:19, peter dalgaard wrote:
I don't think I am interested in where the overflow happens if I call 
sum()...

Uwe
#
I agree that where the overflow occurs is not critical (one can go back to cumsum and find
out). I am assuming that Uwe still wants to know there has been an overflow at some point
i.e., a warning. This could become more "interesting" as parallel computation causes
different summation orderings on sums of large numbers of items.

JN
On 12/14/2011 03:58 PM, Uwe Ligges wrote:
#
On 14.12.2011 22:16, John C Nash wrote:
Yes, sure.

Uwe
#
Hi Peter,
On 11-12-14 08:19 AM, peter dalgaard wrote:
Since you want to generate this warning once only, your test (now
inside the loop) needs to be something like:

     if (warn && (s > INT_MAX || s < R_INT_MIN)) {
         generate the warning
         warn = 0;
     }

with 'warn' initialized to 1. This makes the isum() function almost
twice slower on my machine (64-bit Ubuntu) when compiling with
gcc -O2 and when no overflow occurs (the most common use case I guess).

Why not just do the sum in a long double instead of a double?
It slows down isum() by only 8% on my machine when compiling
with gcc -O2.
But most importantly this solution also has the advantage of making
sum(x) consistent with sum(as.double(x)). The latter uses rsum() which
does the sum in a long double. So by using a long double in both isum()
and rsum(), consistency between sum(x) and sum(as.double(x)) is
guaranteed.
Maybe that still doesn't give you the guarantee that sum(x) will always
return the correct value (when it does not return NA) because that
depends now on the ability of long double to represent exactly the sum
of at most INT_MAX arbitrary ints. The nb of bits used for long double
seems to vary a lot across platforms/compilers so it's hard to tell.
Not an ideal solution, but at least it makes isum() more accurate than
the current isum() and it makes sum(x) consistent with sum(as.double(x))
on all platforms, without degrading performance too much.

Cheers,
H.
#
On Dec 15, 2011, at 02:51 , Herv? Pag?s wrote:

            
Hum, yes. Also the test would be overly cautious: The real thing to test is whether we overrun the range in which integers are exactly representable in FP i.e. roughly +/-2^52, not the +/-2^31 that fits 32 bit integers. Or +/-2^63 if we have long doubles.

However, we still need to decide whether the issue is that sum(as.double(x)) can be inconsistent with sum(x), or whether it is that integer arithmetic can be inexact. Also, the timings should really be viewed in context: Does _any_ actual code use isum to an extent where halving its speed would have any noticeable impact?

We probably shouldn't touch this for 2.14.1, then.

  
    
#

        
> On Dec 15, 2011, at 02:51 , Herv? Pag?s wrote:
>> Hi Peter,
    >>
>> On 11-12-14 08:19 AM, peter dalgaard wrote:
>>>
>>> On Dec 14, 2011, at 16:19 , John C Nash wrote:
>>> 
    >>>> 
    >>>> Following this thread, I wondered why nobody tried
    >>>> cumsum to see where the integer overflow occurs. On the
    >>>> shorter xx vector in the little script below I get a
    >>>> message:
    >>>> 
    >>>> Warning message: Integer overflow in 'cumsum'; use
    >>>> 'cumsum(as.numeric(.))'
    >>>>> 
    >>>> 
    >>>> But sum() does not give such a warning, which I believe
    >>>> is the point of contention. Since cumsum() does manage
    >>>> to give such a warning, and show where the overflow
    >>>> occurs, should sum() not be able to do so? For the
    >>>> record, I don't class the non-zero answer as an error
    >>>> in itself. I regard the failure to warn as the issue.
    >>> 
    >>> It (sum) does warn if you take the two "halves"
    >>> separately. The issue is that the overflow is detected
    >>> at the end of the summation, when the result is to be
    >>> saved to an integer (which of course happens for all
    >>> intermediate sums in cumsum)
    >>> 
    >>>> x<- c(rep(1800000003L, 10000000), -rep(1200000002L,
    >>>> 15000000)) sum(x[1:10000000])
    >>> [1] NA Warning message: In sum(x[1:1e+07]) : Integer
    >>> overflow - use sum(as.numeric(.))
    >>>> sum(x[10000001:25000000])
    >>> [1] NA Warning message: In sum(x[10000001:1.5e+07]) :
    >>> Integer overflow - use sum(as.numeric(.))
    >>>> sum(x)
    >>> [1] 4996000
    >>> 
    >>> There's a pretty easy fix, essentially to move
    >>> 
    >>> if(s> INT_MAX || s< R_INT_MIN){ warningcall(call,
    >>> _("Integer overflow - use sum(as.numeric(.))")); *value
    >>> = NA_INTEGER; }
    >>> 
    >>> inside the summation loop. Obviously, there's a speed
    >>> penalty from two FP comparisons per element, but I
    >>> wouldn't know whether it matters in practice for anyone.
    >>> 
    >> 
    >> Since you want to generate this warning once only, your
    >> test (now inside the loop) needs to be something like:
    >> 
    >> if (warn && (s > INT_MAX || s < R_INT_MIN)) { generate
    >> the warning warn = 0; }
    >> 
    >> with 'warn' initialized to 1. This makes the isum()
    >> function almost twice slower on my machine (64-bit
    >> Ubuntu) when compiling with gcc -O2 and when no overflow
    >> occurs (the most common use case I guess).
    >> 
    >> Why not just do the sum in a long double instead of a
    >> double?  It slows down isum() by only 8% on my machine
    >> when compiling with gcc -O2.  But most importantly this
    >> solution also has the advantage of making sum(x)
    >> consistent with sum(as.double(x)). The latter uses rsum()
    >> which does the sum in a long double. So by using a long
    >> double in both isum() and rsum(), consistency between
    >> sum(x) and sum(as.double(x)) is guaranteed.

    > Hum, yes. Also the test would be overly cautious: The real
    > thing to test is whether we overrun the range in which
    > integers are exactly representable in FP i.e. roughly
    > +/-2^52, not the +/-2^31 that fits 32 bit integers. Or
    > +/-2^63 if we have long doubles.

    > However, we still need to decide whether the issue is that
    > sum(as.double(x)) can be inconsistent with sum(x), or
    > whether it is that integer arithmetic can be
    > inexact. Also, the timings should really be viewed in
    > context: Does _any_ actual code use isum to an extent
    > where halving its speed would have any noticeable impact?

    > We probably shouldn't touch this for 2.14.1, then.

Definitely agree on that.


Given all the (good) considerations in this thread,
I think that Herv?'s proposal of just *not* calling isum()
but rather rsum() really makes more sense:
- Not only will it give the correct (finite) answer in more cases,
- but it will be much easier to document what sum(.) does:
  sum(x)  and  sum(as.double(x)) will return the same values.
- it is faster than the isum() version which would check for
  overflow "every time".
We could think of coercing back to integer when the result is
inside the integer range and also when NA: returning
integer instead of real NA.

Martin

    >> Maybe that still doesn't give you the guarantee that
    >> sum(x) will always return the correct value (when it does
    >> not return NA) because that depends now on the ability of
    >> long double to represent exactly the sum of at most
    >> INT_MAX arbitrary ints. The nb of bits used for long
    >> double seems to vary a lot across platforms/compilers so
    >> it's hard to tell.  Not an ideal solution, but at least
    >> it makes isum() more accurate than the current isum() and
    >> it makes sum(x) consistent with sum(as.double(x)) on all
    >> platforms, without degrading performance too much.
    >> 
    >> Cheers, H.
    >> 
    >> -- 
    >> Herv? Pag?s
    >> 
    >> Program in Computational Biology Division of Public
    >> Health Sciences Fred Hutchinson Cancer Research Center
    >> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
    >> 98109-1024
    >> 
    >> E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206)
    >> 667-1319

    > -- 
    > Peter Dalgaard, Professor Center for Statistics,
    > Copenhagen Business School Solbjerg Plads 3, 2000
    > Frederiksberg, Denmark Phone: (+45)38153501 Email:
    > pd.mes at cbs.dk Priv: PDalgd at gmail.com

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel