print(...,digits=2) behavior - R-devel

Sat, Feb 5, 2011 12:58 PM #

A bug was recently posted to the R bug database (which probably would
better have been posted as a query here) as to why this happens:

[1] 8

[1] 7.9

  Two things I *haven't* done to help make sense of this for myself are
(1) writing out the binary representations to see if something obvious
pops out about why this would be a breakpoint and (2) poking in the
source code (I did a little bit of this but gave up).

  I know that confusion over rounding etc. is very common, and my first
reaction to this sort of question is always that there must be some sort
of confusion (either (1) in a FAQ 7.31-ish sort of way that floating
point values have finite precision in the first place, or (2) a
confusion over the difference between the value and the representation
of the number, or (3) more subtly, about the differences among various
choices of rounding conventions).

  However, in this case I am a bit stumped: I don't see that any of the
standard confusions apply.  I grepped the R manuals for "rounding" and
didn't find anything useful ...  I have a very strong prior that such a
core part of R must be correct, and that therefore I (and the original
bug reporter) must be misunderstanding something.

  Thoughts/references?

  cheers
    Ben Bolker

Martin Maechler

Mon, Feb 7, 2011 12:56 AM #

>   A bug was recently posted to the R bug database (which
    > probably would better have been posted as a query here) as
    > to why this happens:

    >> print(7.921,digits=2)
    > [1] 8
    >> print(7.92,digits=2)
    > [1] 7.9

    >   Two things I *haven't* done to help make sense of this
    > for myself are (1) writing out the binary representations
    > to see if something obvious pops out about why this would
    > be a breakpoint and (2) poking in the source code (I did a
    > little bit of this but gave up).

    >   I know that confusion over rounding etc. is very common,
    > and my first reaction to this sort of question is always
    > that there must be some sort of confusion (either (1) in a
    > FAQ 7.31-ish sort of way that floating point values have
    > finite precision in the first place, or (2) a confusion
    > over the difference between the value and the
    > representation of the number, or (3) more subtly, about
    > the differences among various choices of rounding
    > conventions).

    >   However, in this case I am a bit stumped: I don't see
    > that any of the standard confusions apply.  I grepped the
    > R manuals for "rounding" and didn't find anything useful
    > ...  I have a very strong prior that such a core part of R
    > must be correct, and that therefore I (and the original
    > bug reporter) must be misunderstanding something.

    >   Thoughts/references?

I had started to delve into this after you've posted the bug
report. It is clearly a bug(let),
caused by code that has been in  R  from its very
beginning, at least in the first source code I've seen in 1994.

The problem is not related to digits=2,
but using such a low number of digits shows it more
dramatically, e.g., also

 > print(5.9994, digits=4)
 [1] 5.999
 > print(5.9994001, digits=4)
 [1] 6

Interestingly, the problem seems *not* to be present for
digits = 1 (only).

I haven't found time to mathematically "analyze" it for
determining a correct solution though.
Note that fixing this bug(let) will probably (very slightly)
change a huge number of R outputs .. so there is a caveat,
but nonetheless, we must approach it.

The responsible code is the C function  scientific()
in src/main/format.c 

a somewhat direct R interface to its functionality is given by
R's format.info() :

  > format.info(7.92, digits=2)
  [1] 3 1 0
  > format.info(7.921, digits=2)
  [1] 1 0 0

which means that 7.92 will use 3 characters, 1 digit after the decimal
7.921 will use 1 character, 0 digits after decimal.

The crucial "buggy" part of the code is  {line 180 ff} :

	/* compute number of digits */

	*nsig = R_print.digits;
	for (j = 1; j <= *nsig; j++) {
	    if (fabs(alpha - floor(alpha+0.5)) < eps * alpha) {
		*nsig = j;
		break;
	    }
	    alpha *= 10.0;
	}

where notably
	    if (fabs(alpha - floor(alpha+0.5)) < eps * alpha) {
is not quite the correct check.
Note that  eps = 10^-2  in our case and  alpha = 7.92 or 7.921
and that  8 - 7.921 = 0.079 <  7.921/10^2  is just at the border.

Looking for "all the border cases" is solving the above
analytically with '=' and setting  k = ceiling(alpha) :

d=1          d=2         d=3        d=4          d=5
k=6 5.4545454545 5.9405940594 5.994005994 5.99940006 5.9999400006
k=7 6.3636363636 6.9306930693 6.993006993 6.99930007 6.9999300007
k=8 7.2727272727 7.9207920792 7.992007992 7.99920008 7.9999200008
k=9 8.1818181818 8.9108910891 8.991008991 8.99910009 8.9999100009

So we see that for our case, 7.920792... is more exactly the
border case.

One change that is *not* correct is making it an absolute
instead of relative comparison, i.e.  replacing the RHS
eps * alpha  by  eps

Namely, one important purpose of the test is to ensure that e.g.

  print(3.597, digits = 3)

is printed as  3.6   and not 3.60

Now I have had -- since 1997 at least -- an R version of
scientific() for more easy experimentation.
Here's an updated version of that:

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: scientific.R
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20110207/d05c4b37/attachment.pl>
-------------- next part --------------

and here some of the experimentation code:

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: scientific-ex.R
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20110207/d05c4b37/attachment-0001.pl>
-------------- next part --------------

-------

Now, I'm waiting for comments/suggestions ..
Martin

Petr Savicky

Mon, Feb 7, 2011 2:14 PM #

On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:

Ben Bolker <bbolker at gmail.com>
    on Sat, 5 Feb 2011 15:58:09 -0500 writes:

    >   A bug was recently posted to the R bug database (which
    > probably would better have been posted as a query here) as
    > to why this happens:

    >> print(7.921,digits=2)

    > [1] 8

    >> print(7.92,digits=2)

    > [1] 7.9

    >   Two things I *haven't* done to help make sense of this
    > for myself are (1) writing out the binary representations
    > to see if something obvious pops out about why this would
    > be a breakpoint and (2) poking in the source code (I did a
    > little bit of this but gave up).

    >   I know that confusion over rounding etc. is very common,
    > and my first reaction to this sort of question is always
    > that there must be some sort of confusion (either (1) in a
    > FAQ 7.31-ish sort of way that floating point values have
    > finite precision in the first place, or (2) a confusion
    > over the difference between the value and the
    > representation of the number, or (3) more subtly, about
    > the differences among various choices of rounding
    > conventions).

    >   However, in this case I am a bit stumped: I don't see
    > that any of the standard confusions apply.  I grepped the
    > R manuals for "rounding" and didn't find anything useful
    > ...  I have a very strong prior that such a core part of R
    > must be correct, and that therefore I (and the original
    > bug reporter) must be misunderstanding something.

    >   Thoughts/references?

I had started to delve into this after you've posted the bug
report. It is clearly a bug(let),
caused by code that has been in  R  from its very
beginning, at least in the first source code I've seen in 1994.

The problem is not related to digits=2,
but using such a low number of digits shows it more
dramatically, e.g., also

 > print(5.9994, digits=4)

 [1] 5.999

 > print(5.9994001, digits=4)

 [1] 6

I think that this may be a consequence of the following
analysis, which i did some time ago using the source code
of scientific() function in src/main/format.c. The
conclusion was the following.

Printing to k digits looks for the smallest number of
digits l <= k so that the relative error of the printed
mantissa (significand) is at most 10^-k. If the mantissa
begins with a digit less than 5, then this condition is
stronger than having absolute error at most 5*10^-k. So,
in this case, we cannot get lower accuracy than rounding
to k digits.

If the mantissa begins with 5 or more, then having relative
error at most 10^-k is a weaker condition than having absolute
error at most 5*10^-k. In this case, the chosen l may be
smaller than k even if the printed number has larger error
than the number rounded to k digits.

This may be seen in the following example

  xx <- 8 + (6:10)*10^-7
  for (x in xx) print(x)

  [1] 8
  [1] 8
  [1] 8
  [1] 8.000001
  [1] 8.000001

where all the printed numbers are 8.000001 if rounded
to 7 digits.

The cases 

  print(7.921,digits=2)
  [1] 8

  print(7.92,digits=2)
  [1] 7.9

seem to have the same explanation. The relative errors of the
numbers rounded to a single digit are

  (8 - 7.921)/7.921
  [1] 0.009973488

  (8 - 7.92)/7.92
  [1] 0.01010101

In the first case, this is less than 10^-2 and so, l = 1
is used. In the second case, the relative error for l = 1
is larger than 10^-2 an so, l = 2 is chosen.

In the cases

  print(5.9994, digits=4)
  [1] 5.999
  print(5.9994001, digits=4)
  [1] 6

the relative errors of one digit output are

  (6 - 5.9994)/5.9994
  [1] 0.00010001
  
  (6 - 5.9994001)/5.9994001
  [1] 9.999333e-05

Here, one digit output is chosen if the relative error is
less than 10^-4 and not otherwise.

Petr Savicky.

Petr Savicky

Wed, Feb 9, 2011 4:55 AM #

On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:

[...]

I inspected the source of scientific() and formatReal() (2.13.0, revision
2011-02-08 r54284). Let me point out an example of the difference between
the output of print() and rounding to "digits" significant digits, which
is slightly different from the previous ones, since also the exponent
changes. Namely,

  print(9.991, digits=3)
  [1] 10

while rounding to 3 digits yields 9.99. The reason is in scientific(),
where the situation that rounding increases the exponent is tested using

    /* make sure that alpha is in [1,10) AFTER rounding */

    if (10.0 - alpha < eps*alpha) {
        alpha /= 10.0;
        kp += 1;
    }

Here, eps is determined in formatReal() as

    double eps = pow(10.0, -(double)R_print.digits);

so we have eps = 10^-digits. The above condition on alpha is equivalent to

  alpha > 10.0/(1 + 10^-digits)

For digits=3, this is

  alpha > 9.99000999000999

This bound may be verified as follows

  print(9.9900099900, digits=3)
  [1] 9.99

  print(9.9900099901, digits=3)
  [1] 10

The existing algorithm for choosing the number of digits is designed to
predict the format suitable for all numbers in a vector before the actual
call of sprintf() or snprintf(). For speed, this algorithm should use
the standard double precision, so it is necessarily inaccurate for precision
15 and more and there may be some rare such cases also for smaller
precisions. For smaller precisions, say below 7, the algorithm can be made
more precise. This would change the output in rare cases and mainly for
printing single numbers. If a vector is printed, then the format is typically
not determined by the problematic numbers.

Changing the default behavior may be undesirable for backward compatibility
reasons. If this is the case, then a possible solution is to make the
documentation more precise on this and include pointers to possible
solutions. The functions sprintf(), formatC() and signif() may be used. In
particular, if signif() is used to round the numbers before printing, then we
get the correct output

  print(signif(7.921, digits=2))  
  [1] 7.9

  print(signif(9.9900099901, digits=3))
  [1] 9.99

The current ?print.default contains

  digits: a non-null value for ?digits? specifies the minimum number of
          significant digits to be printed in values.

The word "minimum" here means that all numbers in the vector will have
at least the chosen number of digits, but some may have more. I suggest
to add "See 'options' for more detail.".

The current ?options contains

    ?digits?: controls the number of digits to print when printing
          numeric values.  It is a suggestion only.

I suggest to extend this to

   It is a suggestion only and the actual number of printed digits
   may be smaller, if the relative error of the output number is
   less than 10^-digits. Use 'signif(, digits)' before printing to get
   the exact number of the printed significant digits.

I appreciate to know the opinion of R developers on this.

Petr Savicky.

Petr Savicky

Sun, Feb 13, 2011 4:16 AM #

On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:

[...]

[...]

In a previous email, i discussed the case print(9.991, digits=3).
This case is already handled in your R level experimental function
scientific(). I noticed this only after sending the previous email.

I would like to suggest a modification of the algorithm. The purpose
is to determine the minimum number of digits, such that the printed number
has the same value as the number rounded to exactly getOption("digits")
digits, but shorter representation.

Algorithm. First, the mantissa of the number rounded to "digits" digits
is computed as an integer by the formula

  a <- floor(alpha*10^(digits - 1) + 0.5)

The obtained integer number a satisfies

  10^(digits-1) <= a <= 10^digits

The case a = 10^digits corresponds to the situation that we have to
increase the exponent. This may be tested either before the remaining
part of the code or as the last step as below.

For example, if alpha = 3.597 and digits = 3, we get a = 360. The
question, whether (digits - 1) digits are sufficient for printing the
number, is equivalent to the condition that "a" is divisible by 10.
Similarly, (digits - 2) digits are sufficient if and only if "a" is
divisible by 100. This may be tested using the following code

    nsig <- digits
    for (j in 1:nsig) {
        a <- a / 10
        if (a == floor(a)) {
            nsig <- nsig - 1
        } else {
            break
        }
    }
    ## nsig == 0 if and only if we started with a == 10^digits
    if (nsig == 0) {
        nsig <- 1
        kpower <- kpower + 1
    }

This code uses double precision for a, since values up to 10^digits
may occur and digits may be 15 or more. The algorithm is not exact
for digits = 15, but works reasonably. I suggest to use a different,
slower and more accurate algorithm, if getOption("digits") >= 16.

Please, find in an attachment two versions of the above algorithm. One
is a modification of your R level function from scientific.R and the
other is a patch against src/main/format.c, which i tested.

I suggest to consider the following test cases. The presented output
is obtained using the patch.

  example1 <- c(
  7.94999999999999,
  7.95000000000001,
  8.04999999999999,
  8.05000000000001)
  for (x in example1) print(x, digits=2)
  
  [1] 7.9
  [1] 8
  [1] 8
  [1] 8.1
  
  example2 <- c(
  3.59949999999999,
  3.59950000000001,
  3.60049999999999,
  3.60050000000001)
  for (x in example2) print(x, digits=4)
  
  [1] 3.599
  [1] 3.6
  [1] 3.6
  [1] 3.601
  
  example3 <- c(
  1.00000049999999,
  1.00000050000001)
  for (x in example3) print(x, digits=7)
  
  [1] 1
  [1] 1.000001
  
  example4 <- c(
  9.99999949999999,
  9.99999950000001)
  for (x in example4) print(x, digits=7)
  
  [1] 9.999999
  [1] 10

I appreciate comments.

Petr Savicky.

-------------- next part --------------
###--- R function that does the same as 'scientific' in

###--- /usr/local/R-0.50/src/main/format.c
###--- ~~~~~~~~~~~||||||~~~~~~~~~~~~~~~~~~

scientific1 <- function(x, digits = getOption('digits')) ## eps = 10^-(digits)
{
  ##-   /* for 1 number  x , return
  ##-    *      sgn    = 1_{x < 0}  {0/1}
  ##-    *      kpower = Exponent of 10;
  ##-    *      nsig   = min(digits, #{significant digits of alpha})
  ##-    *
  ##-    * where  |x| = alpha * 10^kpower   and  1 <= alpha < 10
  ##-    */

  eps <- 10 ^(-digits)

  if (x == 0) {
    kpower <- 0
    nsig <- 1
    sgn <- 0
  } else {
    if(x < 0) {
      sgn <- 1; r <- -x
    } else {
      sgn <- 0; r <- x
    }
    kpower <- floor(log10(r));##-->      r = |x| ;  10^k <= r

    if (kpower <= -308) ## close to denormalization -- added for R 2.x.y
        alpha <- (r * 1e+30) / 10^(kpower+30)
    else
        alpha <- r / 10^kpower

    ## "a" integer, 10^(digits-1) <= a <= 10^digits

    a <- floor(alpha*10^(digits - 1) + 0.5)
    nsig <- digits
    for (j in 1:nsig) {
        a <- a / 10
        if (a == floor(a)) {
            nsig <- nsig - 1
        } else {
            break
        }
    }
    if (nsig == 0) {
        nsig <- 1
        kpower <- kpower + 1
    }
  }
  left <- kpower + 1
  c(sgn = sgn, kpower = kpower, nsig = nsig,
    left = left, right = nsig - left,
    sleft = sgn + max(1,left))
}

-------------- next part --------------
diff --minimal -U 5 -r R-devel/src/main/format.c R-devel-print/src/main/format.c
--- R-devel/src/main/format.c	2010-04-27 17:52:24.000000000 +0200
+++ R-devel-print/src/main/format.c	2011-02-12 11:07:37.000000000 +0100
@@ -167,28 +167,26 @@
 	    alpha = (r * 1e+30)/pow(10.0, (double)(kp+30));
 	}
 	else
 	    alpha = r / pow(10.0, (double)kp);
 
-	/* make sure that alpha is in [1,10) AFTER rounding */
-
-	if (10.0 - alpha < eps*alpha) {
-	    alpha /= 10.0;
-	    kp += 1;
-	}
-	*kpower = kp;
-
-	/* compute number of digits */
-
-	*nsig = R_print.digits;
-	for (j = 1; j <= *nsig; j++) {
-	    if (fabs(alpha - floor(alpha+0.5)) < eps * alpha) {
-		*nsig = j;
-		break;
-	    }
-	    alpha *= 10.0;
-	}
+        /* alpha integer, 10^(digits-1) <= alpha <= 10^digits */
+        alpha = floor(alpha*pow(10.0, R_print.digits - 1.0) + 0.5);
+        *nsig = R_print.digits;
+        for (j = 1; j <= R_print.digits; j++) {
+            alpha /= 10.0;
+            if (alpha == floor(alpha)) {
+                (*nsig)--;
+            } else {
+                break;
+            }
+        }
+        if (*nsig == 0) {
+            *nsig = 1;
+            kp += 1;
+        }
+        *kpower = kp;
     }
 }
 
 /*
    The return values are