Formatting numbers with a limited amount of digits consistently - R-help

Henrik Andersson

Mon, May 30, 2005 7:43 AM

I have tried to get signif, round and format to display numbers like 
these consistently in a table, using e.g. signif(x,digits=3)

17.01
18.15

I want

17.0
18.2

Not

17
18.2


Why is the last digit stripped off in the case when it is zero?

Is this a "feature" of R or did I miss something?



---------------------------------------------
Henrik Andersson
Netherlands Institute of Ecology -
Centre for Estuarine and Marine Ecology
P.O. Box 140
4400 AC Yerseke
Phone: +31 113 577473
h.andersson at nioo.knaw.nl
http://www.nioo.knaw.nl/ppages/handersson

Duncan Murdoch

Mon, May 30, 2005 10:20 AM

Henrik Andersson wrote:

signif() changes the value; you don't want that, you want to affect how 
a number is displayed.  Use format() or formatC() instead, for example

 > x <- c(17.01, 18.15)
 > format(x, digits=3)
[1] "17.0" "18.1"
 > noquote(format(x, digits=3))
[1] 17.0 18.1

As to whether this is a "feature" of R or something you missed: I'd say both.
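
A minimal sketch of the distinction drawn above, using the same two values. signif() hands back altered numbers, while format() hands back strings and leaves x untouched. The names s and shown are only for illustration.

```r
x <- c(17.01, 18.15)

# signif() returns a new numeric vector: 17.01 becomes the number 17,
# and a bare 17 prints without a trailing zero
s <- signif(x, digits = 3)
is.numeric(s)

# format() returns character strings for display and leaves x unchanged
shown <- format(x, digits = 3)
is.character(shown)
shown[1]   # "17.0", as in the output above
```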

Duncan Murdoch

Gabor Grothendieck

Mon, May 30, 2005 10:57 AM

On 5/30/05, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:

That works in the above context but I don't think it works generally:

R> f <- head(faithful)
R> f
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

R> format(f, digits = 3)
  eruptions waiting
1      3.60      79
2      1.80      54
3      3.33      74
4      2.28      62
5      4.53      85
6      2.88      55

R> # this works in this case
R> noquote(prettyNum(round(f,1), nsmall = 1))
     eruptions waiting
[1,] 3.6       79.0   
[2,] 1.8       54.0   
[3,] 3.3       74.0   
[4,] 2.3       62.0   
[5,] 4.5       85.0   
[6,] 2.9       55.0   

Even that fails once a column contains a large enough number, such as
1e6: the whole column is then shown in exponent (e) notation rather
than ordinary notation, which is presumably not what is wanted.

R> f[1,1] <- 1e6 + 0.11
R> noquote(prettyNum(round(f,1), nsmall = 1))
     eruptions waiting
[1,] 1.0e+06   79.0   
[2,] 1.8e+00   54.0   
[3,] 3.3e+00   74.0   
[4,] 2.3e+00   62.0   
[5,] 4.5e+00   85.0   
[6,] 2.9e+00   55.0   
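
One possible way around the exponent notation, offered as a sketch rather than a general fix: format() accepts scientific = FALSE together with nsmall. The vector x below is made up for illustration and is rounded first, as in the example above.

```r
# Illustrative values only: one large number among small ones
x <- round(c(1e6 + 0.11, 1.8, 3.333), 1)

# scientific = FALSE forces fixed notation;
# nsmall = 1 asks for at least one digit after the decimal point
out <- format(x, nsmall = 1, scientific = FALSE)
noquote(out)
```

This works on a plain numeric vector; a data frame would still need to be handled column by column.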

I have struggled with this myself. I can usually come up with something
for a specific case, but I have generally found it a pain to format a
table exactly as I want without undue effort.  Maybe someone else has
figured this out.

Duncan Murdoch

Mon, May 30, 2005 12:12 PM

Gabor Grothendieck wrote:

formatC with format="f" seems to work for me, though it assumes you're 
specifying decimal places rather than significant digits.  It also wants 
a vector of numbers as input, not a dataframe.  So the following gives 
pretty flexible control over what a table will look like:

 > data.frame(eruptions = formatC(f$eruptions, digits=2, format='f'),
+            waiting = formatC(f$waiting, digits=1, format='f'))
   eruptions waiting
1 1000000.11    79.0
2       1.80    54.0
3       3.33    74.0
4       2.28    62.0
5       4.53    85.0
6       2.88    55.0
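
For comparison, a small sketch of the two modes on the values from the original question. With format = "f" the digits argument counts decimal places, while with format = "g" it counts significant digits. The flag = "#" argument is not used elsewhere in this thread; it is added here because it keeps trailing zeros in "g" mode.

```r
x <- c(17.01, 18.15)

# digits = number of decimal places
fixed <- formatC(x, digits = 1, format = "f")

# digits = number of significant digits; "#" keeps the trailing zero
sig <- formatC(x, digits = 3, format = "g", flag = "#")

fixed[1]  # "17.0"
sig[1]    # "17.0"
```

Note that "g" mode can still switch to exponent notation for large values, so it is not a drop-in replacement for "f".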

I think that formatting tables properly requires some thought, and R is 
no good at thinking.  You can easily recognize a badly formatted table, 
but it's very hard to write down rules that work in general 
circumstances.  It's also a matter of taste, so if I managed to write a 
function that matched my taste, you would find you wanted to make changes.

It's sort of like expecting plot(x, y) to always come up with the best 
possible plot of y versus x.  It's just not a reasonable expectation. 
It's better to provide tools (like abline() for plots or formatC() for 
tables) that allow you to tailor a plot or table to your particular needs.

Duncan Murdoch

Gabor Grothendieck

Mon, May 30, 2005 8:53 PM

On 5/30/05, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:

Thanks.  That seems to be the idiom I was missing.  One thing that would
be nice would be if formatC could handle data frames.
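
A rough sketch of what such a wrapper could look like. format_df is a hypothetical name, not an existing R function; it simply applies formatC() with format = "f" to each numeric column and leaves other columns alone.

```r
# Hypothetical helper for illustration; not part of R or any package
format_df <- function(df, digits = 1) {
  # recycle digits so each column can have its own number of decimal places
  digits <- rep(digits, length.out = ncol(df))
  out <- df
  for (j in seq_along(df)) {
    if (is.numeric(df[[j]])) {
      out[[j]] <- formatC(df[[j]], digits = digits[j], format = "f")
    }
  }
  out
}

f <- head(faithful)
f[1, 1] <- 1e6 + 0.11
res <- format_df(f, digits = c(2, 1))
res
```

The result is a data frame of character columns, so it is suitable for display only; the original f keeps its numeric values.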

Marc Schwartz

Tue, May 31, 2005 7:30 AM

On Mon, 2005-05-30 at 23:53 -0400, Gabor Grothendieck wrote:

Guys, perhaps I am missing something here, but there seems to be some
confusion as to how the numbers are stored internally, versus how the
output is displayed and the meaning of "significant digits", which is
what I believe Henrik's original query was about.

By default, R's printed output uses the settings from options("digits")
and options("scipen") to define output based upon the number of
significant digits, which is of course not the same as the number of
decimal places. Hence the variance in the output that Henrik gets and
why the trailing zero is dropped.

The use of signif() does not help here because it is still based upon
the number of significant digits, where the trailing zero still gets
dropped.

The approaches above are "inexact" when it comes to creating formatted
output for a table with a consistent number of decimal places to align
columns of numbers.

format() is still problematic here because it too uses the number of
significant digits, defaulting to options("digits").

Using formatC() or sprintf() in conjunction with cat() is usually the
best way to gain control over how numeric output is formatted,
especially in a nicely aligned table. This is what I use in
CrossTable(), where I want decimal-aligned columns for numbers in the
tabular output, along with fixed-width columns for textual output
(i.e. labels, etc.).

Briefly, along the lines of Gabor's example on the output using the
faithful dataset above, one could use something like:

  eruptions waiting
1 3.6       79.0
2 1.8       54.0
3 3.3       74.0
4 2.3       62.0
5 4.5       85.0
6 2.9       55.0

which only affects how the data is printed, not the data itself. It can
work fine for a 2D object that has all numeric columns. 

Note however that the numeric columns are left-aligned, not right-
aligned, as in the default print method, since the output of the above
function is a character matrix, rather than a data.frame with numeric
columns. Hence, note:

  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
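
The commands that produced the two listings above did not survive in this copy of the message. As a sketch under that caveat, one call that yields a character matrix with one decimal place is formatC() applied to as.matrix(f), and the right = TRUE argument of print() restores right alignment. Here f is assumed to be a fresh head(faithful).

```r
f <- head(faithful)

# formatC() keeps the matrix dimensions and column names
m <- formatC(as.matrix(f), digits = 1, format = "f")

# quote = FALSE drops the quotation marks; right = TRUE right-aligns the strings
print(m, quote = FALSE, right = TRUE)
```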


Thus, for greater control, one should use sprintf() and cat():


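# Header line: two right-justified 15-character columns
out.lines <- sprintf("%15s %15s\n", colnames(f)[1], colnames(f)[2])

# One line per row, each value shown with one decimal place
for (i in 1:nrow(f))
{
  out.lines <- c(out.lines, 
                 sprintf("%14.1f  %14.1f\n", f[i, 1], f[i, 2]))
}

# cat() puts a space between elements, which supplies the 15th column
cat(out.lines)

      eruptions         waiting
            3.6            79.0
            1.8            54.0
            3.3            74.0
            2.3            62.0
            4.5            85.0
            2.9            55.0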



In the above case, one can specify the column widths for the column
labels and the row values. Of course, the above could be extended to
become a generic function for data frames with multiple data types, with
arguments enabling the specification of column widths, number of decimal
places, etc. One might even want more than one specification for the
number of decimal places depending upon the nature of the columns on the
object to be printed, so vectors could be used for these arguments.

I'll leave that for further exercise.
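
One way that exercise might go, as a sketch only. print_fixed is a hypothetical name, and the width and decimals arguments are assumptions about what such a function would take; decimals is recycled so that each column can have its own setting.

```r
# Hypothetical helper for illustration; not part of R or any package
print_fixed <- function(df, width = 15, decimals = 1) {
  # recycle so each column can have its own number of decimal places
  decimals <- rep(decimals, length.out = ncol(df))

  # header: column names right-justified to the common width
  lines <- paste(sprintf(paste0("%", width, "s"), colnames(df)),
                 collapse = " ")

  for (i in seq_len(nrow(df))) {
    cells <- vapply(seq_along(df), function(j) {
      v <- df[[j]][i]
      if (is.numeric(v)) {
        sprintf(paste0("%", width, ".", decimals[j], "f"), v)
      } else {
        sprintf(paste0("%", width, "s"), as.character(v))
      }
    }, character(1))
    lines <- c(lines, paste(cells, collapse = " "))
  }

  cat(lines, sep = "\n")
  invisible(lines)
}

lines <- print_fixed(head(faithful), width = 10, decimals = c(1, 0))
```

The format strings are built with paste0() because R's sprintf() allows an asterisk for the width or the precision, but not both at once.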

Final note to Henrik: Note that the IEEE 754 rounding standard as
implemented in R results in:

[1] 18.1

[1] "18.1"

[1] " 18.1"

This is because the rounding method implemented is the "go to the even
digit" approach. Thus, you don't get 18.2. 

See ?round for more information.

HTH,

Marc Schwartz

Duncan Murdoch

Tue, May 31, 2005 8:11 AM

Marc Schwartz wrote:

I don't think "go to the even digit" is being applied here: ".1" is not
an even digit.

I suspect what's going on in this example is that 18.15 is not being 
represented exactly; it's stored internally as something slightly less 
than that value, so it rounds down.

You'd see the "go to the even digit" rule applied when rounding 17.5 or 
18.5, which can be represented exactly, being fractions with a power of 
2 in the denominator:

 > round(18.5, 0)
[1] 18
 > round(17.5, 0)
[1] 18

(This is very gratifying.  Usually when I try to predict the exact 
behaviour of round() or signif() I end up having to rewrite my 
prediction afterwards.  But this time I got it right. Honest!)

Duncan Murdoch

Gabor Grothendieck

Tue, May 31, 2005 10:25 AM

On 5/31/05, Marc Schwartz <MSchwartz at mn.rr.com> wrote:

format() is still problematic here because it too uses the number of
significant digits, defaulting to options("digits").

Good point.  It would be nice if format had an argument that allowed
one to specify the number of digits after the decimal place.  I think
this would reduce frustrations in quickly formatting data frames.

Marc Schwartz

Tue, May 31, 2005 4:22 PM

On Tue, 2005-05-31 at 11:11 -0400, Duncan Murdoch wrote:

Duncan,

Just got back from a day-long meeting.

You are indeed correct on the rounding here. If you look at how 18.15
appears when printed with more significant digits:

[1] 18.149999999999998579
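
The command behind the line above is not preserved in this copy. A sketch of one way to see the same digits uses sprintf() with an explicit number of decimals; the 18-decimal format is an arbitrary choice, and print(18.15, digits = 20) is another option.

```r
# 18.15 has no exact binary representation; the stored value is slightly low
low <- sprintf("%.18f", 18.15)
low                       # "18.149999999999998579"

# 18.5 is a fraction with a power of 2 in the denominator, so it is exact
exact <- sprintf("%.18f", 18.5)
exact                     # "18.500000000000000000"
```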

That's what I get for trying to deal with floating point representation
issues first thing after a three day weekend...  ;-)

Thanks for the correction.

Marc