infelicity in `na.print = ""` for numeric columns of data frames/formatting numeric values

3 messages · Martin Maechler, Ben Bolker

#
format(c(1:2, NA)) gives the last value as "NA" rather than 
preserving it as NA, even if na.encode = FALSE (which does the 
'expected' thing for character vectors, but not numeric vectors).

   This was already brought up in 2008 in 
https://bugs.r-project.org/show_bug.cgi?id=12318 where Gregor Gorjanc 
pointed out the issue. Documentation was added and the bug closed as 
invalid. GG ended with:

 > IMHO it would be better that na.encode argument would also have an
 > effect for numeric like vectors. Nearly any function in R returns NA
 > values and I expected the same for format, at least when na.encode=FALSE.

   I agree!

   I encountered this in the context of printing a data frame with 
na.print = "", which works as expected when printing the individual 
columns but not when printing the whole data frame (because 
print.data.frame calls format.data.frame, which calls format.default 
...).  Example below.

   It's also different from what you would get if you converted to 
character before formatting and printing:

print(format(as.character(c(1:2, NA)), na.encode=FALSE), na.print ="")
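
To make the difference concrete, the following small check compares the numeric and character cases side by side. It is only an illustrative sketch; the variable names are made up for this example, and the `is.na()` results in the comments follow from the behaviour described above.

```r
x <- c(1:2, NA)

# Numeric input: the NA is turned into the string "NA", so is.na() is lost,
# whether or not na.encode = FALSE is given.
num_default <- format(x)
num_noenc   <- format(x, na.encode = FALSE)
is.na(num_default)  # FALSE FALSE FALSE
is.na(num_noenc)    # FALSE FALSE FALSE

# Character input: na.encode = FALSE keeps the NA as a real NA_character_.
chr_noenc <- format(as.character(x), na.encode = FALSE)
is.na(chr_noenc)    # FALSE FALSE TRUE

# na.print can only blank out elements that are still NA.
print(num_noenc, na.print = "")
print(chr_noenc, na.print = "")
```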

   Everything about this is documented (if you look carefully enough), 
but IMO it violates the principle of least surprise 
https://en.wikipedia.org/wiki/Principle_of_least_astonishment , so I 
would call it at least an 'infelicity' (sensu Bill Venables).

   Is there any chance that this design decision could be revisited?

   cheers
    Ben Bolker


---

   Consider

dd <- data.frame(f = factor(1:2), c = as.character(1:2),
                 n = as.numeric(1:2), i = 1:2)
dd[3,] <- rep(NA, 4)
print(dd, na.print = "")


print(dd, na.print = "")
   f c  n  i
1 1 1  1  1
2 2 2  2  2
3     NA NA

This is in fact as documented (see below), but seems suboptimal given 
that printing the columns separately with na.print = "" would 
successfully print the NA entries as blank even in the numeric columns:

invisible(lapply(dd, print, na.print = ""))
[1] 1 2
Levels: 1 2
[1] "1" "2"
[1] 1 2
[1] 1 2

* ?print.data.frame documents that it calls format() for each column 
before printing
* the code of print.data.frame() shows that it calls format.data.frame() 
with na.encode = FALSE
* ?format.data.frame specifically notes that na.encode "only applies to 
elements of character vectors, not to numerical, complex nor logical 
'NA's, which are always encoded as '"NA"'".

    So the NA values in the numeric columns become "NA" rather than 
remaining as NA values, and are thus printed rather than being affected 
by the na.print argument.
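
The mechanism described in the bullets above can be inspected directly. This is a sketch for illustration only: it calls format.data.frame() the way print.data.frame() is said to call it, and checks which entries of the third row are still NA afterwards (the name `fd` is just for this example).

```r
dd <- data.frame(f = factor(1:2), c = as.character(1:2),
                 n = as.numeric(1:2), i = 1:2)
dd[3, ] <- rep(NA, 4)

# Format the data frame with na.encode = FALSE, as print.data.frame() does.
fd <- format.data.frame(dd, na.encode = FALSE)

# Factor and character columns keep a real NA in row 3 ...
is.na(fd$f[3])
is.na(fd$c[3])

# ... while the numeric and integer columns now hold the string "NA",
# which na.print cannot blank out.
is.na(fd$n[3])
is.na(fd$i[3])
```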
1 day later
#
> format(c(1:2, NA)) gives the last value as "NA" rather than 
    > preserving it as NA, even if na.encode = FALSE (which does the 
    > 'expected' thing for character vectors, but not numeric vectors).

    > This was already brought up in 2008 in 
    > https://bugs.r-project.org/show_bug.cgi?id=12318 where Gregor Gorjanc 
    > pointed out the issue. Documentation was added and the bug closed as 
    > invalid. GG ended with:

    >> IMHO it would be better that na.encode argument would also have an
    >> effect for numeric like vectors. Nearly any function in R returns NA
    >> values and I expected the same for format, at least when na.encode=FALSE.

    > I agree!

I do too, at least "in principle", keeping in mind that
backward compatibility is also an important principle ...

Not sure if the 'na.encode' argument should matter or possibly a
new optional argument, but "in principle" I think that

  format(c(1:2, NA, 4))

should preserve is.na(.) even by default.

    > I encountered this in the context of printing a data frame with 
    > na.print = "", which works as expected when printing the individual 
    > columns but not when printing the whole data frame (because 
    > print.data.frame calls format.data.frame, which calls format.default 
    > ...).  Example below.

    > It's also different from what you would get if you converted to 
    > character before formatting and printing:

    > print(format(as.character(c(1:2, NA)), na.encode=FALSE), na.print ="")

    > Everything about this is documented (if you look carefully enough), 
    > but IMO it violates the principle of least surprise 
    > https://en.wikipedia.org/wiki/Principle_of_least_astonishment , so I 
    > would call it at least an 'infelicity' (sensu Bill Venables)

    > Is there any chance that this design decision could be revisited?

We'd have to hear other opinions / gut feelings.

Also, someone (not me) would ideally volunteer to run
'R CMD check <pkg>' for a few 1000 (not necessarily all) CRAN &
BioC packages with an accordingly patched version of R-devel
(I might volunteer to create such a branch, e.g., a bit before the R
 Sprint 2023 end of August).


    > cheers
    > Ben Bolker


    > ---

The following issue you are raising
may really be a *different* one, as it involves format() and
print() methods for "data.frame", i.e.,

   format.data.frame() vs
    print.data.frame()

which is quite a bit related, of course, to how 'numeric'
columns are formatted -- as you note yourself below;
I vaguely recall that the data.frame method could be an even
"harder problem" .. but I don't remember the details.

It may also be that there are no changes necessary to the
*.data.frame() methods, and only the documentation (you mention)
should be updated ...

Martin

    > Consider

    > dd <- data.frame(f = factor(1:2), c = as.character(1:2), n = 
    > as.numeric(1:2), i = 1:2)
    > dd[3,] <- rep(NA, 4)
    > print(dd, na.print = "")


    > print(dd, na.print = "")
    >   f c  n  i
    > 1 1 1  1  1
    > 2 2 2  2  2
    > 3     NA NA

    > This is in fact as documented (see below), but seems suboptimal given 
    > that printing the columns separately with na.print = "" would 
    > successfully print the NA entries as blank even in the numeric columns:

    > invisible(lapply(dd, print, na.print = ""))
    > [1] 1 2
    > Levels: 1 2
    > [1] "1" "2"
    > [1] 1 2
    > [1] 1 2

    > * ?print.data.frame documents that it calls format() for each column 
    > before printing
    > * the code of print.data.frame() shows that it calls format.data.frame() 
    > with na.encode = FALSE
    > * ?format.data.frame specifically notes that na.encode "only applies to 
    > elements of character vectors, not to numerical, complex nor logical 
    > 'NA's, which are always encoded as '"NA"'".

    > So the NA values in the numeric columns become "NA" rather than 
    > remaining as NA values, and are thus printed rather than being affected 
    > by the na.print argument.

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel
#
On 2023-06-05 9:27 a.m., Martin Maechler wrote:
I would say it should preserve `is.na` *only* if na.encode = FALSE - 
that seems like the minimal appropriate change away from the current 
behaviour.

I might be willing to run those checks, although it would be nice if there 
were a pre-existing framework (analogous to r-lib/revdepcheck) for 
automating it and collecting the results ...

I *think* that if format.default() were changed so that 
na.encode=FALSE also applied to numeric types, then data frame printing 
would naturally work 'right' (since print.data.frame calls 
format.data.frame, which calls format() for the individual columns 
specifying na.encode=FALSE ...)
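
As a rough illustration of the behaviour proposed here, the sketch below emulates it from user code. The helper `format_keep_na()` is hypothetical (it is not part of base R and is not a patch to format.default()); it simply restores NA after formatting when na.encode = FALSE, and is then applied column by column to the example data frame.

```r
# Hypothetical helper (not in base R): like format(), but when
# na.encode = FALSE it restores NA for every element that was NA in x.
format_keep_na <- function(x, ..., na.encode = TRUE) {
  out <- format(x, ..., na.encode = na.encode)
  if (!na.encode) out[is.na(x)] <- NA_character_
  out
}

is.na(format_keep_na(c(1:2, NA)))                     # FALSE FALSE FALSE
is.na(format_keep_na(c(1:2, NA), na.encode = FALSE))  # FALSE FALSE TRUE

dd <- data.frame(f = factor(1:2), c = as.character(1:2),
                 n = as.numeric(1:2), i = 1:2)
dd[3, ] <- rep(NA, 4)

# Format column by column, then print the resulting character matrix;
# every NA survives formatting, so na.print = "" can blank all of them.
m <- sapply(dd, format_keep_na, na.encode = FALSE)
print(m, na.print = "", quote = FALSE)
```

With the default na.encode = TRUE the helper behaves like plain format(), so nothing changes unless the caller opts in.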