[R-pkg-devel] Absent variables and tibble

15 messages · Johannes Ranke, Thierry Onkelinx, Hadley Wickham +3 more

#
My package 'lsmeans' is now suddenly broken because of a new provision in the 'tibble' package (loaded by 'dplyr' 0.5.0), whereby the "[[" and "$" methods for 'tbl_df' objects - as documented - throw an error if a variable is not found. 

The problem is that my code uses tests like this:

	if (is.null (x$var)) {...}

to see whether 'x' has a variable 'var'. Obviously, I can work around this using

	if (!("var" %in% names(x))) {...}

but (a) I like the first version better, in terms of the code being understandable; and (b) isn't there a long history whereby we can expect a NULL result when accessing an absent member of a list (and hence a data.frame)? (c) the code base for 'lsmeans' has about 50 instances of such tests.

Anyway, I wonder if a lot of other package developers test for absent variables in that first way; if so, they too are in for a rude awakening if their users provide a tbl_df instead of a data.frame. And what is considered the best practice for testing absence of a list member? Apparently, not either of the above; and because of (c), I want to do these many tedious corrections only once.
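
A minimal sketch of the two tests, assuming a plain base-R data.frame (the object 'x' and the column name 'a' are made up for illustration). On such an object both tests agree; the difference only shows up when a class supplies its own "$" method.

```r
# Two ways to test for an absent variable, shown on a plain data.frame.
x <- data.frame(a = 1:3)

# Relies on `$` returning NULL for a variable that is not there.
absent_by_null <- is.null(x$var)

# Looks at the names directly and never calls `$`.
absent_by_names <- !("var" %in% names(x))

print(absent_by_null)    # TRUE
print(absent_by_names)   # TRUE
```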

Thanks for any light you can shed.

Russ

Russell V. Lenth  -  Professor Emeritus
Department of Statistics and Actuarial Science   
The University of Iowa  -  Iowa City, IA 52242  USA   
Voice (319)335-0712 (Dept. office)  -  FAX (319)335-3017

Just because you have numbers, that doesn't necessarily mean you have data.
#
On 27/06/2016 9:22 AM, Lenth, Russell V wrote:
This is why CRAN asks that people test reverse dependencies.

I think the most defensive thing you can do is to write a small function

name_missing <- function(x, name)
     !(name %in% names(x))

and use name_missing(x, "var") in your tests.  (Pick your own name to 
make your code understandable if you don't like my choice.)
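
A short usage sketch of that helper, with a made-up data frame 'df'. Because %in% is vectorised over its left argument, several names can be checked in one call.

```r
# Helper as defined above: TRUE when 'name' is not among names(x).
name_missing <- function(x, name)
    !(name %in% names(x))

# Hypothetical data used only for illustration.
df <- data.frame(var = 1:3, other = letters[1:3])

one_present <- name_missing(df, "var")               # FALSE
one_absent  <- name_missing(df, "absent")            # TRUE
several     <- name_missing(df, c("var", "absent"))  # FALSE TRUE

print(several)
```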

You could suggest to the tibble maintainers that they add a function 
like this.

Duncan Murdoch
#
Am Montag, 27. Juni 2016, 10:03:35 schrieb Duncan Murdoch:
This is also being discussed here:

https://github.com/hadley/tibble/issues/91

Kind regards,

Johannes Ranke
#
On Mon, Jun 27, 2016 at 9:03 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
Which we did do - the problem is that this is actually caused by a
recursive reverse dependency (lsmeans -> dplyr -> tibble), and we
didn't correctly anticipate how much pain this would cause.
We're definitely going to add this.

And I think we'll make df[["var"]] return NULL too, so at least
there's one easy way to opt out.

The motivation for this change was that returning NULL + recycling
rules means it's very easy for errors to silently propagate. But I
think this approach might be somewhat too aggressive - I hadn't
considered that people use `is.null()` to check for missing columns.
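
A small sketch of the silent propagation described here, using a base data.frame and a deliberately misspelled column name ('typo' is hypothetical). Nothing below raises an error or a warning.

```r
df <- data.frame(x = 1:3)

# 'typo' is not a column, so df$typo is NULL.
scaled <- df$typo * 2   # arithmetic on NULL gives a zero-length vector
total  <- sum(df$typo)  # summing nothing gives 0

print(length(scaled))   # 0
print(total)            # 0
```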

We'll try and get an update to tibble out soon after useR.  Thoughts
on what we should do are greatly appreciated.

Hadley
#
Thanks, Hadley. I do understand why you'd want more careful checking. 

If you're going to provide a variable-existing function, may I suggest a short name like 'has'? I.e., has(x, var) returns TRUE if x has var in it. 

Thanks

Russ
#
Dear Russell.

The assertthat package (by Hadley) provides a has_name() function.
[1] TRUE
[1] FALSE
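
A hedged sketch of what such a call can look like, assuming the assertthat package is installed; the data frame 'x' and its column names are invented for illustration and are not the objects from the original message.

```r
# Only runs when assertthat is available.
if (requireNamespace("assertthat", quietly = TRUE)) {
    x <- data.frame(var = 1:3)

    present <- assertthat::has_name(x, "var")     # name exists
    absent  <- assertthat::has_name(x, "other")   # name does not exist

    print(present)
    print(absent)
}
```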

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

#
On 27/06/2016 10:46 AM, Hadley Wickham wrote:
In fact, it's even harder than that, according to a message Russell sent 
me in private.  Neither package depends on the other; it happens when a 
user passes a 'tbl_df' object to Russell's package, and the tibble 
methods get called for it.  This is an unfortunate consequence of the S3 
system:  there's no place to define exactly what S3 methods are supposed 
to do, and no easy way for a package writer to test against all possible 
objects that might get passed in.
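
To illustrate the S3 point with a toy case: the class 'strict_df' and the function 'has_var' below are hypothetical stand-ins, not tibble's or lsmeans' actual code. The same test works on a plain data.frame and fails on an object whose class brings its own "$" method.

```r
# A hypothetical class whose `$` method errors on unknown names.
`$.strict_df` <- function(x, name) {
    if (!(name %in% names(x)))
        stop("unknown column '", name, "'", call. = FALSE)
    .subset2(x, name)
}

# Package-style code written with only base data.frames in mind.
has_var <- function(x) !is.null(x$var)

plain  <- data.frame(a = 1)
strict <- structure(plain, class = c("strict_df", "data.frame"))

plain_result  <- has_var(plain)   # FALSE, as the author expects
strict_result <- tryCatch(has_var(strict),
                          error = function(e) conditionMessage(e))

print(plain_result)
print(strict_result)              # the error message from `$.strict_df`
```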

I guess my advice would be not to trigger an error in a case like this, 
though you might want to lobby for the base "[[" and "$" methods to 
(optionally?) do so.

Duncan Murdoch
#
The other thing you need to be aware of if you're using the other
approach is partial matching:

df <- data.frame(xyz = 1)
is.null(df$x)
#> [1] FALSE

Duncan - I think that argues for including a has_name() (hasName() ?)
function in base R. Is that something you'd consider?
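
A sketch comparing the three tests on the same data frame. It relies on base R behaviour only: "$" matches partially, while "[[" matches exactly by default.

```r
df <- data.frame(xyz = 1)

via_dollar  <- !is.null(df$x)       # `$` partially matches "xyz"
via_bracket <- !is.null(df[["x"]])  # `[[` requires an exact match
via_names   <- "x" %in% names(df)   # compares the names themselves

print(c(dollar = via_dollar, bracket = via_bracket, names = via_names))
```

Only the "$" version reports that "x" is present.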

Hadley

#
On 27/06/2016 1:09 PM, Hadley Wickham wrote:
Yes, I'd consider it.  I think hasName() would be more consistent with 
other has*() functions in the R sources.

I guess the implementation should be defined to be equivalent to

hasName <- function(x, name)
   name %in% names(x)

though it would make sense to make a faster internal implementation; 
!is.null(df$x) is quite a bit faster than "x" %in% names(df).

Duncan Murdoch
#
Hadley's note on partial matching has me scared the most concerning the is.null() coding. So the need for a hasName() (or whatever) function seems all the more compelling, and that it be in base R. Perhaps it should be generic, with a default method that searches in the names attribute, potentially extensible to other classes.

Thanks so much, several of you, for your positive and helpful responses.

Russ

#
On 27/06/2016 10:15 PM, Lenth, Russell V wrote:
I am thinking of putting it in, but if I do the definition will be 
equivalent to the one-liner down below.  That's already slower than the 
is.null() test; making it generic would slow it down too much.

Duncan Murdoch
#
Currently exists("someName", where=someDataFrame) reports if "someName"
is a column of the data.frame 'someDataFrame', and the 'where=' may be
omitted.  If we have an environment we use
exists("someName", envir=someEnvironment).  It might be nice to continue
using exists() instead of introducing a new function has(), although,
since we want the same syntax to work for environments, data.frames,
tbl_dfs, data.tables, etc., we may need the new function.
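
A minimal sketch of both forms, reusing the placeholder names from the paragraph above. Passing inherits = FALSE is an addition here, so that the lookup is limited to the object itself.

```r
someDataFrame <- data.frame(someName = 1:3)

# Data frame: pass it as 'where'.
in_df     <- exists("someName",  where = someDataFrame, inherits = FALSE)
not_in_df <- exists("otherName", where = someDataFrame, inherits = FALSE)

# Environment: pass it as 'envir'.
someEnvironment <- new.env()
assign("someName", 1, envir = someEnvironment)
in_env <- exists("someName", envir = someEnvironment, inherits = FALSE)

print(c(in_df = in_df, not_in_df = not_in_df, in_env = in_env))
```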


Bill Dunlap
TIBCO Software
wdunlap tibco.com

#
On 28/06/2016 10:03 AM, William Dunlap wrote:
One issue with exists("someName", someDataFrame) is that it's quite a 
bit slower.  (I think it converts the dataframe to an environment.) On 
the other hand, getting the names from an environment requires more work 
than checking for one, so exists("someName", someEnvironment) is faster 
than checking for the name in the obvious way.  The slow operations
could be sped up, but is that worth the effort?

The other issue with exists() is that it has a complicated definition 
and hard to follow argument list (with args "where", "envir", "frame" 
that all do related things); the thing I like about hasName() is that it 
is very clear what it does.  A criticism of it is that it is hardly any 
shorter than just doing

   name %in% names(x)

so is there really any point in making a function for this?

Duncan Murdoch
#
I've now added a simple implementation of hasName to R-devel and 
R-patched.  When I find the time, I'll go through the base packages and 
change the !is.null(x$name) idiom to hasName.  (All but "base", that 
is:  hasName is in utils, and it is better if base remains self-contained.)

If any bottlenecks turn up, I could make hasName faster by redoing it in 
C code, but so far it is just R code very similar to the %in% 
implementation.
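
A short usage sketch, assuming an R version that ships utils::hasName; the objects 'x' and 'df' are invented for illustration.

```r
x  <- list(var = 1, other = "a")
df <- data.frame(xyz = 1)

found     <- utils::hasName(x, "var")     # TRUE
not_found <- utils::hasName(x, "absent")  # FALSE

# Names are compared exactly, so "x" does not match the column "xyz".
no_partial <- utils::hasName(df, "x")     # FALSE

print(c(found, not_found, no_partial))
```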

Duncan Murdoch
#
On 28/06/2016 1:15 PM, Duncan Murdoch wrote:
After looking at a few cases, I don't think I'll do that.  Often the 
test is used to find out if x$name will work.  hasName(x, "name") is not 
sufficient for that:  x might have that as a name, but x$name won't 
work, e.g. in a named numeric vector.  I don't think we have a simple 
test corresponding to

!is.null(x$name) && hasName(x, "name")

Probably the best approach is to run tests with 
options(warnPartialMatchDollar = TRUE), and just use the simple 
!is.null(x$name).
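
Both points can be sketched as follows, assuming an R version that ships utils::hasName; the objects 'v' and 'l' are made up. The first part shows a named numeric vector for which hasName() is TRUE although "$" fails; the second shows the partial-match warning being turned on for the duration of a test.

```r
# hasName() says yes, but `$` is not defined for atomic vectors.
v <- c(name = 1)
has_it       <- utils::hasName(v, "name")
dollar_works <- tryCatch({ v$name; TRUE }, error = function(e) FALSE)

# With the option set, a partial match through `$` raises a warning.
old <- options(warnPartialMatchDollar = TRUE)
l <- list(xyz = 1)
partial_warned <- tryCatch({ l$x; FALSE }, warning = function(w) TRUE)
options(old)   # restore the previous setting

print(c(has_it = has_it, dollar_works = dollar_works,
        partial_warned = partial_warned))
```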

Duncan Murdoch