My package 'lsmeans' is now suddenly broken because of a new provision in the 'tibble' package (loaded by 'dplyr' 0.5.0), whereby the "[[" and "$" methods for 'tbl_df' objects - as documented - throw an error if a variable is not found.
The problem is that my code uses tests like this:
if (is.null (x$var)) {...}
to see whether 'x' has a variable 'var'. Obviously, I can work around this using
if (!("var" %in% names(x))) {...}
but (a) I like the first version better, in terms of the code being understandable; (b) isn't there a long history whereby we can expect a NULL result when accessing an absent member of a list (and hence a data.frame)?; and (c) the code base for 'lsmeans' has about 50 instances of such tests.
Anyway, I wonder if a lot of other package developers test for absent variables in that first way; if so, they too are in for a rude awakening if their users provide a tbl_df instead of a data.frame. And what is considered the best practice for testing absence of a list member? Apparently, neither of the above; and because of (c), I want to do these many tedious corrections only once.
Thanks for any light you can shed.
Russ
Russell V. Lenth - Professor Emeritus
Department of Statistics and Actuarial Science
The University of Iowa - Iowa City, IA 52242 USA
Voice (319)335-0712 (Dept. office) - FAX (319)335-3017
Just because you have numbers, that doesn't necessarily mean you have data.
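For readers who want to see the two tests from the message above side by side, here is a minimal sketch in base R only. The object `x` and its column name are made up for illustration, and the sketch shows the behaviour on a plain data.frame, not on a tbl_df.

```r
# A plain data.frame with one column; 'var' is deliberately absent.
x <- data.frame(other = 1:3)

# Test 1: rely on `$` returning NULL for an absent column.
absent_via_null <- is.null(x$var)

# Test 2: look the name up explicitly.
absent_via_names <- !("var" %in% names(x))

# On a plain data.frame both tests agree that 'var' is absent.
print(c(absent_via_null, absent_via_names))
```

Only the first test goes through the `$` method, so only the first test can be affected when a class supplies its own `$`.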
[R-pkg-devel] Absent variables and tibble
15 messages · Johannes Ranke, Thierry Onkelinx, Hadley Wickham +3 more
This is why CRAN asks that people test reverse dependencies.
I think the most defensive thing you can do is to write a small function
    name_missing <- function(x, name)
        !(name %in% names(x))
and use name_missing(x, "var") in your tests. (Pick your own name to
make your code understandable if you don't like my choice.)
You could suggest to the tibble maintainers that they add a function
like this.
Duncan Murdoch
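As a rough illustration of how the helper suggested above might replace an `is.null(x$var)` test, here is a small sketch. The helper is copied from the message; the objects `dat` and `opts` and the column names are hypothetical.

```r
# Duncan's helper, reproduced as written in the message above.
name_missing <- function(x, name)
    !(name %in% names(x))

# Hypothetical inputs: a data.frame and a plain list.
dat <- data.frame(trt = c("a", "b"), y = c(1.2, 3.4))
opts <- list(level = 0.95)

if (name_missing(dat, "weights")) {
    message("no 'weights' column; using equal weights")
}

# The helper is vectorised over 'name', so wrap it in any() or all()
# when checking several names inside if().
print(name_missing(dat, c("trt", "weights")))
print(name_missing(opts, "level"))
```

Because the helper only calls names(), it does not depend on how a particular class implements `$` or `[[`.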
This is also being discussed here: https://github.com/hadley/tibble/issues/91

Kind regards,
Johannes Ranke
On Mon, Jun 27, 2016 at 9:03 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
> This is why CRAN asks that people test reverse dependencies.

Which we did do - the problem is that this is actually caused by a recursive reverse dependency (lsmeans -> dplyr -> tibble), and we didn't correctly anticipate how much pain this would cause.

> I think the most defensive thing you can do is to write a small function
>
>     name_missing <- function(x, name)
>         !(name %in% names(x))
>
> and use name_missing(x, "var") in your tests. (Pick your own name to make
> your code understandable if you don't like my choice.)
>
> You could suggest to the tibble maintainers that they add a function like
> this.

We're definitely going to add this. And I think we'll make df[["var"]] return NULL too, so at least there's one easy way to opt out.

The motivation for this change was that returning NULL + recycling rules means it's very easy for errors to silently propagate. But I think this approach might be somewhat too aggressive - I hadn't considered that people use `is.null()` to check for missing columns.

We'll try and get an update to tibble out soon after useR. Thoughts on what we should do are greatly appreciated.

Hadley
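The point about NULL plus recycling rules can be illustrated with a short base R sketch. The data and the misspelled column name `wieght` are invented for the example; this is only one way such an error might slip through, not a description of any particular package's code.

```r
# Hypothetical data; 'wieght' below is a deliberate typo for 'weight'.
df <- data.frame(weight = c(60, 72, 81), height = c(1.6, 1.8, 1.9))

# The mistyped column is NULL, and NULL is treated as a zero-length
# vector in arithmetic, so the result has length 0 rather than failing.
bmi <- df$wieght / df$height^2
print(length(bmi))

# Summaries of a zero-length vector can look like legitimate values.
total <- sum(df$wieght)
print(total)
```

Neither line raises an error or a warning on a plain data.frame, which is the kind of silent propagation the stricter methods were meant to catch.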
Thanks, Hadley. I do understand why you'd want more careful checking. If you're going to provide a variable-existing function, may I suggest a short name like 'has'? I.e., has(x, var) returns TRUE if x has var in it.

Thanks
Russ
Dear Russell,

The assertthat package (by Hadley) provides a has_name() function.

library(assertthat)
x <- data.frame(y = NA)
has_name(x, "y")
[1] TRUE
has_name(x, "x")
[1] FALSE

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
______________________________________________ R-package-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-package-devel
On 27/06/2016 10:46 AM, Hadley Wickham wrote:
>> This is why CRAN asks that people test reverse dependencies.
>
> Which we did do - the problem is that this is actually caused by a recursive reverse dependency (lsmeans -> dplyr -> tibble), and we didn't correctly anticipate how much pain this would cause.

In fact, it's even harder than that, according to a message Russell sent me in private. Neither package depends on the other; it happens when a user passes a 'tbl_df' object to Russell's package, and the tibble methods get called for it.

This is an unfortunate consequence of the S3 system: there's no place to define exactly what S3 methods are supposed to do, and no easy way for a package writer to test against all possible objects that might get passed in.

I guess my advice would be not to trigger an error in a case like this, though you might want to lobby for the base "[[" and "$" methods to (optionally?) do so.

Duncan Murdoch
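The S3 situation described above can be reproduced without tibble. The sketch below defines a hypothetical class `strict_df` whose `$` method errors on unknown columns, and two stand-ins for package code (`has_var_old`, `has_var_new`); all of these names are assumptions made up for the example.

```r
# A hypothetical stricter data.frame subclass, defined in "someone
# else's" code. Its `$` method errors on unknown columns.
strict_df <- function(df) {
    structure(df, class = c("strict_df", class(df)))
}
`$.strict_df` <- function(x, name) {
    if (!(name %in% names(x))) {
        stop("unknown column '", name, "'")
    }
    .subset2(x, name)
}

# "Package" code written with plain data.frames in mind.
has_var_old <- function(x) !is.null(x$var)
has_var_new <- function(x) "var" %in% names(x)

plain <- data.frame(y = 1:2)
strict <- strict_df(plain)

# Works on a plain data.frame.
print(has_var_old(plain))

# Fails once the object carries its own `$` method.
outcome <- tryCatch(has_var_old(strict),
                    error = function(e) conditionMessage(e))
print(outcome)

# The names() test is unaffected by the subclass.
print(has_var_new(strict))
```

The calling code never refers to `strict_df`; the error arises purely from method dispatch on whatever object the user passes in.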
The other thing you need to be aware of if you're using the other approach is partial matching:

df <- data.frame(xyz = 1)
is.null(df$x)
#> [1] FALSE

Duncan - I think that argues for including a has_name() (hasName() ?) function in base R. Is that something you'd consider?

Hadley
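To make the partial-matching pitfall concrete, the following base R sketch extends the one-column example from the message and compares `$` with two exact-matching alternatives.

```r
df <- data.frame(xyz = 1)

# `$` partially matches "x" to the column "xyz", so the is.null() test
# wrongly concludes that a column named "x" exists.
print(is.null(df$x))

# `[[` uses exact matching by default, and so does a names() lookup.
print(is.null(df[["x"]]))
print("x" %in% names(df))
```

So a test based on `is.null(df$x)` can give a false positive whenever the requested name is a unique prefix of an existing column name.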
Yes, I'd consider it. I think hasName() would be more consistent with other has*() functions in the R sources. I guess the implementation should be defined to be equivalent to

    hasName <- function(x, name)
        name %in% names(x)

though it would make sense to make a faster internal implementation; !is.null(df$x) is quite a bit faster than "x" %in% names(df).

Duncan Murdoch
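Readers who want to check the speed remark on their own machine could use something like the sketch below. The name `hasName_sketch`, the test data and the repetition count are all assumptions chosen for illustration, and the timings will differ between machines and R versions.

```r
# The one-liner from the message, under a different name to avoid
# masking any function of the same name.
hasName_sketch <- function(x, name)
    name %in% names(x)

df <- data.frame(x = 1:10, y = letters[1:10])
n <- 1e5  # number of repetitions; adjust to taste

t_null  <- system.time(for (i in seq_len(n)) !is.null(df$x))
t_names <- system.time(for (i in seq_len(n)) hasName_sketch(df, "x"))

# Elapsed seconds for each approach; no particular ratio is claimed here.
print(c(is_null = t_null[["elapsed"]], names_lookup = t_names[["elapsed"]]))
```

R releases from 3.4.0 onward ship a hasName() function in the utils package, so on a recent R the sketch above is mainly useful for comparison.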
Hadley's note on partial matching is what scares me most about the is.null() coding. So the need for a hasName() (or whatever) function seems all the more compelling, and that it be in base R. Perhaps it should be generic, with a default method that searches in the names attribute, potentially extensible to other classes.

Thanks so much, several of you, for your positive and helpful responses.

Russ
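The generic design floated above might look roughly like the following S3 sketch. The generic's name `has_name_generic` and the environment method are hypothetical, chosen only to show how a default method on the names attribute could be extended to another class.

```r
# Hypothetical generic; the name 'has_name_generic' is illustrative only.
has_name_generic <- function(x, name, ...) UseMethod("has_name_generic")

# Default method: search the names attribute.
has_name_generic.default <- function(x, name, ...) {
    name %in% names(x)
}

# Example extension: environments, looked up via ls().
has_name_generic.environment <- function(x, name, ...) {
    name %in% ls(x, all.names = TRUE)
}

df <- data.frame(xyz = 1)
env <- new.env()
assign("xyz", 1, envir = env)

print(has_name_generic(df, "x"))     # exact match only, so FALSE
print(has_name_generic(df, "xyz"))
print(has_name_generic(env, "xyz"))
```

The trade-off is the extra dispatch step on every call, which is the cost a non-generic one-liner avoids.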
I am thinking of putting it in, but if I do the definition will be equivalent to the one-liner from my earlier message, hasName <- function(x, name) name %in% names(x). That's already slower than the is.null() test; making it generic would slow it down too much.

Duncan Murdoch
Currently exists("someName", where=someDataFrame) reports whether "someName" is a column of the data.frame 'someDataFrame', and the 'where=' may be omitted. If we have an environment we use exists("someName", envir=someEnvironment). It might be nice to continue using exists() instead of introducing a new function has(), although, since we want the same syntax to work for environments, data.frames, tbl_dfs, data.tables, etc., we may need the new function.
Bill Dunlap
TIBCO Software
wdunlap tibco.com
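The two exists() calls mentioned above can be tried with the sketch below. The object names follow the message; the data, the second name `otherName`, and the use of `inherits = FALSE` for the environment case are additions made for the example.

```r
someDataFrame <- data.frame(someName = 1:3)
someEnvironment <- new.env()
assign("someName", 1:3, envir = someEnvironment)

# Data frame: 'where' is converted to an environment holding the columns.
print(exists("someName", where = someDataFrame))
print(exists("otherName", where = someDataFrame))

# Environment: inherits = FALSE restricts the search to that environment
# instead of also looking through its enclosing environments.
print(exists("someName", envir = someEnvironment, inherits = FALSE))
print(exists("otherName", envir = someEnvironment, inherits = FALSE))
```

For environments, leaving `inherits` at its default of TRUE would also report names found in enclosing environments, which is usually not what a membership test wants.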
On Tue, Jun 28, 2016 at 4:08 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:
On 27/06/2016 10:15 PM, Lenth, Russell V wrote:
Hadley's note on partial matching has me scared the most concerning the as.null() coding. So the need for a hasName() (or whatever) function seems all the more compelling, and that it be in base R. Perhaps it should be generic, with a default method that searches in the names attribute, potentially extensible to other classes.
I am thinking of putting it in, but if I do the definition will be equivalent to the one-liner down below. That's already slower than the is.null() test; making it generic would slow it down too much.
Duncan Murdoch
Thanks so much, several of you, for your positive and helpful responses.
Russ
-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.duncan at gmail.com]
Sent: Monday, June 27, 2016 12:50 PM
To: Hadley Wickham <h.wickham at gmail.com>; Lenth, Russell V <russell-lenth at uiowa.edu>
Cc: r-package-devel at r-project.org
Subject: Re: [R-pkg-devel] Absent variables and tibble
On 27/06/2016 1:09 PM, Hadley Wickham wrote:
The other thing you need to be aware of if you're using the other approach is partial matching:
df <- data.frame(xyz = 1)
is.null(df$x)
#> [1] FALSE
Duncan - I think that argues for including a has_name() (hasName() ?) function in base R. Is that something you'd consider?
Yes, I'd consider it. I think hasName() would be more consistent with other has*() functions in the R sources. I guess the implementation should be defined to be equivalent to
hasName <- function(x, name) name %in% names(x)
though it would make sense to make a faster internal implementation; !is.null(df$x) is quite a bit faster than "x" %in% names(df).
Duncan Murdoch
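The partial-matching pitfall and the proposed one-liner can be compared side by side. This is a sketch only: the helper is named has_name() here so that it does not mask any function of the same name in utils, and its body is the one-liner quoted above.

```r
# Stand-in for the one-liner proposed in this thread (name is illustrative).
has_name <- function(x, name) name %in% names(x)

df <- data.frame(xyz = 1)

# `$` partially matches "x" to the column "xyz" on a base data.frame,
# so the is.null() test wrongly suggests that a column "x" exists.
is.null(df$x)         # FALSE

# Exact matching gives the intended answer.
has_name(df, "x")     # FALSE
has_name(df, "xyz")   # TRUE
is.null(df[["x"]])    # TRUE: `[[` matches exactly by default
```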
Hadley
On Mon, Jun 27, 2016 at 10:05 AM, Lenth, Russell V <russell-lenth at uiowa.edu> wrote:
Thanks, Hadley. I do understand why you'd want more careful checking. If you're going to provide a variable-existing function, may I suggest a short name like 'has'? I.e., has(x, var) returns TRUE if x has var in it.
Thanks
Russ
On Jun 27, 2016, at 9:47 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
On Mon, Jun 27, 2016 at 9:03 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 27/06/2016 9:22 AM, Lenth, Russell V wrote:
My package 'lsmeans' is now suddenly broken because of a new
provision in the 'tibble' package (loaded by 'dplyr' 0.5.0), whereby
the "[[" and "$"
methods for 'tbl_df' objects - as documented - throw an error if
a variable is not found.
The problem is that my code uses tests like this:
if (is.null (x$var)) {...}
to see whether 'x' has a variable 'var'. Obviously, I can work
around this using
if (!("var" %in% names(x))) {...}
but (a) I like the first version better, in terms of the code
being understandable; and (b) isn't there a long history whereby
we can expect a NULL result when accessing an absent member of a
list (and hence a data.frame)? (c) the code base for 'lsmeans'
has about 50 instances of such tests.
Anyway, I wonder if a lot of other package developers test for
absent variables in that first way; if so, they too are in for a
rude awakening if their users provide a tbl_df instead of a
data.frame. And what is considered the best practice for testing
absence of a list member? Apparently, not either of the above;
and because of (c), I want to do these many tedious corrections only
once.
Thanks for any light you can shed.
This is why CRAN asks that people test reverse dependencies.
Which we did do - the problem is that this is actually caused by a recursive reverse dependency (lsmeans -> dplyr -> tibble), and we didn't correctly anticipate how much pain this would cause.
I think the most defensive thing you can do is to write a small function name_missing <- function(x, name) !(name %in% names(x)) and use name_missing(x, "var") in your tests. (Pick your own name to make your code understandable if you don't like my choice.) You could suggest to the tibble maintainers that they add a function like this.
We're definitely going to add this. And I think we'll make df[["var"]] return NULL too, so at least there's one easy way to opt out. The motivation for this change was that returning NULL + recycling rules means it's very easy for errors to silently propagate. But I think this approach might be somewhat too aggressive - I hadn't considered that people use `is.null()` to check for missing columns. We'll try and get an update to tibble out soon after useR. Thoughts on what we should do are greatly appreciated. Hadley -- http://hadley.nz
______________________________________________ R-package-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-package-devel
On 28/06/2016 10:03 AM, William Dunlap wrote:
Currently exists("someName", where=someDataFrame) reports whether "someName" is a column of the data.frame 'someDataFrame', and the 'where=' may be omitted. If we have an environment we use exists("someName", envir=someEnvironment). It might be nice to continue using exists() instead of introducing a new function has(), although, since we want the same syntax to work for environments, data.frames, tbl_dfs, data.tables, etc., we may need the new function.
One issue with exists("someName", someDataFrame) is that it's quite a
bit slower. (I think it converts the dataframe to an environment.) On
the other hand, getting the names from an environment requires more work
than checking for one, so exists("someName", someEnvironment) is faster
than checking for the name in the obvious way. The slow operations
could be sped up, but is that worth the effort?
The other issue with exists() is that it has a complicated definition
and hard to follow argument list (with args "where", "envir", "frame"
that all do related things); the thing I like about hasName() is that it
is very clear what it does. A criticism of it is that it is hardly any
shorter than just doing
name %in% names(x)
so is there really any point in making a function for this?
Duncan Murdoch
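For anyone who wants to check the speed claims on their own machine, a rough timing sketch follows. The data frame size, column name, and repetition count are arbitrary assumptions, and the timings will vary by R version and hardware, so no expected numbers are given.

```r
# Illustrative data: 200 columns named V1..V200.
df <- as.data.frame(matrix(0, nrow = 10, ncol = 200))
n  <- 2000  # arbitrary repetition count

t_in     <- system.time(for (i in seq_len(n)) "V150" %in% names(df))
t_null   <- system.time(for (i in seq_len(n)) !is.null(df[["V150"]]))
t_exists <- system.time(for (i in seq_len(n)) exists("V150", where = df, inherits = FALSE))

# Elapsed seconds for each approach.
timings <- c(in_names  = t_in[["elapsed"]],
             not_null  = t_null[["elapsed"]],
             exists_df = t_exists[["elapsed"]])
print(timings)

# All three approaches agree on the answer for a data.frame column.
answers <- c("V150" %in% names(df),
             !is.null(df[["V150"]]),
             exists("V150", where = df, inherits = FALSE))
```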
I've now added a simple implementation of hasName to R-devel and R-patched. When I find the time, I'll go through the base packages and change the !is.null(x$name) idiom to hasName. (All but "base", that is: hasName is in utils, and it is better if base remains self-contained.) If any bottlenecks turn up, I could make hasName faster by redoing it in C code, but so far it is just R code very similar to the %in% implementation.
Duncan Murdoch
On 28/06/2016 1:15 PM, Duncan Murdoch wrote:
I've now added a simple implementation of hasName to R-devel and R-patched. When I find the time, I'll go through the base packages and change the !is.null(x$name) idiom to hasName. (All but "base", that is: hasName is in utils, and it is better if base remains self-contained.)
After looking at a few cases, I don't think I'll do that. Often the test is used to find out if x$name will work. hasName(x, "name") is not sufficient for that: x might have that as a name, but x$name won't work, e.g. in a named numeric vector. I don't think we have a simple test corresponding to
!is.null(x$name) && hasName(x, "name")
Probably the best approach is to run tests with options(warnPartialMatchDollar = TRUE), and just use the simple !is.null(x$name).
Duncan Murdoch
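Both points in this last message can be reproduced in a short sketch. The objects are made up for illustration, and the name check uses %in% directly so the code does not depend on hasName being available in the installed R version.

```r
# A named numeric vector has the name, but `$` is invalid for atomic vectors.
v <- c(name = 1, other = 2)
name_present <- "name" %in% names(v)
dollar_works <- tryCatch({ v$name; TRUE }, error = function(e) FALSE)

# With the option set, a partial match through `$` on a list raises a warning.
old <- options(warnPartialMatchDollar = TRUE)
l <- list(xyz = 1)
partial_warned <- tryCatch({ l$x; FALSE }, warning = function(w) TRUE)
options(old)  # restore the previous setting

name_present     # TRUE
dollar_works     # FALSE
partial_warned   # TRUE
```

Setting the option in a test suite surfaces accidental partial matches while leaving the !is.null(x$name) idiom in place.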