Dear r-devel,
See below:
transform(data.frame(a = 1), 2, 3)
#> a
#> 1 1
transform(data.frame(a = 1), b=2, 3)
#> a b X3
#> 1 1 2 3
We need a small modification to make it work consistently, see below:
transform.data.frame <- function (`_data`, ...) {
    e <- eval(substitute(list(...)), `_data`, parent.frame())
    tags <- names(e)
    ## NEW LINE -----------------------------------------------
    if (is.null(tags)) tags <- character(length(e))
    inx <- match(tags, names(`_data`))
    matched <- !is.na(inx)
    if (any(matched)) {
        `_data`[inx[matched]] <- e[matched]
        `_data` <- data.frame(`_data`)
    }
    if (!all(matched))
        do.call("data.frame", c(list(`_data`), e[!matched]))
    else `_data`
}
transform(data.frame(a = 1), 2, 3)
#> a X2 X3
#> 1 1 2 3
transform(data.frame(a = 1), b=2, 3)
#> a b X3
#> 1 1 2 3
Thanks,
Antoine
transform.data.frame() ignores unnamed arguments when no named argument is provided
9 messages · Antoine Fabri, Sebastian Meyer, Gabriel Becker +3 more
Note that ?transform.data.frame says arguments need to be named, so you
are testing unspecified behaviour. I guess this falls in a similar
category as the note
If some of the values are not vectors of the appropriate length,
you deserve whatever you get!
Experiments for a related Problem Report
(<https://bugs.r-project.org/show_bug.cgi?id=17890>) showed that
packages bravely ignore the caveats mentioned on the help page,
including assuming that the rows of the input data frame are recycled. I
haven't yet seen any uses of unnamed arguments, though.
That said, I agree that transform.data.frame() should be improved. Maybe
unnamed arguments should always be ignored with a warning. My feeling is
that these would more often be usage errors than intentional, e.g.:
> data.frame(a = 1) |> transform(b = 2, a + 2) # "forgetting" a=
a b X3
1 1 2 3
I also think the implicit check.names=TRUE behaviour should be disabled. In
> list2DF(list(`A-1` = 1)) |> transform(B = 2)
A.1 B
1 1 2
transforming B should not touch the other columns.
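To make the check.names point concrete, here is a sketch (base R only) of where the renaming comes from: transform() rebuilds its result with data.frame(), whose default check.names = TRUE runs make.names() on every column name, including the untouched ones.

```r
## A syntactically invalid name survives construction via list2DF()...
df <- list2DF(list(`A-1` = 1))
names(df)                      # "A-1"

## ...but transform() funnels the result through data.frame(), whose
## default check.names = TRUE mangles the pre-existing name as well:
names(transform(df, B = 2))    # "A.1" "B"

## data.frame() itself can preserve the name when asked:
names(data.frame(df, B = 2, check.names = FALSE))  # "A-1" "B"
```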
I'm less sure about some other forms of undocumented behaviour as
described in Comment 6 of the linked PR.
Sebastian Meyer
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Thanks and good point about unspecified behavior. The way it behaves now
(when it doesn't ignore) is more consistent with data.frame() though, so I
prefer that to a "warn and ignore" behaviour:

data.frame(a = 1, b = 2, 3)
#> a b X3
#> 1 1 2 3

data.frame(a = 1, 2, 3)
#> a X2 X3
#> 1 1 2 3

(and in general warnings make for unpleasant debugging so I prefer when we
don't add new ones if avoidable)

Playing a bit more with it, it would make sense to me that the following
have the same output:

coefficient <- 3

data.frame(value1 = 5) |> transform(coefficient, value2 = coefficient * value1)
#> value1 X3 value2
#> 1 5 3 15

data.frame(value1 = 5, coefficient) |> transform(value2 = coefficient * value1)
#> value1 coefficient value2
#> 1 5 3 15
On Thu, Mar 2, 2023 at 2:02 PM Antoine Fabri <antoine.fabri at gmail.com> wrote:
> Thanks and good point about unspecified behavior. The way it behaves now
> (when it doesn't ignore) is more consistent with data.frame() though so I
> prefer that to a "warn and ignore" behaviour: [...]
> (and in general warnings make for unpleasant debugging so I prefer when
> we don't add new ones if avoidable)
I find silence to be much more unpleasant in practice when debugging, myself, but that may be a personal preference.
> playing a bit more with it, it would make sense to me that the following
> have the same output:
>
> coefficient <- 3
> data.frame(value1 = 5) |> transform(coefficient, value2 = coefficient * value1)
> #> value1 X3 value2
> #> 1 5 3 15
>
> data.frame(value1 = 5, coefficient) |> transform(value2 = coefficient * value1)
> #> value1 coefficient value2
> #> 1 5 3 15
I'm not so sure. data.frame() is doing some substitute magic to get the column name coefficient there.
coefficient = 3
data.frame(value1 = 5, coefficient)
  value1 coefficient
1      5           3

Beyond that, these two pieces of code are doing subtly but crucially
different things: in the latter, coefficient is a variable in the
data.frame, and when transform resolves that symbol during the calculation
of value2, it *gets the column in the incoming data.frame*. In the former
case, coefficient does not exist in the data.frame, so the symbol is
resolved somewhere else in the scope chain (in this case, the global
environment). These happen to be the same, except for the column name, but
we can see the difference if we change the code to
coefficient <- 3

data.frame(value1 = 5, coefficient = 4) |> transform(value2 = value1 * coefficient)
  value1 coefficient value2
1      5           4     20

data.frame(value1 = 5) |> transform(coefficient = 4, value2 = value1 * coefficient)
  value1 coefficient *value2*
1      5           4     *15*

Please note that another way this difference could rear its head is if
these aren't directly one after each other in a pipe:
coefficient <- 3
df1 <- data.frame(value1 = 5, coefficient)
coefficient <- 4
df2 <- data.frame(value1 = 5)
df1 |> transform(value2 = value1 * coefficient)
  value1 coefficient value2
1      5           3     15
df2 |> transform(coefficient, value2 = value1 * coefficient)
  value1 X4 value2
1      5  4     20

Because someday the place where you do that transform and the place where
coefficient is initially set are going to be far away from each other, so
whether or not you put coefficient into the incoming data will matter.

Best,
~G
Gabriel Becker
on Thu, 2 Mar 2023 14:37:18 -0800 writes:
> On Thu, Mar 2, 2023 at 2:02 PM Antoine Fabri
> <antoine.fabri at gmail.com> wrote:
>> [...]
>> (and in general warnings make for unpleasant debugging so
>> I prefer when we don't add new ones if avoidable)
> I find silence to be much more unpleasant in practice when
> debugging, myself, but that may be a personal preference.
+1
I also *strongly* disagree with the claim
" in general warnings make for unpleasant debugging "
That may be true for beginners (for whom debugging is often not really
feasible anyway ..), but somewhat experienced useRs should know
about
options(warn = 1)  # print warnings as they occur, or
options(warn = 2)  # turn warnings into errors, plus options(error = recover)
or
tryCatch( ..., warning = ..)
or {even more}
Martin
--
Martin Maechler
ETH Zurich and R Core team
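For illustration, here is a minimal sketch of how the tools Martin lists can be combined to locate a warning (all base R; log(-1) is just a stand-in for any warning-producing call):

```r
f <- function() { log(-1); "done" }  # log(-1) warns: "NaNs produced"

## warn = 1: print each warning immediately, where it happens
old <- options(warn = 1)
f()
options(old)

## warn = 2: promote warnings to errors, so try()/traceback()/recover apply
old <- options(warn = 2)
try(f())                             # now stops at log(-1)
options(old)

## or catch the warning at a specific call site:
tryCatch(f(), warning = function(w) conditionMessage(w))
#> "NaNs produced"
```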
Let me expand a bit, I might have expressed myself poorly.
If there is a good reason for a warning I want a warning, and because I
take them seriously I don't want my console cluttered with those that can
be avoided. I strongly believe we should strive to make our code silent,
and I like my console to tell me only what I need to know. In my opinion
many warnings would be better designed as errors, sometimes with an
argument to opt into the behaviour, or a documented way to work around them. Some
other warnings should just be documented behavior, because the behavior is
not all that surprising.
Some reasons why I find warnings hard to debug:
- options(warn = 1) is not always enough to spot the source of the warning
- options(warn = 2) fails at every warning, including the ones that are not
interesting to the user and that they may not do anything about, in these
cases you'll have to find a way to shut off the first to get to the second,
and if it's packaged code that's not fun.
- Unlike with errors, traceback() won't help.
- tryCatch() will help you only if you call it at the right place, assuming
you've found it.
- We might also have many harmless warnings triggered through loops and
hiding important ones.
- When you are sure that you are OK with your code despite the warning, say
`as.numeric(c("1", "2", "foo"))`, a workaround might be expensive (here we
could use a regex first to ditch the non-numeric strings, but who does that?),
so you're tempted to use `suppressWarnings()`, but then you might be
suppressing other important warnings, so you've just made your code less safe
because the developer wanted to make it safer (you might say it's on the
user, but still, we get suboptimal code that was avoidable).
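As a possible middle ground for the `as.numeric()` example, a calling handler can muffle just the known-harmless warning while letting every other warning through. A sketch in base R (note that matching on the message text is fragile across locales):

```r
## Suppress only the "NAs introduced by coercion" warning; pass others on.
as_num_quietly <- function(x) {
  withCallingHandlers(
    as.numeric(x),
    warning = function(w) {
      if (grepl("NAs introduced by coercion", conditionMessage(w)))
        invokeRestart("muffleWarning")  # silence this one, resume evaluation
      ## any other warning propagates as usual
    }
  )
}

as_num_quietly(c("1", "2", "foo"))  # 1 2 NA, no warning printed
```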
Of course I might miss some approaches that would make my experience of
debugging warnings more pleasant.
In our precise case I don't find the behavior surprising enough to warrant
more precious red ink since it's close to what we get with data.frame(),
and close to what we get with dplyr::mutate() FWIW, so I'd be personally
happier to have this documented and work silently.
Either way I appreciate you considering the problem.
Thanks,
Antoine
For what it's worth I think the increased emphasis on classed errors should help with this (i.e., it will be easier to filter out errors you know are false positives/irrelevant for your use case).
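For example, with classed conditions a handler can key on the condition class instead of the message text. A sketch using base R's condition system (the class name myPkg_unnamed_arg is made up for illustration):

```r
## Signal a warning carrying a custom class
warn_unnamed <- function() {
  warning(warningCondition("unnamed argument ignored",
                           class = "myPkg_unnamed_arg"))
}

## Callers can then filter precisely, with no string matching:
withCallingHandlers(
  warn_unnamed(),
  myPkg_unnamed_arg = function(w) invokeRestart("muffleWarning")
)
```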
I am probably mistaken, but it looks to me like the design of much of the
data.frame infrastructure not only does not insist you give columns names,
but even has all kinds of options such as check.names and fix.empty.names:
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame

During the lifetime of a column, it can get removed, renamed, transformed
in many ways, and so on. A data.frame read in from a file such as a .CSV
often begins with temporarily created names. It is so common that sometimes
not giving a name is a choice and not in any way an error. I have seen some
rather odd names in backticks that include spaces, and seen duplicate
names. The reality is you can index by column number two, and maybe no
actual name was needed by the one creating or modifying the data.

Some placed warnings are welcome as they tend to reflect a possibly serious
error. But that error may not be easy to catch at this point versus later
in the game. If later the program tries to access the misnamed column, then
an error makes sense. Warnings, if overused, get old quickly, and you
regularly see code written to suppress startup messages or warnings because
the same message shown every day becomes something you ignore mentally even
if not suppressed. How many times has loading the tidyverse reminded me it
is shadowing a few base R functions? How many times have I really cared?

What makes some sense to me is to add an argument to some functions BEGGING
to be shown the errors of your ways, and turn that on as you wish, often
after something has gone wrong.
Hi Avi,
On Fri, Mar 3, 2023 at 9:07 PM <avi.e.gross at gmail.com> wrote:

> I am probably mistaken but it looks to me like the design of much of the
> data.frame infrastructure not only does not insist you give columns names,
> but even has all kinds of options such as check.names and fix.empty.names [...]
I think this is true, but that's for the *construction* of a data.frame,
whereas, in my opinion from what I can tell, transform is for operating on
a data.frame that has already been constructed. I'm not personally
convinced the same allowances should be made at this conceptually later
stage in data processing.
> During the lifetime of a column, it can get removed, renamed, transformed
> in many ways and so on. [...] The reality is you can index by column
> number two and maybe no actual name was needed by the one creating or
> modifying the data.
You can but this creates brittle, difficult to maintain code to the extent that I consider this an anti-pattern, and I don't believe I'm alone in that.
> Some placed warnings are welcome as they tend to reflect a possibly
> serious error. [...] How many times has loading the tidyverse reminded me
> it is shadowing a few base R functions? How many times have I really
> cared?
I think this is a bad example to make your case on, because symbol masking
is actually *really* important. In bioinformatics, Bioconductor is the
flagship (which sails upon the sea that R provides), but guess what: dplyr
and Bioconductor both define filter, and they mean completely different,
incompatible things. I have seen code that wanted one version and got the
other, in both directions, and in neither case is it fun, but without that
warning it would be a dystopian nightmarescape that scarcely bears thinking
about.
> What makes some sense to me is to add an argument to some functions
> BEGGING to be shown the errors of your ways and turn that on as you wish,
> often after something has gone wrong.
Flipping this on its head, I wonder, alternatively, if there might be a
"strict" mode for transform which errors out on unnamed arguments, instead
of providing the current undefined behavior.

Best,
~G
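A rough sketch of what such a strict variant could look like (strict_transform is a hypothetical wrapper for illustration, not an existing function):

```r
## Hypothetical: refuse unnamed arguments instead of silently ignoring them
strict_transform <- function(`_data`, ...) {
  ## names of the call to list(...): first slot is the function, so drop it
  tags <- names(substitute(list(...)))[-1]
  if (is.null(tags) || any(tags == ""))
    stop("all arguments in ... must be named")
  transform(`_data`, ...)
}

## strict_transform(data.frame(a = 1), b = 2)     # works as usual
## strict_transform(data.frame(a = 1), b = 2, 3)  # error
```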