The model.weights() and model.offset() functions from the 'stats'
package index possibly-missing elements of a data frame via $, e.g.
x$"(offset)"
x$"(weights)"
This returns NULL without comment when x is a data frame:
x <- data.frame(a=1)
x$"(offset)" ## NULL
x$"(weights)" ## NULL
However, when x is a tibble we get a warning as well:
x <- tibble::as_tibble(x)
x$"(offset)"
## NULL
## Warning message:
## Unknown or uninitialised column: `(offset)`.
I know it's not R-core's responsibility to manage forward
compatibility with tibbles, but in this case [[-indexing would seem to
be better practice in any case.
Might a patch be accepted ... ?
cheers
Ben Bolker
model.weights and model.offset: request for adjustment
5 messages · Ben Bolker, tim@t@yior m@iii@g oii hidde@eieph@@ts@co@uk, Martin Maechler
1 day later
Ben Bolker
on Tue, 1 Feb 2022 21:21:46 -0500 writes:
> The model.weights() and model.offset() functions from the 'stats'
> package index possibly-missing elements of a data frame via $, e.g.
> x$"(offset)"
> x$"(weights)"
> This returns NULL without comment when x is a data frame:
> x <- data.frame(a=1)
> x$"(offset)" ## NULL
> x$"(weights)" ## NULL
> However, when x is a tibble we get a warning as well:
> x <- tibble::as_tibble(x)
> x$"(offset)"
> ## NULL
> ## Warning message:
> ## Unknown or uninitialised column: `(offset)`.
> I know it's not R-core's responsibility to manage forward
> compatibility with tibbles, but in this case [[-indexing would seem to
> be better practice in any case.
Yes, I would agree: we should use [[ instead of $ here
in order to force exact matching just as principle
Importantly, because also mf[["(weights)"]]
will return NULL without a warning for a model/data frame, and
it seems it does so also for tibbles.
> Might a patch be accepted ... ?
That would not be necessary.
There's one remaining problem however:
`$` access is clearly faster than `[[` for small data frames
(because `$` is a primitive function doing everything in C,
whereas `[[` calls the R level data frame method ).
Faster in both cases, i.e., when there *is* a column and when there
is none (and NULL is returned), e.g., for the first case
system.time(for(i in 1:20000) df[["a"]])
user system elapsed 0.064 0.000 0.065
system.time(for(i in 1:20000) df$a)
user system elapsed
0.009 0.000 0.009
So that's probably been the reason why `$` has been prefered?
Martin
> cheers
> Ben Bolker
On 03/02/2022 11:14 Martin Maechler <maechler at stat.math.ethz.ch> wrote:
Ben Bolker
on Tue, 1 Feb 2022 21:21:46 -0500 writes:
> The model.weights() and model.offset() functions from the 'stats'
> package index possibly-missing elements of a data frame via $, e.g.
> x$"(offset)"
> x$"(weights)"
> This returns NULL without comment when x is a data frame:
> x <- data.frame(a=1)
> x$"(offset)" ## NULL
> x$"(weights)" ## NULL
> However, when x is a tibble we get a warning as well:
> x <- tibble::as_tibble(x)
> x$"(offset)"
> ## NULL
> ## Warning message:
> ## Unknown or uninitialised column: `(offset)`.
> I know it's not R-core's responsibility to manage forward
> compatibility with tibbles, but in this case [[-indexing would seem to
> be better practice in any case.
Yes, I would agree: we should use [[ instead of $ here in order to force exact matching just as principle Importantly, because also mf[["(weights)"]] will return NULL without a warning for a model/data frame, and it seems it does so also for tibbles.
> Might a patch be accepted ... ?
That would not be necessary. There's one remaining problem however: `$` access is clearly faster than `[[` for small data frames (because `$` is a primitive function doing everything in C, whereas `[[` calls the R level data frame method ). Faster in both cases, i.e., when there *is* a column and when there is none (and NULL is returned), e.g., for the first case
system.time(for(i in 1:20000) df[["a"]])
user system elapsed 0.064 0.000 0.065
system.time(for(i in 1:20000) df$a)
user system elapsed 0.009 0.000 0.009 So that's probably been the reason why `$` has been prefered?
Would .subset2(df, "a) be preferable? R> df <- mtcars R> system.time(for(i in 1:20000) df[["hp"]]) user system elapsed 0.078 0.000 0.078 R> system.time(for(i in 1:20000) df$hp) user system elapsed 0.011 0.000 0.010 R> system.time(for(i in 1:20000) .subset2(df,"hp")) user system elapsed 0.004 0.000 0.004 Tim
tim taylor
on Thu, 3 Feb 2022 11:30:17 +0000 (GMT) writes:
>> On 03/02/2022 11:14 Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>
>>
>> >>>>> Ben Bolker
>> >>>>> on Tue, 1 Feb 2022 21:21:46 -0500 writes:
>>
>> > The model.weights() and model.offset() functions from the 'stats'
>> > package index possibly-missing elements of a data frame via $, e.g.
>>
>> > x$"(offset)"
>> > x$"(weights)"
>>
>> > This returns NULL without comment when x is a data frame:
>>
>> > x <- data.frame(a=1)
>> > x$"(offset)" ## NULL
>> > x$"(weights)" ## NULL
>>
>> > However, when x is a tibble we get a warning as well:
>>
>> > x <- tibble::as_tibble(x)
>> > x$"(offset)"
>> > ## NULL
>> > ## Warning message:
>> > ## Unknown or uninitialised column: `(offset)`.
>>
>> > I know it's not R-core's responsibility to manage forward
>> > compatibility with tibbles, but in this case [[-indexing would seem to
>> > be better practice in any case.
>>
>> Yes, I would agree: we should use [[ instead of $ here
>> in order to force exact matching just as principle
>>
>> Importantly, because also mf[["(weights)"]]
>> will return NULL without a warning for a model/data frame, and
>> it seems it does so also for tibbles.
>>
>> > Might a patch be accepted ... ?
>>
>> That would not be necessary.
>>
>> There's one remaining problem however:
>> `$` access is clearly faster than `[[` for small data frames
>> (because `$` is a primitive function doing everything in C,
>> whereas `[[` calls the R level data frame method ).
>>
>> Faster in both cases, i.e., when there *is* a column and when there
>> is none (and NULL is returned), e.g., for the first case
>>
>> > system.time(for(i in 1:20000) df[["a"]])
>> user system elapsed
>> 0.064 0.000 0.065
>> > system.time(for(i in 1:20000) df$a)
>> user system elapsed
>> 0.009 0.000 0.009
>>
>> So that's probably been the reason why `$` has been prefered?
> Would .subset2(df, "a) be preferable?
R> df <- mtcars
R> system.time(for(i in 1:20000) df[["hp"]])
> user system elapsed
> 0.078 0.000 0.078
R> system.time(for(i in 1:20000) df$hp)
> user system elapsed
> 0.011 0.000 0.010
R> system.time(for(i in 1:20000) .subset2(df,"hp"))
> user system elapsed
> 0.004 0.000 0.004
> Tim
Yes, I think that's a very good idea --
notably, as interestingly it seems to work with tibble's very
well, too.
Martin
Martin Maechler
on Thu, 3 Feb 2022 15:21:45 +0100 writes:
tim taylor
on Thu, 3 Feb 2022 11:30:17 +0000 (GMT) writes:
>>> On 03/02/2022 11:14 Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>>
>>>
>>> >>>>> Ben Bolker
>>> >>>>> on Tue, 1 Feb 2022 21:21:46 -0500 writes:
>>>
>>> > The model.weights() and model.offset() functions from the 'stats'
>>> > package index possibly-missing elements of a data frame via $, e.g.
>>>
>>> > x$"(offset)"
>>> > x$"(weights)"
>>>
>>> > This returns NULL without comment when x is a data frame:
>>>
>>> > x <- data.frame(a=1)
>>> > x$"(offset)" ## NULL
>>> > x$"(weights)" ## NULL
>>>
>>> > However, when x is a tibble we get a warning as well:
>>>
>>> > x <- tibble::as_tibble(x)
>>> > x$"(offset)"
>>> > ## NULL
>>> > ## Warning message:
>>> > ## Unknown or uninitialised column: `(offset)`.
>>>
>>> > I know it's not R-core's responsibility to manage forward
>>> > compatibility with tibbles, but in this case [[-indexing would seem to
>>> > be better practice in any case.
>>>
>>> Yes, I would agree: we should use [[ instead of $ here
>>> in order to force exact matching just as principle
>>>
>>> Importantly, because also mf[["(weights)"]]
>>> will return NULL without a warning for a model/data frame, and
>>> it seems it does so also for tibbles.
>>>
>>> > Might a patch be accepted ... ?
>>>
>>> That would not be necessary.
>>>
>>> There's one remaining problem however:
>>> `$` access is clearly faster than `[[` for small data frames
>>> (because `$` is a primitive function doing everything in C,
>>> whereas `[[` calls the R level data frame method ).
>>>
>>> Faster in both cases, i.e., when there *is* a column and when there
>>> is none (and NULL is returned), e.g., for the first case
>>>
>>> > system.time(for(i in 1:20000) df[["a"]])
>>> user system elapsed
>>> 0.064 0.000 0.065
>>> > system.time(for(i in 1:20000) df$a)
>>> user system elapsed
>>> 0.009 0.000 0.009
>>>
>>> So that's probably been the reason why `$` has been prefered?
>> Would .subset2(df, "a) be preferable?
R> df <- mtcars
R> system.time(for(i in 1:20000) df[["hp"]])
>> user system elapsed
>> 0.078 0.000 0.078
R> system.time(for(i in 1:20000) df$hp)
>> user system elapsed
>> 0.011 0.000 0.010
R> system.time(for(i in 1:20000) .subset2(df,"hp"))
>> user system elapsed
>> 0.004 0.000 0.004
>> Tim
> Yes, I think that's a very good idea --
> notably, as interestingly it seems to work with tibble's very
> well, too.
Interestingly (or not), changing this also fixes a real (rare!) bug:
When digging for a regression test, I've stumbled over an lm() example,
which when modified to use the not so common "(weight)_2" as
*predictor* variable name it started to use that both as
predictor and also as weight (of some kind) such that the fit
changed.
This problem went away after apply the change,
[replacing `a$b` with `.subset2(a,b)]
Now committed to R-devel, svn rev 81650.
If there are no negative effects, this may also be backported to
R-patched.
Thank you both, once more!
Martin