model.weights and model.offset: request for adjustment - R-devel

Tue, Feb 1, 2022 6:21 PM #

The model.weights() and model.offset() functions from the 'stats' 
package index possibly-missing elements of a data frame via $, e.g.

x$"(offset)"
x$"(weights)"

This returns NULL without comment when x is a data frame:

x <- data.frame(a=1)
x$"(offset)"  ## NULL
x$"(weights)"  ## NULL

However, when x is a tibble we get a warning as well:

x <- tibble::as_tibble(x)
x$"(offset)"
## NULL
## Warning message:
## Unknown or uninitialised column: `(offset)`.

    I know it's not R-core's responsibility to manage forward 
compatibility with tibbles, but in this case [[-indexing would seem to 
be better practice in any case.

   Might a patch be accepted ... ?

   cheers
    Ben Bolker

Martin Maechler

Thu, Feb 3, 2022 3:14 AM #

> The model.weights() and model.offset() functions from the 'stats' 
    > package index possibly-missing elements of a data frame via $, e.g.

    > x$"(offset)"
    > x$"(weights)"

    > This returns NULL without comment when x is a data frame:

    > x <- data.frame(a=1)
    > x$"(offset)"  ## NULL
    > x$"(weights)"  ## NULL

    > However, when x is a tibble we get a warning as well:

    > x <- tibble::as_tibble(x)
    > x$"(offset)"
    > ## NULL
    > ## Warning message:
    > ## Unknown or uninitialised column: `(offset)`.

    > I know it's not R-core's responsibility to manage forward 
    > compatibility with tibbles, but in this case [[-indexing would seem to 
    > be better practice in any case.

Yes, I would agree:  we should use  [[ instead of $ here
in order to force exact matching just as principle

Importantly, because  also  mf[["(weights)"]]
will return  NULL without a warning for a model/data frame, and
it seems it does so also for tibbles.

    > Might a patch be accepted ... ?

That would not be necessary.

There's one remaining problem however:
`$` access is clearly faster than `[[` for small data frames
(because `$` is a primitive function doing everything in C, 
 whereas `[[` calls the R level data frame method ).

Faster in both cases, i.e., when there *is* a column and when there
is none (and NULL is returned), e.g., for the first case

user  system elapsed 
  0.064   0.000   0.065

user  system elapsed 
  0.009   0.000   0.009 

So that's probably been the reason why  `$`  has been prefered?


Martin

    > cheers
    > Ben Bolker

tim@t@yior m@iii@g oii hidde@eieph@@ts@co@uk

Thu, Feb 3, 2022 3:30 AM #

On 03/02/2022 11:14 Martin Maechler <maechler at stat.math.ethz.ch> wrote:

Ben Bolker 
    on Tue, 1 Feb 2022 21:21:46 -0500 writes:

    > The model.weights() and model.offset() functions from the 'stats' 
    > package index possibly-missing elements of a data frame via $, e.g.

    > x$"(offset)"
    > x$"(weights)"

    > This returns NULL without comment when x is a data frame:

    > x <- data.frame(a=1)
    > x$"(offset)"  ## NULL
    > x$"(weights)"  ## NULL

    > However, when x is a tibble we get a warning as well:

    > x <- tibble::as_tibble(x)
    > x$"(offset)"
    > ## NULL
    > ## Warning message:
    > ## Unknown or uninitialised column: `(offset)`.

    > I know it's not R-core's responsibility to manage forward 
    > compatibility with tibbles, but in this case [[-indexing would seem to 
    > be better practice in any case.

Yes, I would agree:  we should use  [[ instead of $ here
in order to force exact matching just as principle

Importantly, because  also  mf[["(weights)"]]
will return  NULL without a warning for a model/data frame, and
it seems it does so also for tibbles.

    > Might a patch be accepted ... ?

That would not be necessary.

There's one remaining problem however:
`$` access is clearly faster than `[[` for small data frames
(because `$` is a primitive function doing everything in C, 
 whereas `[[` calls the R level data frame method ).

Faster in both cases, i.e., when there *is* a column and when there
is none (and NULL is returned), e.g., for the first case

system.time(for(i in 1:20000) df[["a"]])

   user  system elapsed 
  0.064   0.000   0.065

system.time(for(i in 1:20000) df$a)

   user  system elapsed 
  0.009   0.000   0.009 

So that's probably been the reason why  `$`  has been prefered?

Would .subset2(df, "a) be preferable?
R> df <- mtcars
R> system.time(for(i in 1:20000) df[["hp"]])
   user  system elapsed 
  0.078   0.000   0.078 
R> system.time(for(i in 1:20000) df$hp)
   user  system elapsed 
  0.011   0.000   0.010 
R> system.time(for(i in 1:20000) .subset2(df,"hp"))
   user  system elapsed 
  0.004   0.000   0.004 
Tim

Martin Maechler

Thu, Feb 3, 2022 6:21 AM #

>> On 03/02/2022 11:14 Martin Maechler <maechler at stat.math.ethz.ch> wrote:

>> 
    >> 
    >> >>>>> Ben Bolker 
    >> >>>>>     on Tue, 1 Feb 2022 21:21:46 -0500 writes:
    >> 
    >> > The model.weights() and model.offset() functions from the 'stats' 
    >> > package index possibly-missing elements of a data frame via $, e.g.
    >> 
    >> > x$"(offset)"
    >> > x$"(weights)"
    >> 
    >> > This returns NULL without comment when x is a data frame:
    >> 
    >> > x <- data.frame(a=1)
    >> > x$"(offset)"  ## NULL
    >> > x$"(weights)"  ## NULL
    >> 
    >> > However, when x is a tibble we get a warning as well:
    >> 
    >> > x <- tibble::as_tibble(x)
    >> > x$"(offset)"
    >> > ## NULL
    >> > ## Warning message:
    >> > ## Unknown or uninitialised column: `(offset)`.
    >> 
    >> > I know it's not R-core's responsibility to manage forward 
    >> > compatibility with tibbles, but in this case [[-indexing would seem to 
    >> > be better practice in any case.
    >> 
    >> Yes, I would agree:  we should use  [[ instead of $ here
    >> in order to force exact matching just as principle
    >> 
    >> Importantly, because  also  mf[["(weights)"]]
    >> will return  NULL without a warning for a model/data frame, and
    >> it seems it does so also for tibbles.
    >> 
    >> > Might a patch be accepted ... ?
    >> 
    >> That would not be necessary.
    >> 
    >> There's one remaining problem however:
    >> `$` access is clearly faster than `[[` for small data frames
    >> (because `$` is a primitive function doing everything in C, 
    >> whereas `[[` calls the R level data frame method ).
    >> 
    >> Faster in both cases, i.e., when there *is* a column and when there
    >> is none (and NULL is returned), e.g., for the first case
    >> 
    >> > system.time(for(i in 1:20000) df[["a"]])
    >> user  system elapsed 
    >> 0.064   0.000   0.065 
    >> > system.time(for(i in 1:20000) df$a)
    >> user  system elapsed 
    >> 0.009   0.000   0.009 
    >> 
    >> So that's probably been the reason why  `$`  has been prefered?

    > Would .subset2(df, "a) be preferable?

    R> df <- mtcars
    R> system.time(for(i in 1:20000) df[["hp"]])
    > user  system elapsed 
    > 0.078   0.000   0.078 
    R> system.time(for(i in 1:20000) df$hp)
    > user  system elapsed 
    > 0.011   0.000   0.010 
    R> system.time(for(i in 1:20000) .subset2(df,"hp"))
    > user  system elapsed 
    > 0.004   0.000   0.004 

    > Tim

Yes, I think that's a very good idea --

notably, as interestingly it seems to work with tibble's very
well, too.

Martin

Martin Maechler

Thu, Feb 3, 2022 9:09 AM #

>>> On 03/02/2022 11:14 Martin Maechler <maechler at stat.math.ethz.ch> wrote:

>>> 
    >>> 
    >>> >>>>> Ben Bolker 
    >>> >>>>>     on Tue, 1 Feb 2022 21:21:46 -0500 writes:
    >>> 
    >>> > The model.weights() and model.offset() functions from the 'stats' 
    >>> > package index possibly-missing elements of a data frame via $, e.g.
    >>> 
    >>> > x$"(offset)"
    >>> > x$"(weights)"
    >>> 
    >>> > This returns NULL without comment when x is a data frame:
    >>> 
    >>> > x <- data.frame(a=1)
    >>> > x$"(offset)"  ## NULL
    >>> > x$"(weights)"  ## NULL
    >>> 
    >>> > However, when x is a tibble we get a warning as well:
    >>> 
    >>> > x <- tibble::as_tibble(x)
    >>> > x$"(offset)"
    >>> > ## NULL
    >>> > ## Warning message:
    >>> > ## Unknown or uninitialised column: `(offset)`.
    >>> 
    >>> > I know it's not R-core's responsibility to manage forward 
    >>> > compatibility with tibbles, but in this case [[-indexing would seem to 
    >>> > be better practice in any case.
    >>> 
    >>> Yes, I would agree:  we should use  [[ instead of $ here
    >>> in order to force exact matching just as principle
    >>> 
    >>> Importantly, because  also  mf[["(weights)"]]
    >>> will return  NULL without a warning for a model/data frame, and
    >>> it seems it does so also for tibbles.
    >>> 
    >>> > Might a patch be accepted ... ?
    >>> 
    >>> That would not be necessary.
    >>> 
    >>> There's one remaining problem however:
    >>> `$` access is clearly faster than `[[` for small data frames
    >>> (because `$` is a primitive function doing everything in C, 
    >>> whereas `[[` calls the R level data frame method ).
    >>> 
    >>> Faster in both cases, i.e., when there *is* a column and when there
    >>> is none (and NULL is returned), e.g., for the first case
    >>> 
    >>> > system.time(for(i in 1:20000) df[["a"]])
    >>> user  system elapsed 
    >>> 0.064   0.000   0.065 
    >>> > system.time(for(i in 1:20000) df$a)
    >>> user  system elapsed 
    >>> 0.009   0.000   0.009 
    >>> 
    >>> So that's probably been the reason why  `$`  has been prefered?

    >> Would .subset2(df, "a) be preferable?

    R> df <- mtcars
    R> system.time(for(i in 1:20000) df[["hp"]])
    >> user  system elapsed 
    >> 0.078   0.000   0.078 
    R> system.time(for(i in 1:20000) df$hp)
    >> user  system elapsed 
    >> 0.011   0.000   0.010 
    R> system.time(for(i in 1:20000) .subset2(df,"hp"))
    >> user  system elapsed 
    >> 0.004   0.000   0.004 

    >> Tim

    > Yes, I think that's a very good idea --

    > notably, as interestingly it seems to work with tibble's very
    > well, too.

Interestingly (or not), changing this also fixes a real (rare!) bug:
When digging for a regression test, I've stumbled over an lm() example,
which when modified to use the not so common  "(weight)_2" as
*predictor* variable name it started to use that both as
predictor and also as weight (of some kind) such that the fit
changed.

This problem went away after apply the change,
[replacing `a$b` with `.subset2(a,b)]

Now committed to R-devel, svn rev 81650.

If there are no negative effects, this may also be backported to
R-patched.

Thank you both, once more!
Martin