update.default: fall back on model.frame in case that the data frame is not in the parent environment

5 messages · Duncan Murdoch, Thaler, Thorn, LAUSANNE, Applied Mathematics

#
Dear all,

Suppose the following code:

--------------8<--------------
mm <- function(datf) {
  lm(y ~ x, data = datf)
}
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)))

l <- mm(mydatf)
-------------->8--------------

If I now want to update l without providing the data argument, e.g. via
update(l, . ~ .), an error occurs:

--------------8<--------------
Error in inherits(x, "data.frame") : object 'datf' not found
-------------->8--------------

and I have to provide the data argument explicitly, in one of two ways:
--------------8<--------------
update(l, . ~ ., data = mydatf)
update(l, . ~ ., data = model.frame(l))
-------------->8--------------

The first work-around is additionally error prone (what if I rename
mydatf earlier in the file? In the best case I just get an error because
mydatf is not defined), and both options are semantically questionable:
I do not want to _update_ the data argument of the lm object; it should
remain untouched.
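To make the situation reproducible, here is a minimal sketch of the failing call and of the two work-arounds. It assumes a fresh session in which no object called datf exists in the workspace; set.seed() and the names res, u1 and u2 are only there for illustration.

```r
mm <- function(datf) {
  lm(y ~ x, data = datf)
}
set.seed(1)
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)))
l <- mm(mydatf)

# update() re-evaluates l$call in the caller's frame, where 'datf' is unknown
res <- tryCatch(update(l, . ~ .), error = function(e) conditionMessage(e))
print(res)

# Both work-arounds refit the same model
u1 <- update(l, . ~ ., data = mydatf)
u2 <- update(l, . ~ ., data = model.frame(l))
print(all.equal(coef(u1), coef(u2)))
```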

So my suggestion would be that update falls back on the data stored in
model.frame if the data argument in the lm call cannot be resolved in
the parent.frame of update. This can be achieved easily by adding just
four lines to update.default:

--------------8<--------------
update.default <- function (object, formula., ..., evaluate = TRUE) {
    call <- object$call
    if (is.null(call)) 
        stop("need an object with call component")
    extras <- match.call(expand.dots = FALSE)$...
    if (!missing(formula.)) 
        call$formula <- update.formula(formula(object), formula.)
    if (length(extras)) {
        existing <- !is.na(match(names(extras), names(call)))
        for (a in names(extras)[existing]) call[[a]] <- extras[[a]]
        if (any(!existing)) {
            call <- c(as.list(call), extras[!existing])
            call <- as.call(call)
        }
    }
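    ## Proposed addition: if the data argument cannot be found in the
    ## caller's frame, fall back on the model frame stored with the fit.
    if (!is.null(call$data)) {
        if (!exists(as.character(call$data), envir = parent.frame()))
            call$data <- model.frame(object)
    }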
    if (evaluate) 
        eval(call, parent.frame())
    else call
}
-------------->8--------------

This is just a quick and dirty hack which works fine here (with the ugly
drawback that the standard output of lm now shows the lengthy explicit
data.frame statement), but I'm sure there are some experts out there who
could take it over from here and polish this idea.
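The same fallback can also be tried without touching update.default at all. The following is only a sketch under assumptions: update_fallback is a hypothetical helper name, not part of R, and it tests whether the data argument can be evaluated rather than whether a name exists. Like the hack above, it only has access to the variables stored in the model frame.

```r
# Hypothetical helper: build the updated call, check whether its data
# argument can be evaluated in 'envir', and otherwise use the model frame.
update_fallback <- function(object, formula., ..., envir = parent.frame()) {
  call <- update(object, formula., ..., evaluate = FALSE)
  if (!is.null(call$data)) {
    resolved <- try(eval(call$data, envir), silent = TRUE)
    if (inherits(resolved, "try-error"))
      call$data <- model.frame(object)
  }
  eval(call, envir)
}

mm <- function(datf) {
  lm(y ~ x, data = datf)
}
set.seed(1)
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)))
l <- mm(mydatf)

u <- update_fallback(l, . ~ .)   # 'datf' is not found, so the model frame is used
print(all.equal(coef(u), coef(l)))
```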

I don't see any problems with this proposal regarding old code, but
if I'm wrong and there are reasons not to touch update.default in
the way I propose, please let me know. Any other feedback is highly
appreciated too.

Thanks for sharing your thoughts with me.

KR,

-Thorn
#
It looks to me as though your proposal would allow update to remove 
variables, but would give erroneous results when adding them.  For example:

mm <- function(datf) {
   lm(y ~ x, data = datf)
}
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)),
                     z = rnorm(20))

l <- mm(mydatf)
update(l, . ~ . + z)   # This fails, z is not found

z <- rnorm(20)
update(l, . ~ . + z)   # This finds the wrong z, without a warning

I'd rather get the "datf not found" error than wrong results.
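The reason can be seen by inspecting the model frame directly. This is a small sketch that emulates the proposed fallback by passing model.frame(l) as data explicitly; set.seed() and the names mf_names, wrong, right and same are illustrative only.

```r
mm <- function(datf) {
  lm(y ~ x, data = datf)
}
set.seed(42)
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)),
                     z = rnorm(20))
l <- mm(mydatf)

# The model frame only keeps the variables used in the fit; z is absent
mf_names <- names(model.frame(l))
print(mf_names)

z <- rnorm(20)   # an unrelated z in the workspace

# Emulating the fallback: z is silently taken from the workspace
wrong <- update(l, . ~ . + z, data = model.frame(l))
# Passing the original data uses the intended column mydatf$z
right <- update(l, . ~ . + z, data = mydatf)

same <- isTRUE(all.equal(coef(wrong), coef(right)))
print(same)
```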

Duncan Murdoch
On 02/08/2011 7:48 AM, Thaler, Thorn, LAUSANNE, Applied Mathematics wrote:
#
On 02/08/2011 9:41 AM, Duncan Murdoch wrote:
... of course, the standard code will give wrong results if there's 
another variable named "datf" in the global environment, so the status 
quo isn't ideal either.
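A short sketch of that case, assuming an unrelated data frame that happens to be called datf in the workspace (its values are made up for illustration):

```r
mm <- function(datf) {
  lm(y ~ x, data = datf)
}
set.seed(1)
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)))
l <- mm(mydatf)

# An unrelated object that shares the name used inside mm()
datf <- data.frame(x = 1:5, y = c(10.1, 7.9, 6.2, 3.8, 2.1))

u <- update(l, . ~ .)   # no error and no warning

n_original <- nrow(model.frame(l))
n_updated  <- nrow(model.frame(u))
print(c(original = n_original, updated = n_updated))
```

The updated fit is based on the 5 rows of the unrelated datf rather than on the 20 rows of mydatf.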

Duncan Murdoch
#
Good point. So let me rephrase the initial problem:

1.) An lm object is fitted somewhere with some data, which resides
somewhere in memory.
2.) An ideal update function would know where the original data is
(rather than assuming that it is stored
  a.) in the parent frame
  b.) under the name given in the call slot of the lm object)

While assumption a.) seems reasonable from my point of view,
assumption b.) is awkward, as pointed out, because it makes it
cumbersome to update models which were created inside a function
(which should not be too rare a use case).

Thus, I have two questions:
1.) Is it somehow possible to retrieve the original data.frame with
which an lm was fitted just from the knowledge of the fit? I fear that
model.frame is the best I have.
2.) Is there any other way of making update aware of where to look for
the model building data?

By the way, another work-around I was just thinking of is to use

mm <- function(datf) {
   l <- lm(y ~ x, data = datf)
   call <- l$call 
   call$data <- substitute(datf)
   l$call <- call
   l   
}

which solves my issue (and which I can very well live with), but I
was wondering whether you see any chance that update could be made
smarter? Thanks for your input.
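For completeness, a sketch of how this work-around behaves when mm is called from the top level with a plain variable name. If mm were called with an expression, or from inside another function, the substituted expression might not be resolvable from where update is later called.

```r
mm <- function(datf) {
  l <- lm(y ~ x, data = datf)
  call <- l$call
  call$data <- substitute(datf)   # the expression the caller passed in
  l$call <- call
  l
}
set.seed(1)
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)))
l <- mm(mydatf)

print(l$call)   # the data argument now reads 'mydatf' instead of 'datf'

u <- update(l, . ~ . - x)   # works without a data argument
print(coef(u))
```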


KR,

-Thorn
#
On 02/08/2011 10:48 AM, Thaler,Thorn,LAUSANNE,Applied Mathematics wrote:
I don't think so.  You can get the environment in which the formula was 
created from the "terms" component of the result; that's the second 
place lm() will look.  The first place it will look is in the explicitly 
specified data variable, and you can get its name, but I don't think the 
result object necessarily stores the full "data" argument or the 
environment in which to look it up.  (In your example, you can look up 
"datf" in environment(l$terms) and get it, but that wouldn't work if the 
formula had also been specified as an argument to mm().)
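A sketch of that lookup, assuming a fresh session with no object called datf in the workspace. The second function, mm2, is a hypothetical variant that takes the formula as an argument, to show the case where the lookup no longer works.

```r
mm <- function(datf) {
  lm(y ~ x, data = datf)
}
set.seed(1)
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)))
l <- mm(mydatf)

# The formula was created inside mm(), so its environment is that call's frame
env <- environment(l$terms)
recovered <- eval(l$call$data, env)
print(identical(recovered, mydatf))

# Hypothetical variant: the formula is created by the caller, not inside mm2()
mm2 <- function(f, datf) {
  lm(f, data = datf)
}
l2 <- mm2(y ~ x, mydatf)
has_datf <- exists("datf", envir = environment(l2$terms), inherits = FALSE)
print(has_datf)
```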
I would suggest something simpler:  return a list containing both l and 
datf, and pass datf to update.  You can attach a class to that list to 
hide some of the ugliness if you like.
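One way this suggestion could look is sketched below. The class name fit_with_data and the constructor are hypothetical, chosen only for illustration; the update method passes the stored data on to update, so variables that were not part of the original formula are still found.

```r
# Hypothetical wrapper: keep the fit and the data it was fitted on together
fit_with_data <- function(datf) {
  structure(list(fit = lm(y ~ x, data = datf), data = datf),
            class = "fit_with_data")
}

# update() method that always supplies the stored data
update.fit_with_data <- function(object, formula., ...) {
  datf <- object$data
  object$fit <- update(object$fit, formula., data = datf, ...)
  object
}

# Printing delegates to the wrapped lm fit
print.fit_with_data <- function(x, ...) {
  print(x$fit, ...)
  invisible(x)
}

set.seed(1)
mydatf <- data.frame(x = rep(1:2, 10), y = rnorm(20, rep(1:2, 10)),
                     z = rnorm(20))
w  <- fit_with_data(mydatf)
w2 <- update(w, . ~ . + z)   # z is taken from the stored data
print(w2)
```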

Duncan Murdoch