Dear r-developers:

I am struggling with some fundamental aspects of model.frame().

Conceptually, I think of a flow from data -> model.frame() -> model.matrix; the data contain _input variables_, while model.matrix contains _predictor variables_: data have been transformed, splines and polynomials have been expanded into their corresponding multi-dimensional bases, and factors have been expanded into appropriate sets of dummy variables depending on their contrasts.

I originally thought of model.frame() as containing input variables as well (but with only the variables needed by the model, and with cases containing NAs handled according to the relevant na.action setting), but that's not quite true. While factors are retained as-is, splines and polynomials and parameter transformations are evaluated. For example

  d <- data.frame(x=1:10,y=1:10)
  model.frame(y~log(x),d)

produces a model frame with columns 'y', 'log(x)' (not 'y', 'x').

This makes it hard (impossible?) to use the model frame to re-evaluate the existing formula in a model, e.g.

  m <- lm(y~log(x),d)
  update(m,data=model.frame(m))
  ## Error in eval(expr, envir, enclos) : object 'x' not found

It seems to me that this is a reasonable thing to want to do (i.e. use the model frame as a stored copy of the data that can be used for additional model operations); otherwise, I either need to carry along an additional copy of the data in a slot, or hope that the model is still living in an environment where it can find a copy of the original data.

Does anyone have any insights into the original design choices, or suggestions about how they have handled this within their own code? Do you just add an additional data slot to the model? I've considered trying to write some kind of 'augmented' model frame, that would contain the equivalent of

  setdiff(all.vars(formula),model.frame(m))

[i.e. all input variables that appeared in the formula but not in the model frame ...].

thanks
  Ben Bolker
model.frame(), model.matrix(), and derived predictor variables
5 messages · Ben Bolker, Gabriel Becker
7 days later
Bump: just trying one more time to see if anyone had thoughts on this (so far it's just <crickets> ...)

-------- Original Message --------
Subject: model.frame(), model.matrix(), and derived predictor variables
Date: Sat, 17 Aug 2013 12:19:58 -0400
From: Ben Bolker <bbolker at gmail.com>
To: R-devel at stat.math.ethz.ch <R-devel at stat.math.ethz.ch>

[snip]
3 days later
On 13-08-28 05:43 PM, Gabriel Becker wrote:
Ben, It works for me ...
  x = rpois(100, 5) + 1
  y = rnorm(100, x)
  d = data.frame(x,y)
  m <- lm(y~log(x),d)
  update(m,data=model.frame(m))
Call:
lm(formula = y ~ log(x), data = model.frame(m))
Coefficients:
(Intercept)       log(x)
     -4.010        5.817
That's because x and y are still lying around in your global environment. If you rm(x); rm(y) then it won't work any more. And it wouldn't have worked if you had constructed your data frame directly as

  d = data.frame(x=rpois(100,5)+1)
  d = transform(d,y=rnorm(100,x))
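The dependence on the calling environment described above can be reproduced in a fresh session. The following is a minimal sketch, assuming no object named x is visible from the workspace or search path when it starts; the data values and the names refit_fail and refit_ok are made up for illustration.

```r
# Assumption: no object called 'x' exists in the workspace at this point.
d <- data.frame(x = 1:10,
                y = c(2.1, 2.9, 3.2, 3.9, 4.1, 4.4, 4.6, 4.9, 5.0, 5.2))
m <- lm(y ~ log(x), data = d)

# The model frame holds the evaluated term 'log(x)', not the input 'x'.
mf <- model.frame(m)
names(mf)   # "y" "log(x)"

# 'x' is neither in the model frame nor in the formula's environment,
# so re-evaluating the formula against the model frame fails.
refit_fail <- try(update(m, data = mf), silent = TRUE)
inherits(refit_fail, "try-error")   # TRUE

# If an 'x' is created where the formula can see it, the same call
# succeeds, but only because it picks up that free-floating copy.
x <- d$x
refit_ok <- update(m, data = mf)
```

The second refit only works because of the stray copy of x, so it says nothing about whether the model frame alone is sufficient.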
You can also re-fit using the model.matrix directly. In your example, the model frame can be passed directly to lm.fit/lm.wfit.
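One way to read the suggestion above is sketched below: the design matrix and response are rebuilt from the stored model frame and handed to lm.fit(). This is an illustration under assumed data, not code from the thread; the names X, resp and fit are hypothetical.

```r
d <- data.frame(x = 1:10,
                y = c(2.1, 2.9, 3.2, 3.9, 4.1, 4.4, 4.6, 4.9, 5.0, 5.2))
m <- lm(y ~ log(x), data = d)

# Everything needed for the *same* model is already in the model frame.
mf   <- model.frame(m)
X    <- model.matrix(terms(m), mf)   # design matrix: (Intercept), log(x)
resp <- model.response(mf)           # response vector y

# lm.fit() takes the design matrix and response, with no formula or data.
fit <- lm.fit(X, resp)
all.equal(fit$coefficients, coef(m))
```

This path never looks up x again, which is why it is limited to refitting the model exactly as it was specified.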
Yes, if I want to refit the same model. But if I want to do
something else with the model (e.g. try fitting vs. x instead of log(x),
or some other function of x) then it doesn't work.
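For the case of changing the functional form, one possible workaround is to keep the untransformed input variables alongside the model. The sketch below uses stats::get_all_vars(), which is not mentioned in this thread, to extract them from the original data; the data values and the names inputs and m2 are assumptions for illustration. Note that get_all_vars() does not apply na.action, so rows with NAs would need separate handling.

```r
d <- data.frame(x = 1:10,
                y = c(2.1, 2.9, 3.2, 3.9, 4.1, 4.4, 4.6, 4.9, 5.0, 5.2),
                z = letters[1:10])   # not used by the model
m <- lm(y ~ log(x), data = d)

# Untransformed inputs named in the formula: 'y' and 'x', not 'log(x)'.
inputs <- get_all_vars(formula(m), data = d)
names(inputs)   # "y" "x"

# With the raw inputs stored, the model can be refit with a different
# function of x, independent of what is in the calling environment.
m2 <- update(m, . ~ x, data = inputs)
all.equal(coef(m2), coef(lm(y ~ x, data = d)))
```

Storing inputs with the fitted object plays the role of the additional data slot raised in the original question.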
cheers
Ben
~G
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1
On Sat, Aug 24, 2013 at 7:40 PM, Ben Bolker <bbolker at gmail.com> wrote:
[snip]
--
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis