
Proposal: model.data

6 messages · Brian G. Peterson, Paul Johnson, Martin Maechler

#
Greetings:

I'm still working on functions to make it easier for students to
interpret regression predictions.  I am working out a scheme to
more easily create "newdata" objects (for use in predict functions).
This has been done before in packages like Zelig, Effects,
rms, and others. Nothing is "exactly right," though. Stata users keep
flashing their "predicted probability tables" at me and I'd like
something like that to use with R.

I'm proposing here a function model.data that receives a regression
and creates a dataframe of raw data from which newdata objects
can be constructed. This follows a suggestion that Bill Dunlap made
to me in response to a question I posted in r-help.

While studying termplot code, I saw the "carrier" function approach
to deducing the "raw" predictors.  However, it does not always work.

Here is one problem: termplot mistakes the 10 in log(10 + x1) for a variable.

Example:

dat <- data.frame(x1 = rnorm(100), x2 = rpois(100, lambda=7))
STDE <- 10
dat$y <- 1.2 * log(10 + dat$x1) + 2.3 * dat$x2 + rnorm(100, sd = STDE)

m1 <- lm(y ~ log(10 + x1) + x2, data = dat)
termplot(m1)

## See the trouble? termplot thinks 10 is the term to plot.

Another problem is that predict(type = "terms") sometimes behaves
unhelpfully: right-hand-side terms that are nonlinear transformations
of the same variable are treated as separate terms.

##Example:
dat$y2 <- 1.2 * log(10 + dat$x1) + 2.3 * dat$x1^2 + rnorm(100, sd = STDE)

m2 <- lm(y2 ~ log(10 + x1) + sin(x1), data = dat)
summary(m2)

predict(m2, type="terms")

## Output:
## log(10 + x1)     sin(x1)
## 1     1.50051781 -2.04871711
## 2    -0.14707391  0.31131124

What I wish would happen instead is one "correct" prediction
for each value of x1. This should be the output:

predict(m2, newdata = data.frame(x1 = dat$x1))

## > predict(m2, newdata = data.frame(x1 = dat$x1))
##       1        2        3        4        5        6        7        8
## 17.78563 18.49806 17.50719 19.70093 17.45071 19.69718 18.84137 18.89971

The "fix" I'm testing now is the following new function, "model.data",
which tries to re-create the data object that would be consistent
with a fitted model. It follows a suggestion that Bill Dunlap made
on r-help on 2012-04-22.



##' Creates a "raw" (UNTRANSFORMED) data frame equivalent
##' to the input data that would be required to fit the given model.
##'
##' Unlike model.frame and model.matrix, this does not return transformed
##' variables.
##'
##' @param model A fitted regression model in which the data argument
##' is specified. This function will fail if the model was not fit
##' with the data option.
##' @return A data frame
##' @export
##' @author Paul E. Johnson <pauljohn@@ku.edu>
##' @example inst/examples/model.data-ex.R
model.data <- function(model){
    fmla <- formula(model)
    allnames <- all.vars(fmla) ## all variable names
    ## indep variables, includes d in poly(x,d)
    ivnames <- all.vars(formula(delete.response(terms(model))))
    ## datOrig: original data frame
    datOrig <-  eval(model$call$data, environment(formula(model)))
    if (is.null(datOrig)) stop("model.data: input model has no data frame")
    ## dat: almost right, but includes d in poly(x, d)
    dat <- get_all_vars(fmla, datOrig)
    ## Get rid of "d" and other "non variable" variable names
    ## that are not in datOrig:
    keepnames <- intersect(names(dat), names(datOrig))
    ## Keep only rows actually used in the model fit, and the correct columns
    dat <- dat[row.names(model$model), keepnames]
    ## keep ivnames that exist in datOrig
    attr(dat, "ivnames") <- intersect(ivnames, names(datOrig))
    invisible(dat)
}


This works for test cases like log(10 + x1) and so forth:

## Examples:

head(m1.data <- model.data(m1))

head(m2.data <- model.data(m2))

## > head(m1.data <- model.data(m1))
##          y          x1 x2
## 1 18.53846  0.46176539  8
## 2 28.24759  0.09720934  7
## 3 23.88184  0.67602556  9
## 4 23.50130 -0.74877054  8
## 5 25.81714  1.02555255  5
## 6 24.75052 -0.69659539  6
## > head(m2.data <- model.data(m2))
##          y          x1
## 1 18.53846  0.46176539
## 2 28.24759  0.09720934
## 3 23.88184  0.67602556
## 4 23.50130 -0.74877054
## 5 25.81714  1.02555255
## 6 24.75052 -0.69659539


d <- 2
m4 <- lm(y ~ poly(x1,d), data=dat)

head(m4.data <- model.data(m4))

##          y          x1
## 1 18.53846  0.46176539
## 2 28.24759  0.09720934
## 3 23.88184  0.67602556

Another strength of this approach is that the return object has an
"ivnames" attribute.  If R's termplot used model.data instead of the
carrier functions, the termplot code could be much tighter.
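For instance, a termplot-style loop could be driven directly off that attribute. A minimal sketch, using the m1 fit from earlier and the model.data above (the plotting call itself is only illustrative, not what termplot actually draws):

```r
## Sketch: iterate over the raw predictors recorded by model.data
m1.data <- model.data(m1)
for (iv in attr(m1.data, "ivnames")) {
    plot(m1.data[[iv]], fitted(m1), xlab = iv, ylab = "fitted values")
}
```

Because ivnames holds only untransformed variable names that really exist in the data, the loop never tries to plot against a constant like the 10 in log(10 + x1).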

What flaws do you see in this?

One flaw is that I did not know how to re-construct data from the
parent environment, so I insist that the regression model be fit with a
data argument. Is this necessary, or can one of the R experts help?
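To make that limitation concrete, here is a minimal sketch (variable and model names are invented): a model fit without data= records no data in its call, so the model.data above has nothing to evaluate.

```r
x <- rnorm(100)
y <- x + rnorm(100)
m.nodata <- lm(y ~ x)   # variables found in the calling environment
m.nodata$call$data      # NULL: no data argument was recorded in the call
## model.data(m.nodata) # would stop: "model.data: input model has no data frame"
```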

Another possible flaw: I keep only the columns of the data frame that
are needed to re-construct the model.frame, and to match the rows I
use the row.names of the model.frame.

Are there other formulae that will confuse this scheme?

If somebody in R Core would like this, and would think about putting
it, or something like it, into base R, then many chores involving
predicted values would become much easier.

PJ
#
On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote:
Why does this need to be in base?  Implement it in a package.
If it works, and is additive, people will use it.  Look at 'reshape' or
'xts' or 'Matrix' just to name a few examples of widely used packages.

Regards,

   - Brian
#
Greetings:
On Thu, May 3, 2012 at 11:36 AM, Brian G. Peterson <brian at braverock.com> wrote:
I can't use it to fix termplot unless it is in base.

Or are you suggesting I create my own "termplot" replacement?

#
On Thu, 2012-05-03 at 12:09 -0500, Paul Johnson wrote:
I was suggesting that you create a package that has all the features
that you think it needs.  

If you have a *patch* for termplot that would fix what you perceive to
be its problems, and not break existing code, then the usual method
would be to propose that.  It seems, though, that you are proposing more
significant changes to functionality, and it seems as though that would
run a risk of breaking backwards compatibility, which is usually a bad
idea.

Regards,

   - Brian
#
On Thu, May 3, 2012 at 12:19 PM, Brian G. Peterson <brian at braverock.com> wrote:
So, nobody agrees with me that R base should have model.data?  Too
bad!  You could save us a lot of effort.  I've found different efforts
to get the same work done in termplot and all of the implementations
of predict. And just about every regression-related package has its
own approach.

If there were one good way that would always work, just think how
convenient it would be for all those function & package writers.  Oh,
well. I'm not saying that my model.data is perfect, just that I wish
there were a perfect one :)

Yesterday, I realized that predict.nls probably has to deal with this
problem so I studied that and have yet another version of model.data
to propose to you.  I'm using this in my regression-support package
rockchalk, so you don't need to give me Brian Peterson's advice to
"put it in a package".

The idea here is to take variables from a fitted model's data if it
can find them, and then grab variables from the parent environment IF
they have the correct length.  This means we ignore variables like d
in poly(x, d) because the variable d is not of the same length as the
variables in the model.
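The filtering rests on eval()'s lookup order: a name is resolved in the data frame first, then in the enclosing environment, and anything whose length differs from the model variables gets dropped. A minimal sketch of that rule (run at top level, so globalenv() stands in for the formula environment):

```r
dat <- data.frame(x = 1:5)
d <- 2                                # a "constant" living outside dat
eval(as.name("x"), dat, globalenv())  # found in dat: 1 2 3 4 5
eval(as.name("d"), dat, globalenv())  # falls through to the environment: 2
## d has length 1, not nrow(dat), so the length test screens it out
length(eval(as.name("d"), dat, globalenv())) == nrow(dat)   # FALSE
```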

##' Creates a "raw" (UNTRANSFORMED) data frame equivalent
##' to the input data that would be required to fit the given model.
##'
##' Unlike model.frame and model.matrix, this does not return transformed
##' variables.
##'
##' @param model A fitted regression model in which the data argument
##' is specified. This function will fail if the model was not fit
##' with the data option.
##' @return A data frame
##' @export
##' @author Paul E. Johnson <pauljohn@@ku.edu>
##' @example inst/examples/model.data-ex.R
model.data <- function(model){
    ## borrowed from predict.nls; returns -1 for missing variables
    lenVar <- function(var, data)
        tryCatch(length(eval(as.name(var), data, env)),
                 error = function(e) -1)
    fmla <- formula(model)
    varNames <- all.vars(fmla) ## all variable names
    ## varNames includes d in poly(x,d), possibly other "constants"
    ## varNamesRHS <- all.vars(formula(delete.response(terms(model))))
    ## previous same as nls way?
    fmla2 <- fmla
    fmla2[[2L]] <- 0
    varNamesRHS <- all.vars(fmla2)
    varNamesLHS <- setdiff(varNames, varNamesRHS)
    env <- environment(fmla)
    if (is.null(env))
        env <- parent.frame()

    dataOrig <- eval(model$call$data, environment(formula(model)))
    rndataOrig <- row.names(dataOrig)
    n <- sapply(varNames, lenVar, data=dataOrig)
    targetLength <- length(eval(as.name(varNamesLHS[1]), dataOrig, env))
    varNames <- varNames[ n == targetLength ]
    ldata <- lapply(varNames, function(x) eval(as.name(x), dataOrig, env))
    names(ldata) <- varNames
    data <- data.frame(ldata[varNames])
    if (!is.null(rndataOrig)) row.names(data) <- rndataOrig
    ## remove rows listed in model's na.action
    ## TODO: question: what else besides OMIT might be called for?
    if (!is.null(model$na.action)) {
        data <- data[-as.vector(model$na.action), ]
    }
    ## keep varNamesRHS that exist in datOrig
    attr(data, "varNamesRHS") <- setdiff(colnames(data), varNamesLHS)
    invisible(data)
}
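The exact driver script is not shown in the thread; calls of roughly this shape (hypothetical data and model names throughout, using the second model.data defined above) would produce listings like those below:

```r
## Hypothetical driver: poly(x1, d) with d outside the data frame
dat <- data.frame(x1 = rpois(100, lambda = 6),
                  x2 = rnorm(100, mean = 50, sd = 10))
dat$y <- 0.4 * dat$x1 + 0.2 * dat$x2 + rnorm(100, sd = 20)
d <- 2
m5 <- lm(y ~ poly(x1, d), data = dat)
m5.data <- model.data(m5)
head(m5.data)                 # raw y and x1; no column for d
attr(m5.data, "varNamesRHS")  # "x1"
```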

And some example output:
          y x1       x2         x3
1 -0.8086741  7 44.59614 -1.6193283
2  1.0011198  9 69.47693  0.5483979
3  0.4560525  8 50.53590  0.1952822
6  0.6417692  4 52.77954 -0.2509466
7 -0.4150210  5 56.91171  1.6993467
8 -0.4595757  6 58.23795 -0.3442988

           y x1
1  56.336279  7
2  47.823205  5
3   9.296108  6
4  16.213508  5
5 -16.922331  3
6  10.639724  7
[1] "x1"

           y x1       x2
1  56.336279  7 56.27965
2  47.823205  5 50.02144
3   9.296108  6 52.84378
4  16.213508  5 39.98221
5 -16.922331  3 43.82778
6  10.639724  7 58.28194
[1] "x1" "x2"

           y x1
1  56.336279  7
2  47.823205  5
3   9.296108  6
4  16.213508  5
5 -16.922331  3
6  10.639724  7
$numerics
         x1        y
0%    1.000 -31.9800
25%   5.000   0.9636
50%   6.000  13.8400
75%   7.000  27.5300
100% 12.000  71.1100
mean  5.933  14.8400
sd    2.336  20.4300
var   5.456 417.3000
NA's  0.000   0.0000
N    90.000  90.0000

$factors
NULL
[1] "x1"

           y x1       x2         x3
2  47.823205  5 50.02144 -2.4669386
3   9.296108  6 52.84378  0.4847158
4  16.213508  5 39.98221 -0.9379723
5 -16.922331  3 43.82778  3.3307333
7  26.084587  3 49.15181  0.2204558
8   4.392061  8 45.65280  0.8762108
$numerics
         x1     x2       x3        y
0%    1.000 27.630 -2.55600 -31.9800
25%   4.000 42.370 -0.52690   0.7905
50%   6.000 49.150  0.04162  11.5800
75%   7.000 54.390  0.71310  26.8900
100% 12.000 72.390  3.33100  71.1100
mean  5.883 48.800  0.02186  13.2300
sd    2.362  8.847  1.08900  20.0400
var   5.578 78.270  1.18700 401.8000
NA's  0.000  0.000  0.00000   0.0000
N    77.000 77.000 77.00000  77.0000

$factors
NULL
[1] "x1" "x2" "x3"

           y x1       x2
1  56.336279  7 56.27965
2  47.823205  5 50.02144
3   9.296108  6 52.84378
4  16.213508  5 39.98221
5 -16.922331  3 43.82778
7  26.084587  3 49.15181
$numerics
         x1     x2        y
0%    1.000 27.630 -31.9800
25%   4.000 42.630   0.8032
50%   6.000 49.710  12.2500
75%   7.000 54.910  27.2300
100% 12.000 72.390  71.1100
mean  5.918 49.220  14.0800
sd    2.372  8.808  20.0000
var   5.624 77.590 399.9000
NA's  0.000  0.000   0.0000
N    85.000 85.000  85.0000

$factors
NULL
[1] "x1" "x2"

           y x1       x2        x10         x11
1  56.336279  7 56.27965 -0.4562360 -0.27005578
2  47.823205  5 50.02144 -1.4031737 -1.11355194
3   9.296108  6 52.84378 -0.2816067 -0.08319405
4  16.213508  5 39.98221  0.3353308  0.17284191
5 -16.922331  3 43.82778 -0.5184262  0.61754378
7  26.084587  3 49.15181  0.9037821  0.02170007
[1] 85  5
$numerics
         x1     x2        y
0%    1.000 27.630 -31.9800
25%   4.000 42.630   0.8032
50%   6.000 49.710  12.2500
75%   7.000 54.910  27.2300
100% 12.000 72.390  71.1100
mean  5.918 49.220  14.0800
sd    2.372  8.808  20.0000
var   5.624 77.590 399.9000
NA's  0.000  0.000   0.0000
N    85.000 85.000  85.0000

$factors
NULL
[1] "x1"  "x2"  "x10" "x11"

#
PJ> Greetings: On Thu, May 3, 2012 at 11:36 AM, Brian
PJ> G. Peterson <brian at braverock.com> wrote:
>> On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote:
>>> If somebody in R Core would like this and think about
>>> putting it, or something like it, into the base, then
>>> many chores involving predicted values would become much
>>> easier.
>>>
>> Why does this need to be in base?  Implement it in a
>> package.
>> If it works, and is additive, people will use it.  Look
>> at 'reshape' or 'xts' or 'Matrix' just to name a few
>> examples of widely used packages.
>>
    PJ> I can't use it to fix termplot unless it is in base.

Indeed.  I think it would be useful to "branch this topic"
into one part where it is about fixing
     termplot()  and  predict(*, type = "terms")
because -- as it happens -- I've got this on my todo list to
look at, for a couple of weeks now.
I also think there are "infelicities" in there
that could/should be improved.

    PJ> Or are you suggesting I create my own "termplot"
    PJ> replacement?

not necessarily.. ;-)

Martin