predict.lm(...,type="terms") question - R-help

Fri, Aug 31, 2012 10:40 PM #

On Aug 31, 2012, at 3:48 PM, jjthaden wrote:

This was the code you offered:

#Ludbrook's data set S1 (except renaming
#his 'x' as 'concn' and his 'y' as 'area')
S1 <- data.frame(
     area = c(2.4,2.6,6.0,6.5,8.9,),
     concn = c(1.1,4,5,8.5,8.5))

There's an extra comma in the "area" vector.

I don't really know  ....  this appears to be a different problem than  
you posed earlier. If you would learn to post with context, we might  
not be in the position of trying to read your mind.

Yes. I have no trouble doing so.

 >  x <- rnorm(15); x2=rnorm(15)
 >  y <- x + x2 +rnorm(15)
 >  fit <- lm(y ~ x+x2)
 >  new <- data.frame(x2 = seq(-3.5, 3.5, 0.5)  )
 >  predict(fit, newdata=new,type="terms", terms="x2")
            x2
1  -4.9230035
2  -4.1850588
3  -3.4471141
4  -2.7091694
5  -1.9712248
6  -1.2332801
7  -0.4953354
8   0.2426093
9   0.9805539
10  1.7184986
11  2.4564433
12  3.1943880
13  3.9323326
14  4.6702773
15  5.4082220
attr(,"constant")
[1] 0.4501911

Hmmm ... Nabble ... definitely part of the problem.

David Winsemius, MD
Alameda, CA, USA

Rui Barradas

Sat, Sep 1, 2012 4:33 AM #

Hello,

John is right, there seems to be an error in predict.lm. It can be made 
to work but if the model is fitted with lm(...,data) then it messes 
things up. Apparently predict.lm disregards the data argument and uses 
whatever it finds in the global environment. In the examples below, 
after running David's x <- rnorm(15) example (case 1 is a simplified 
version of it), case 3 shows the bug, there are 15 predictions for 4 new 
values. Then we must backtrack to case 2 and also consider it a bug.

Examples:

# ------------------- 1
# David's example simplified, only one regressor.
# And with set.seed()

set.seed(4409)
x <- rnorm(15)
y <- x + rnorm(15)

fit <- lm(y ~ x)
new <- data.frame(y = seq(-3.5, 3.5, 0.5))
predict(fit, newdata = new, type = "terms")
# no error is thrown.

# ------------------- 2
# Original post, 'area' and 'concn' are now 'y' and 'x'
# Remove 'x' and 'y' from the global environment.
rm(x, y)
dat <- data.frame(
     y = c(4875, 8172, 18065, 34555),
     x = c(25, 50, 125, 250))
new <- data.frame(y = c(8172, 10220, 11570, 24150))

model <- lm(y ~ x, data = dat)
predict(model, newdata = new, type = "terms")
# Error in eval(expr, envir, enclos) : object 'x' not found

# ------------------- 3
# Original post.
# Do NOT remove 'x' and 'y' from the global environment.
set.seed(4409)
x <- rnorm(15)
y <- x + rnorm(15)

dat <- data.frame(
     y = c(4875, 8172, 18065, 34555),
     x = c(25, 50, 125, 250))
new <- data.frame(y = c(8172, 10220, 11570, 24150))

model <- lm(y ~ x, data = dat)
predict(model, newdata = new, type = "terms")
# Warning message:
# 'newdata' had 4 rows but variable(s) found have 15 rows

#------------------- 4
# Original post, no data argument to lm.

y <- c(4875, 8172, 18065, 34555)
x <- c(25, 50, 125, 250)
new <- data.frame(y = c(8172, 10220, 11570, 24150))

model <- lm(y ~ x)
predict(model, newdata = new, type = "terms")
# no warning or error is thrown, predictions seem allright

I haven't looked at the code for predict.lm, and really don't know where 
the error is.
Bug report?

Rui Barradas

Em 01-09-2012 06:40, David Winsemius escreveu:

On Aug 31, 2012, at 3:48 PM, jjthaden wrote:

On Aug 30, 2012, at 4:35 PM,  David Windemius wrote:

David said my newdata data frame 'new' must have a column named 
'area'.
It did. Nonetheless predict.lm throws an error with type = "terms" and
newdata = new. I see nothing in the predict.lm documentation that
bars this usage. Is there a bug?

After correcting the error in your definition of the 'area' vector I
get no error from predict.lm().

David.

What error did you correct?

This was the code you offered:

#Ludbrook's data set S1 (except renaming
#his 'x' as 'concn' and his 'y' as 'area')
S1 <- data.frame(
    area = c(2.4,2.6,6.0,6.5,8.9,),
    concn = c(1.1,4,5,8.5,8.5))

There's an extra comma in the "area" vector.

The newdata data frames in my examples have been pretty simple, and have
defined the 'area' vector simply, for instance,
new <- data.frame(area = c(8172, 10220, 11570, 24150))
new
#        area
#    1  8172
#    2 10220
#    3 11570
#    4 24150

Is something wrong with this?

I don't really know  ....  this appears to be a different problem than 
you posed earlier. If you would learn to post with context, we might 
not be in the position of trying to read your mind.

 Were you able to make predict.lm work with newdata = new and type = 
"terms"?

Yes. I have no trouble doing so.

 x <- rnorm(15); x2=rnorm(15)
 y <- x + x2 +rnorm(15)
 fit <- lm(y ~ x+x2)
 new <- data.frame(x2 = seq(-3.5, 3.5, 0.5)  )
 predict(fit, newdata=new,type="terms", terms="x2")

           x2
1  -4.9230035
2  -4.1850588
3  -3.4471141
4  -2.7091694
5  -1.9712248
6  -1.2332801
7  -0.4953354
8   0.2426093
9   0.9805539
10  1.7184986
11  2.4564433
12  3.1943880
13  3.9323326
14  4.6702773
15  5.4082220
attr(,"constant")
[1] 0.4501911

-John

Sent from the (NOT) R help mailing list (and NOT) archive at Nabble.com.

Hmmm ... Nabble ... definitely part of the problem.

David Winsemius

Sat, Sep 1, 2012 6:38 PM #

On Sep 1, 2012, at 4:33 AM, Rui Barradas wrote:

That is not how I would describe the "problem". See below.

Why should predict not complain when it is offered a newdata argument  
that does no contain a vector of values for "x"? The whole point of  
the terms method of prediction is to offer estimates for specific  
values of items on the RHS of the formula. The OP seems to have  
trouble understanding that point. Putting in a vector with the name of  
the LHS item makes no sense to me. I certainly cannot see that any  
particular behavior for this pathological input is described for  
predict.lm in its help page, but throwing an error seems perfectly  
reasonable to me.

David.

>
> # ------------------- 3
> # Original post.
> # Do NOT remove 'x' and 'y' from the global environment.
> set.seed(4409)
> x <- rnorm(15)
> y <- x + rnorm(15)
>
> dat <- data.frame(
>    y = c(4875, 8172, 18065, 34555),
>    x = c(25, 50, 125, 250))
> new <- data.frame(y = c(8172, 10220, 11570, 24150))
>
> model <- lm(y ~ x, data = dat)
> predict(model, newdata = new, type = "terms")
> # Warning message:
> # 'newdata' had 4 rows but variable(s) found have 15 rows
>
> #------------------- 4
> # Original post, no data argument to lm.
>
> y <- c(4875, 8172, 18065, 34555)
> x <- c(25, 50, 125, 250)
> new <- data.frame(y = c(8172, 10220, 11570, 24150))
>
> model <- lm(y ~ x)
> predict(model, newdata = new, type = "terms")
> # no warning or error is thrown, predictions seem allright
>
> I haven't looked at the code for predict.lm, and really don't know  
> where the error is.
> Bug report?
>
> Rui Barradas
>
> Em 01-09-2012 06:40, David Winsemius escreveu:
>>
>> On Aug 31, 2012, at 3:48 PM, jjthaden wrote:
>>
>>> On Aug 30, 2012, at 4:35 PM,  David Windemius wrote:
>>>
>>>>> David said my newdata data frame 'new' must have a column named  
>>>>> 'area'.
>>>>> It did. Nonetheless predict.lm throws an error with type =  
>>>>> "terms" and
>>>>> newdata = new. I see nothing in the predict.lm documentation that
>>>>> bars this usage. Is there a bug?
>>>>
>>>> After correcting the error in your definition of the 'area'  
>>>> vector I
>>>> get no error from predict.lm().
>>>
>>>> David.
>>>
>>> What error did you correct?
>>
>> This was the code you offered:
>>
>> #Ludbrook's data set S1 (except renaming
>> #his 'x' as 'concn' and his 'y' as 'area')
>> S1 <- data.frame(
>>    area = c(2.4,2.6,6.0,6.5,8.9,),
>>    concn = c(1.1,4,5,8.5,8.5))
>>
>> There's an extra comma in the "area" vector.
>>
>>>
>>> The newdata data frames in my examples have been pretty simple,  
>>> and have
>>> defined the 'area' vector simply, for instance,
>>> new <- data.frame(area = c(8172, 10220, 11570, 24150))
>>> new
>>> #        area
>>> #    1  8172
>>> #    2 10220
>>> #    3 11570
>>> #    4 24150
>>>
>>> Is something wrong with this?
>>
>> I don't really know  ....  this appears to be a different problem  
>> than you posed earlier. If you would learn to post with context, we  
>> might not be in the position of trying to read your mind.
>>
>>> Were you able to make predict.lm work with newdata = new and type  
>>> = "terms"?
>>
>> Yes. I have no trouble doing so.
>>
>> >  x <- rnorm(15); x2=rnorm(15)
>> >  y <- x + x2 +rnorm(15)
>> >  fit <- lm(y ~ x+x2)
>> >  new <- data.frame(x2 = seq(-3.5, 3.5, 0.5)  )
>> >  predict(fit, newdata=new,type="terms", terms="x2")
>>           x2
>> 1  -4.9230035
>> 2  -4.1850588
>> 3  -3.4471141
>> 4  -2.7091694
>> 5  -1.9712248
>> 6  -1.2332801
>> 7  -0.4953354
>> 8   0.2426093
>> 9   0.9805539
>> 10  1.7184986
>> 11  2.4564433
>> 12  3.1943880
>> 13  3.9323326
>> 14  4.6702773
>> 15  5.4082220
>> attr(,"constant")
>> [1] 0.4501911
>>
>>
>>>
>>> -John
>>>
>>
>>> Sent from the (NOT) R help mailing list (and NOT) archive at  
>>> Nabble.com.
>>
>> Hmmm ... Nabble ... definitely part of the problem.
>>
>

David Winsemius, MD
Alameda, CA, USA

Peter Dalgaard

Sat, Sep 1, 2012 11:35 PM #

On Sep 2, 2012, at 03:38 , David Winsemius wrote:

Yes. Lots of confusion going on here. 

First, data= is _always_ used as the _first_ place to look for variables, if things are not in it, search continues into the formula's environment. To be slightly perverse, notice that even this works:

Call:
lm(formula = y ~ x, data = d)

Coefficients:
(Intercept)            x  
    -0.2760       0.2328  

Secondly, what is predict(..., type="terms") supposed to have to do with inverting a regression equation? That's just not what it does, it only splits the prediction formula into its constituent terms.

Thirdly; no, you do not invert a regression equation by regressing y on x. That only works if you can be sure that your new (x, y) are sampled from the same population as the data, which is not going to be the case if you are fitting to data with, say, selected equispaced x values. There's a whole literature on how to do this properly, Google e.g. "inverse calibration" for enlightenment.

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

John

Sun, Sep 2, 2012 10:07 AM #

Thank you all. My muddle about predict.lm(..., type = "terms") was evident even in my first sentence of my original posting

the answer being that I cannot; that new response values, if included in newdata, will simply be ignored by predict.lm, as well they should.

As for the calibration issue, I am reviewing literature now as suggested.

Though predict.lm performed to spec (no bug), may I suggest a minor
change to ?predict.lm text?

Existing: 
 newdata  An optional data frame in which to 
          look for variables with  
          which to predict. If omitted, 
          the fitted values are used.
Proposed: 
 newdata  An optional data frame in which to 
          look for new values of terms with 
          which to predict. If omitted, the 
          fitted values are used.

-John Thaden, Ph.D.
 College Station, TX

--- On Sun, 9/2/12, peter dalgaard <pdalgd at gmail.com> wrote:

From: peter dalgaard <pdalgd at gmail.com>
Subject: Re: [R] predict.lm(...,type="terms") question
To: "David Winsemius" <dwinsemius at comcast.net>
Cc: "Rui Barradas" <ruipbarradas at sapo.pt>, r-help at r-project.org, "jjthaden" <jjthaden at flash.net>
Date: Sunday, September 2, 2012, 1:35 AM

On Sep 2, 2012, at 03:38 , David Winsemius wrote:

Why should predict not complain when it is offered a

newdata argument that does no contain a vector of values for
"x"? The whole point of the terms method of prediction is to
offer estimates for specific values of items on the RHS of
the formula. The OP seems to have trouble understanding that
point. Putting in a vector with the name of the LHS item
makes no sense to me. I certainly cannot see that any
particular behavior for this pathological input is described
for predict.lm in its help page, but throwing an error seems
perfectly reasonable to me.

Yes. Lots of confusion going on here. 

First, data= is _always_ used as the _first_ place to look
for variables, if things are not in it, search continues
into the formula's environment. To be slightly perverse,
notice that even this works:

y <- rnorm(10)
x <- rnorm(10)
d <- data.frame(z=rnorm(9))
lm(y ~ x, d)

Call:
lm(formula = y ~ x, data = d)

Coefficients:
(Intercept)? ? ? ? ? ? x?

? ? -0.2760? ?
???0.2328? 

Secondly, what is predict(..., type="terms") supposed to
have to do with inverting a regression equation? That's just
not what it does, it only splits the prediction formula into
its constituent terms.

Thirdly; no, you do not invert a regression equation by
regressing y on x. That only works if you can be sure that
your new (x, y) are sampled from the same population as the
data, which is not going to be the case if you are fitting
to data with, say, selected equispaced x values. There's a
whole literature on how to do this properly, Google e.g.
"inverse calibration" for enlightenment.? 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk?
Priv: PDalgd at gmail.com

Rui Barradas

Sun, Sep 2, 2012 11:25 AM #

Hello,

In the mean time, I've "discovered" an R package that might do the job, 
chemCal. It only implements first order linear regression (weighted), 
and like I'd said in my first reply, and was repeated several times, the 
standard errors are not reversed, the prediction bands follow Massart et 
al. (1988).
Take a look at it, it's a very simple package.

Rui Barradas

Em 02-09-2012 18:07, John Thaden escreveu:

Thank you all. My muddle about predict.lm(..., type = "terms") was evident even in my first sentence of my original posting

How can I actually use the output of
predict.lm(..., type="terms") to predict
new term values from new response values?

the answer being that I cannot; that new response values, if included in newdata, will simply be ignored by predict.lm, as well they should.

As for the calibration issue, I am reviewing literature now as suggested.

Though predict.lm performed to spec (no bug), may I suggest a minor
change to ?predict.lm text?

Existing:
  newdata  An optional data frame in which to
           look for variables with
           which to predict. If omitted,
           the fitted values are used.
Proposed:
  newdata  An optional data frame in which to
           look for new values of terms with
           which to predict. If omitted, the
           fitted values are used.

-John Thaden, Ph.D.
  College Station, TX

--- On Sun, 9/2/12, peter dalgaard <pdalgd at gmail.com> wrote:

From: peter dalgaard <pdalgd at gmail.com>
Subject: Re: [R] predict.lm(...,type="terms") question
To: "David Winsemius" <dwinsemius at comcast.net>
Cc: "Rui Barradas" <ruipbarradas at sapo.pt>, r-help at r-project.org, "jjthaden" <jjthaden at flash.net>
Date: Sunday, September 2, 2012, 1:35 AM

On Sep 2, 2012, at 03:38 , David Winsemius wrote:

Why should predict not complain when it is offered a

newdata argument that does no contain a vector of values for
"x"? The whole point of the terms method of prediction is to
offer estimates for specific values of items on the RHS of
the formula. The OP seems to have trouble understanding that
point. Putting in a vector with the name of the LHS item
makes no sense to me. I certainly cannot see that any
particular behavior for this pathological input is described
for predict.lm in its help page, but throwing an error seems
perfectly reasonable to me.

Yes. Lots of confusion going on here.

First, data= is _always_ used as the _first_ place to look
for variables, if things are not in it, search continues
into the formula's environment. To be slightly perverse,
notice that even this works:

y <- rnorm(10)
x <- rnorm(10)
d <- data.frame(z=rnorm(9))
lm(y ~ x, d)

Call:
lm(formula = y ~ x, data = d)

Coefficients:
(Intercept)            x

     -0.2760
    0.2328

Secondly, what is predict(..., type="terms") supposed to
have to do with inverting a regression equation? That's just
not what it does, it only splits the prediction formula into
its constituent terms.

Thirdly; no, you do not invert a regression equation by
regressing y on x. That only works if you can be sure that
your new (x, y) are sampled from the same population as the
data, which is not going to be the case if you are fitting
to data with, say, selected equispaced x values. There's a
whole literature on how to do this properly, Google e.g.
"inverse calibration" for enlightenment.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk
Priv: PDalgd at gmail.com