Skip to content

predict.glm returns different results for the same model

6 messages · Avraham Adler, Joris Meys, Duncan Murdoch +2 more

#
Hi all,

Very surprising (to me!) and mystifying result from predict.glm(): the
predictions vary depending on whether or not I use ns() or
splines::ns(). Reprex follows:

library(splines)

set.seed(12345)
dat <- data.frame(claim = rbinom(1000, 1, 0.5))
mns <- c(3.4, 3.6)
sds <- c(0.24, 0.35)
dat$wind <- exp(rnorm(nrow(dat), mean = mns[dat$claim + 1], sd =
sds[dat$claim + 1]))
dat <- dat[order(dat$wind), ]

m1 <- glm(claim ~ ns(wind, df = 5), data = dat, family = binomial)
m2 <- glm(claim ~ splines::ns(wind, df = 5), data = dat, family = binomial)

# The model coefficients are the same
unname(coef(m1))
#> [1]  0.5194712 -0.8687737 -0.6803954  4.0838947  2.3908674  4.1564128
unname(coef(m2))
#> [1]  0.5194712 -0.8687737 -0.6803954  4.0838947  2.3908674  4.1564128

# But the predictions are not!
newdf <- data.frame(wind = seq(min(dat$wind), max(dat$wind), length = 5))
unname(predict(m1, newdata = newdf))
#> [1] 0.51947119 0.03208719 2.82548847 3.90883496 4.06743266
unname(predict(m2, newdata = newdf))
#> [1]  0.5194712 -0.5666554 -0.1731268  2.8134844  3.9295814

Is this a bug?

(Motivating issue from this ggplot2 bug report:
https://github.com/tidyverse/ggplot2/issues/2426)

Thanks!

Hadley
#
Can?t copy from my computer as gmail is blocked at work but if it helps,
the ?predvars? attribute if the terms object is not the same.

Avi
On Fri, Apr 27, 2018 at 9:25 AM Hadley Wickham <h.wickham at gmail.com> wrote:

            
#
Hi Hadley,

This is related to how the terms are constructed. If you look at terms(m1)
versus terms(m2), you see that in the case of m1 the knots are added to the
attribute predvars. Contrary, when using splines::ns() this doesn't happen.
Compare:

mf <- model.frame(claim ~ ns(wind, df = 5), data = dat)
mt <-  terms(mf)

with

mf2 <- model.frame(claim ~ splines::ns(wind, df = 5), data = dat)
mt2 <-  terms(mf)

The culprit is actually in makepredictcall.ns, specifically:

    if (as.character(call)[1L] != "ns")
        return(call)

Obviously the call starting with predict::ns() will cause the function to
return the call unaltered. In the case of ns() it processes the knot
information. And that difference causes the difference in predictions.

So it is a bug imho. Especially since I don't understand the logic. The
variable has class 'ns', so makepredictcall() dispatches correctly. That
extra check in the function makepredictcall.ns seems unnecessary to me and
might be simply removed.

Cheers
Joris
On Fri, Apr 27, 2018 at 3:25 PM, Hadley Wickham <h.wickham at gmail.com> wrote:

            

  
    
#
On 27/04/2018 9:25 AM, Hadley Wickham wrote:
The two objects m1 and m2 differ more than they should, so this looks 
like a problem in glm, not just in predict.glm.

 > attr(m1$terms, "predvars")
list(claim, ns(wind, knots = c(25.4756277492997, 30.2270250736796,
35.4093171222502, 43.038645381669), Boundary.knots = c(12.9423820390783,
108.071583734075), intercept = FALSE))

 > attr(m2$terms, "predvars")
list(claim, splines::ns(wind, df = 5))

This appears to be due to a bug in the splines package.  There, the 
function splines:::makepredictcall.ns looks like this:

makepredictcall.ns <- function(var, call)
{
     if(as.character(call)[1L] != "ns") return(call)
     at <- attributes(var)[c("knots", "Boundary.knots", "intercept")]
     xxx <- call[1L:2L]
     xxx[names(at)] <- at
     xxx
}

The test fails for m2, because as.character(call)[1L] is "splines::ns" 
instead of "ns". I'll see if I can work out a better test and submit a 
patch.

Duncan Murdoch
#
On Fri, Apr 27, 2018 at 7:28 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
Great, thanks!
#

        
> On 27/04/2018 9:25 AM, Hadley Wickham wrote:
>> Hi all,
    >> 
    >> Very surprising (to me!) and mystifying result from predict.glm(): the
    >> predictions vary depending on whether or not I use ns() or
    >> splines::ns(). Reprex follows: >
    >> library(splines)
    >> 
    >> set.seed(12345)
    >> dat <- data.frame(claim = rbinom(1000, 1, 0.5))
    >> mns <- c(3.4, 3.6)
    >> sds <- c(0.24, 0.35)
    >> dat$wind <- exp(rnorm(nrow(dat), mean = mns[dat$claim + 1], sd =
    >> sds[dat$claim + 1]))
    >> dat <- dat[order(dat$wind), ]
    >> 
    >> m1 <- glm(claim ~ ns(wind, df = 5), data = dat, family = binomial)
    >> m2 <- glm(claim ~ splines::ns(wind, df = 5), data = dat, family = binomial)
    >> 
    >> # The model coefficients are the same
    >> unname(coef(m1))
    >> #> [1]  0.5194712 -0.8687737 -0.6803954  4.0838947  2.3908674  4.1564128
    >> unname(coef(m2))
    >> #> [1]  0.5194712 -0.8687737 -0.6803954  4.0838947  2.3908674  4.1564128
    >> 
    >> # But the predictions are not!
    >> newdf <- data.frame(wind = seq(min(dat$wind), max(dat$wind), length = 5))
    >> unname(predict(m1, newdata = newdf))
    >> #> [1] 0.51947119 0.03208719 2.82548847 3.90883496 4.06743266
    >> unname(predict(m2, newdata = newdf))
    >> #> [1]  0.5194712 -0.5666554 -0.1731268  2.8134844  3.9295814
    >> 
    >> Is this a bug?

    > The two objects m1 and m2 differ more than they should, so this looks 
    > like a problem in glm, not just in predict.glm.

    >> attr(m1$terms, "predvars")
    > list(claim, ns(wind, knots = c(25.4756277492997, 30.2270250736796,
    > 35.4093171222502, 43.038645381669), Boundary.knots = c(12.9423820390783,
    > 108.071583734075), intercept = FALSE))

    >> attr(m2$terms, "predvars")
    > list(claim, splines::ns(wind, df = 5))

    > This appears to be due to a bug in the splines package.  There, the 
    > function splines:::makepredictcall.ns looks like this:

    > makepredictcall.ns <- function(var, call)
    > {
    > if(as.character(call)[1L] != "ns") return(call)
    > at <- attributes(var)[c("knots", "Boundary.knots", "intercept")]
    > xxx <- call[1L:2L]
    > xxx[names(at)] <- at
    > xxx
    > }

    > The test fails for m2, because as.character(call)[1L] is "splines::ns" 
    > instead of "ns". I'll see if I can work out a better test and submit a 
    > patch.

    > Duncan Murdoch

Thank you, Duncan, for the good diagnosis and the patch!

I will deal with it.

Two things however (both *not* needing a new patch; I'll deal
    	   with it after dinner ;-)

1) I'd like to commit that ASAP, but do want to add a *minimal*
   Reprex for regression test, rather than the above, notably as I
   see that the only place our code calls  makepredictcall() is in
   model.frame.default(), so I thought this should not really be
   related to glm at all, and I was right.

2) Reading the help page ?makepredictcall
    --- yes, a good idea, even for experts, believe me, ((notably when it is from	 a base package, not produced from roxy.. ;-\)) ---
   reveals what I vaguely remembered:
   The function was introduced to fix a bug first encountered in
      lm(. ~  poly(.))

  and indeed:

  -> That help page actually contains an indirect closer to minimal reprex.
  -> I will also patch the  makepredictcall.poly() method.

  [ as I said: after dinner .. ;-)]

--
Martin