
Issue with predict() for glm models

10 messages · jrausch@nd.edu, Uwe Ligges, Paul Johnson +2 more

#
Hello everyone, 

I am having a problem using the predict (or the predict.glm) function in R.
Basically, I run the glm model on a "training" data set and try to obtain
predictions for a set of new predictors from a "test" data set (i.e., not the
predictors that were utilized to obtain the glm parameter estimates).
Unfortunately, every time that I attempt this, I obtain the predictions for the
predictors that were used to fit the glm model. I have looked at the R mailing
list archives and don't believe I am making the same mistakes that have been
made in the past and also have tried to closely follow the predict.glm example
in the help file. Here is an example of what I am trying to do: 

########################################################
set.seed(545345)

################
# Necessary Variables # 
################

p <- 2
train.n <- 20
test.n <- 25 
mean.vec.1 <- c(1,1)
mean.vec.2 <- c(0,0)

Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
Sigma.2 <- matrix(c(1,.5,.5,1),p,p)

###############
# Load MASS Library #
###############

library(MASS)

###########################################################
# Data to Estimate Parameters for Logistic Regression Model #
###########################################################

train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
predictors.train <- rbind(train.data.1,train.data.2)

###########################################################
# Test Data for Which Predicted Probabilities from the    #
# Training-Data Logistic Regression are of Interest       #
###########################################################

test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
predictors.test <- rbind(test.data.1,test.data.2)

##############################
# Run Logistic Regression on Training Data #
##############################

log.reg <- glm(train.class.var~predictors.train,
family=binomial(link="logit"))
log.reg

#> log.reg

#Call:  glm(formula = train.class.var ~ predictors.train, family =
#binomial(link = "logit")) 
#
#Coefficients:
#      (Intercept)  predictors.train1  predictors.train2  
#           0.5105            -0.2945            -1.0811  
#
#Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
#Null Deviance:      55.45 
#Residual Deviance: 41.67        AIC: 47.67 

###########################
# Predicted Probabilities for Test Data #
###########################

New.Data <- data.frame(predictors.train1=predictors.test[,1],
predictors.train2=predictors.test[,2])

logreg.pred.prob.test <- predict.glm(log.reg,New.Data,type="response")
logreg.pred.prob.test

#logreg.pred.prob.test
# [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589 0.71331091
# [7] 0.17320087 0.14176632 0.30966718 0.61878952 0.12525988 0.21271139
#[13] 0.70068113 0.18340723 0.10295501 0.44591568 0.72285161 0.31499339
#[19] 0.65789420 0.42750139 0.14435889 0.93008117 0.70798465 0.80109005
#[25] 0.89161472 0.47480625 0.56520952 0.63981834 0.57595189 0.60075882
#[31] 0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955
#[37] 0.80855782 0.90835588 0.54809117 0.11568637
########################################################

Of course, notice that the vector of predicted probabilities has only 40
elements, while "New.Data" has 50 rows (since test.n is 25 per group for 2
groups) and should therefore yield 50 predicted probabilities. As it turns out,
the output is for the training-data predictors and not for the "New.Data", as I
would like it to be. I should also note that I have made sure that the names
of the predictors in "New.Data" are the same as the names of the
predictors within the glm object (i.e., within "log.reg"), as this is what is
done in the example for predict.glm() within the help files.

Could someone help me understand either what I am doing incorrectly or what
problems there might be within the predict() function? I should mention that I
tried the same program using predict() directly and obtained the same
problematic results.

Thanks and take care, 

Joe 


Joe Rausch, M.A. 
Psychology Liaison 
Lab for Social Research 
917 Flanner Hall 
University of Notre Dame 
Notre Dame, IN 46556
(574) 631-3910

"If we knew what it was we were doing, it would not be called research, would
it?"
- Albert Einstein
#
jrausch at nd.edu wrote:

Well, you haven't specified the "data" argument, but given the two 
variables directly. Exactly those variables will be used in the 
predict() step below! If you want the predict() step to work, use 
something like:

   train <- data.frame(class = train.class.var,
                       predictors = predictors.train)
   log.reg <- glm(class ~ ., data = train,
                  family=binomial(link="logit"))
Then, for the prediction, use:

   test <- data.frame(predictors = predictors.test)
   predict(log.reg, newdata = test, type="response")
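For completeness, here is a runnable sketch of this fix, reusing the
simulation setup from the original post (the object names are illustrative):

```r
## Fit via a data frame so that predict() can find the variables in newdata.
library(MASS)
set.seed(545345)

Sigma <- matrix(c(1, .5, .5, 1), 2, 2)
predictors.train <- rbind(mvrnorm(20, mu = c(1, 1), Sigma = Sigma),
                          mvrnorm(20, mu = c(0, 0), Sigma = Sigma))
predictors.test  <- rbind(mvrnorm(25, mu = c(1, 1), Sigma = Sigma),
                          mvrnorm(25, mu = c(0, 0), Sigma = Sigma))
train.class.var <- as.factor(c(rep(1, 20), rep(2, 20)))

## data.frame() splits the matrix into columns predictors.1 and
## predictors.2, which the "." in the formula picks up by name.
train <- data.frame(class = train.class.var, predictors = predictors.train)
log.reg <- glm(class ~ ., data = train, family = binomial(link = "logit"))

test <- data.frame(predictors = predictors.test)
pred <- predict(log.reg, newdata = test, type = "response")
length(pred)  # 50: one probability per row of the test data
```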


Note also: please call the generic predict() rather than its glm method.


Uwe Ligges
#
Dear Uwe,

Unless I've somehow messed this up, as I mentioned yesterday, what you
suggest doesn't seem to work when the predictor is a matrix. Here's a
simplified example:
           1            2            3            4            5            6
  1.81224443  -5.92955128   1.98718051 -10.05331521   2.65065555  -2.50635812
           7            8            9           10           11           12
  5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372   1.80400271
          13           14           15           16           17           18
  9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186   4.90038016

 . . .

          97           98           99          100
 -6.92509812   0.59357486  -1.17205723   0.04209578


Note that there are 100 rather than 10 predicted values.

But with individual predictors (rather than a matrix),

         1          2          3          4          5          6          7
 6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 -3.1738096 -2.8261585
         8          9         10
-1.5255329 -4.7087592  4.0619290

works as expected (?).

Regards,
 John
#
John Fox wrote:

Dear John,

the questioner had a 2-column matrix with 40 observations and one with
50 observations (not a 100-column matrix with 2 observations), and for
those matrices it works ...

Best,
Uwe
#
Dear Uwe,

Indeed, and in my example the matrix predictor X has 2 columns and 100 rows;
I did screw up the matrix for the "new" data to be used for predictions (in
the example I sent today but not yesterday), but even when this is done
right -- where the new data has 10 rows and 2 columns -- there are 100 (not
10) predicted values:
           1            2            3            4            5            6
  5.75238091   0.31874587  -3.00515893  -3.77282121  -1.97511221   0.54712914
           7            8            9           10           11           12
  1.85091226   4.38465524  -0.41028694  -1.53942869   0.57613555  -1.82761518

 . . .

          91           92           93           94           95           96
  0.36210780   1.71358713  -9.63612775  -4.54257576  -5.29740468   2.64363405
          97           98           99          100
 -4.45478627  -2.44973209   2.51587537  -4.09584837

Actually, I now see the source of the problem:

The data frames dat and new don't contain a matrix named "X"; rather, the
matrix is split columnwise:
[1] "y"   "X.1" "X.2"
[1] "X.1" "X.2"

Consequently, both glm and predict pick up the X in the global environment
(since there is none in dat or new), which accounts for why there are 100
predicted values.

Using list() rather than data.frame() produces the originally expected
behaviour:
        1          2          3          4          5          6          7
5.9373064  0.3687360 -8.3793045  0.7645584 -2.6773842  2.4130547  0.7387318
        8          9         10
-0.4347916  8.4678728 -0.8976054
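The column-splitting John describes is easy to verify; the sketch below uses
lm() and made-up data for brevity, but the same applies to glm():

```r
## data.frame() splits a matrix predictor into separate columns,
## so no variable named "X" exists inside the frame ...
set.seed(1)
X <- matrix(rnorm(20), 10, 2)
y <- rnorm(10)

dat <- data.frame(y = y, X = X)
names(dat)  # "y" "X.1" "X.2" -- the matrix X is gone

## ... whereas list() keeps X intact as a single matrix, so y ~ X
## finds it in the data argument rather than the global environment:
fit <- lm(y ~ X, data = list(y = y, X = X))

newX <- matrix(rnorm(6), 3, 2)
length(predict(fit, newdata = list(X = newX)))  # 3, as expected
```

(Wrapping the matrix with I(), as in data.frame(y = y, X = I(X)), also
preserves it as a single column.)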

Regards,
 John
#
John Fox wrote:

John,

note that I used glm(y ~ .) (the dot!),
because the names are automatically chosen to be X.1 and X.2, hence you 
cannot use "X" in the formula in this case ...

Best,
Uwe
#
I have a follow up question that fits with this thread.

Can you force an overlaid plot showing predicted values to follow the 
scaling of the axes of the plot over which it is laid?

Here is an example based on linear regression, just for clarity.  I have 
followed the procedure described below to create predictions and now 
want to plot the predicted values "on top" of a small section of the x-y 
scatterplot.

x <- rnorm(100, 10, 10)
e <- rnorm(100, 0, 5)
y <- 5 + 10 *x + e

myReg1 <- lm(y ~ x)
plot(x, y)
newX <- seq(1, 10, 1)
myPred <- predict(myReg1, data.frame(x = newX))

Now, if I do this, I get 2 graphs "overlaid" but their axes do not "line 
up".

par(new = TRUE)
plot(newX, myPred)

The problem is that the second plot uses the whole width of the graph
space, when I'd rather have it cover only the small range where its x
is defined, from 1 to 10. It's stretching the range (1, 10) for newX to
use the same scale that goes from about (-15, 35), where it plots x.

I know abline() can do this for lm, but for some other kinds of models
no lines() method is provided, and so I am doing this the old-fashioned
way.

pj
John Fox wrote:

#
On Thu, 2004-09-23 at 12:02, Paul Johnson wrote:
Paul,

Instead of using plot() for the second set of points, use points():

x <- rnorm(100, 10, 10)
e <- rnorm(100, 0, 5)
y <- 5 + 10 * x + e

myReg1 <- lm (y ~ x)
plot(x, y)

newX <- seq(1, 10, 1)
myPred <- predict(myReg1, data.frame(x = newX))

points(newX, myPred, pch = 19)


This will preserve the axis scaling. If you use plot() without
explicitly indicating xlim and ylim, it will automatically scale the
axes based upon your new data, even if you indicated that the underlying
plot should not be cleared.

Alternatively, you could also use the lines() function, which will draw
point to point lines:

lines(newX, myPred, col = "red")

If you want fitted lines and prediction/confidence intervals, you could
use a function like matlines(), presuming that a predict method exists
for the model type you want to use.
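As a sketch of the matlines() route for a plain linear model (the interval
limits come from predict(..., interval = "confidence")):

```r
## Overlay the fit plus confidence bands on an existing scatterplot.
set.seed(1)
x <- rnorm(100, 10, 10)
y <- 5 + 10 * x + rnorm(100, 0, 5)
fit <- lm(y ~ x)

plot(x, y)
newX <- seq(1, 10, 0.5)
ci <- predict(fit, data.frame(x = newX), interval = "confidence")

## ci is a matrix with columns fit, lwr, upr; matlines() draws one
## line per column on the current axes, without rescaling them.
matlines(newX, ci, lty = c(1, 2, 2), col = c("red", "blue", "blue"))
```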

There is an example of using this in "R Help Desk" in R News Vol 3
Number 2 (October 2003), in the first example, with a standard linear
regression model.

HTH,

Marc Schwartz
#
Dear Uwe,
. . .
OK -- I see. I did notice that you used . in the formula but didn't make the
proper connection!

Thanks,
 John
#
John Fox wrote:

Sorry, my first reply was too short and imprecise.
Thanks for helping to clarify things.

Uwe