drop1() seems to give unexpected results compare to anova()
Thomas P C Chu wrote:
Dear all,
I have been trying to investigate the behaviour of different weights
in weighted regression for a dataset with lots of missing data. As a
start I simulated some data using the following:
library(MASS)
N <- 200
sigma <- matrix(c(1, .5, .5, 1), nrow = 2)
sim.set <- as.data.frame(mvrnorm(N, c(0, 0), sigma))
colnames(sim.set) <- c('x1', 'x2') # x1 & x2 are correlated
sim.set$x3 <- rnorm(N, 0, 1) # x3 is independent
sim.set$x4 <- rnorm(N, 0, 1) # x4 is red herring
sim.set$y = 4 * sim.set$x1 + 5 * sim.set$x2 + 6 * sim.set$x3 # y is
outcome
This is your problem: There is no noise term in there. As a result, F and t tests are basically 0/0 except for errors introduced by roundoff. These errors can affect numerators and denominators differently when different numerical methods are used, and it is really a toss-up whether this results in a significant test or not.
I then checked the correlation of my simulated data and fitted a linear regression to check if y = 4 * x1 + 5 * x2 + 6 * x3 indeed. round(cor(sim.set), 2) summary(model <- lm(y ~ ., data = sim.set)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.423e-17 1.256e-16 -4.320e-01 0.666 x1 4.000e+00 1.388e-16 2.881e+16 <2e-16 *** x2 5.000e+00 1.441e-16 3.470e+16 <2e-16 *** x3 6.000e+00 1.188e-16 5.051e+16 <2e-16 *** x4 -8.150e-18 1.165e-16 -7.000e-02 0.944 anova(model) Df Sum Sq Mean Sq F value Pr(>F) x1 1 8686.1 8686.1 2.8218e+33 <2e-16 *** x2 1 5568.7 5568.7 1.8091e+33 <2e-16 *** x3 1 7852.1 7852.1 2.5509e+33 <2e-16 *** x4 1 1.507e-32 1.507e-32 4.9000e-03 0.9443 Residuals 195 6.002e-28 3.078e-30 All was well so far, as x4 was identified as not significant and its coeff was almost 0 (because I made it so in the first place). Now I expected it to be dropped in stepwise: step(model, direction = 'both', test = 'F') drop1(model, test = 'F') dropterm(model, test = 'F') Df Sum of Sq RSS AIC F value Pr(F) <none> 6.002e-28 -13585.7 x1 1 2555.1 2555.1 517.5 8.3006e+32 < 2.2e-16 *** x2 1 3707.0 3707.0 591.9 1.2043e+33 < 2.2e-16 *** x3 1 7851.9 7851.9 742.0 2.5508e+33 < 2.2e-16 *** x4 1 2.118e-27 2.718e-27 -13285.6 6.8806e+02 < 2.2e-16 *** Neither of those 3 lines of commands managed to drop x4 and its P value magically decreased from 0.94 to almost 0! I am also baffled by how R calculated those RSS. However, if I fitted the smaller model and compared it with the original one by hand, I got the expected answer: summary(model2 <- lm(y ~ x1 + x2 + x3, data = sim.set)) anova(model, model2, test = 'F') Model 1: y ~ x1 + x2 + x3 + x4 Model 2: y ~ x1 + x2 + x3 Res.Df RSS Df Sum of Sq F Pr(>F) 1 195 6.0025e-28 2 196 6.0026e-28 -1 -2.0000e-32 0.0049 0.9443 Interestingly, if I had started from a null model I ended up with y = 4 * x1 + 5 * x2 + 6 * x3 and R did not add x4 into the model as expected. summary(model1 <- lm(y ~ 1, data = sim.set)) step(model1, direction = 'both', scope = .~. + x1 + x2 + x3 + x4, test = 'F') add1(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F') addterm(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F') Df Sum of Sq RSS AIC F value Pr(F) <none> 22107.0 943.1 x1 1 8686.1 13420.8 845.2 128.1478 <2e-16 *** x2 1 11658.7 10448.3 795.2 220.9377 <2e-16 *** x3 1 11045.4 11061.6 806.6 197.7096 <2e-16 *** x4 1 13.4 22093.6 944.9 0.1199 0.7295 I'm not sure what is going on. I am running R 2.7.1 on Ubuntu Linux, with all components up to date. Thank you in advance for all your thoughts and replies. Yours sincerely, Thomas P C Chu University of Leicester LE1 7RH UK
________________________________________________________________________ AOL Email goes Mobile! You can now read your AOL Emails whilst on the move. Sign up for a free AOL Email account with unlimited storage today. ************************************** See what's new at http://www.aol.com ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
O__ ---- Peter Dalgaard ?ster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907