glm and stepAIC selects too many effects
On 06/06/17 18:08, Marc Girondot via R-help wrote:
This is a question at the border between stats and R.
When I fit a glm with many potential effects and select a model using
stepAIC, many independent variables are selected even if there is no
relationship between the dependent variable and the effects (all are random
numbers).
Does anyone have a solution to prevent this effect? Is it related to the
Bonferroni correction?
Is there a ratio of independent variables to number of observations that is
safe for stepAIC?
Thanks
Marc
Example code: when 2 independent variables are included, no effect is
selected; when 11 are included, 7 to 8 are selected.
library(MASS)  # provides stepAIC

# Response and 11 candidate predictors, all independent random numbers
x <- rnorm(15, 15, 2)
A <- rnorm(15, 20, 5)
B <- rnorm(15, 20, 5)
C <- rnorm(15, 20, 5)
D <- rnorm(15, 20, 5)
E <- rnorm(15, 20, 5)
F <- rnorm(15, 20, 5)
G <- rnorm(15, 20, 5)
H <- rnorm(15, 20, 5)
I <- rnorm(15, 20, 5)
J <- rnorm(15, 20, 5)
K <- rnorm(15, 20, 5)
df <- data.frame(x=x, A=A, B=B, C=C, D=D,
                 E=E, F=F, G=G, H=H, I=I, J=J,
                 K=K)

# Model with 2 candidate predictors
G1 <- glm(formula = x ~ A + B,
          data=df, family = gaussian(link = "identity"))
g1 <- stepAIC(G1)
summary(g1)

# Model with 11 candidate predictors
G2 <- glm(formula = x ~ A + B + C + D + E + F + G + H + I + J + K,
          data=df, family = gaussian(link = "identity"))
g2 <- stepAIC(G2)
summary(g2)
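Because a single run depends on the random draw, the example above can be repeated many times to see how many pure-noise predictors stepAIC retains on average. The following is only a rough sketch, not part of the original post: the number of repetitions (200), the seed, and the names n_obs, n_sim, n_pred, sim and kept are illustrative assumptions.

```r
library(MASS)  # provides stepAIC

set.seed(1)          # arbitrary seed, only for reproducibility
n_obs  <- 15         # observations per data set, as in the example
n_sim  <- 200        # number of simulated data sets (assumed value)
n_pred <- c(2, 11)   # numbers of pure-noise predictors to compare

# kept[i, j] = number of predictors retained in simulation i with n_pred[j] candidates
kept <- matrix(NA_integer_, nrow = n_sim, ncol = length(n_pred),
               dimnames = list(NULL, paste0("p", n_pred)))

for (j in seq_along(n_pred)) {
  for (i in seq_len(n_sim)) {
    # Predictors V1..Vp and response x are all independent random numbers
    sim <- as.data.frame(matrix(rnorm(n_obs * n_pred[j], 20, 5), nrow = n_obs))
    sim$x <- rnorm(n_obs, 15, 2)
    full <- glm(x ~ ., data = sim, family = gaussian(link = "identity"))
    best <- stepAIC(full, trace = FALSE)
    kept[i, j] <- length(coef(best)) - 1L  # do not count the intercept
  }
}

colMeans(kept)  # average number of noise predictors retained
```

Comparing the two columns of kept shows how the number of spuriously selected variables changes with the number of candidates when only 15 observations are available.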
IMHO there's nothing much that you can do about this. Trying to get the data to select a model is always fraught with peril.

The phenomenon that you have observed has been remarked on before; see Alan Miller's book "Subset Selection in Regression" (Chapman and Hall, 1990), page 12 (first paragraph of section 1.4). However you might find some of Miller's recommendations to be at least a *bit* useful.

cheers,

Rolf Turner
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
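On the Bonferroni question, one way to see why many noise variables survive is to look at the significance level that the AIC penalty implies. This is a back-of-the-envelope sketch, not something stated in the thread. It assumes the usual large-sample chi-squared approximation for a one-parameter likelihood-ratio test, which is crude with 15 observations, and it treats the 11 tests as independent, which stepwise selection does not guarantee. The name aic_alpha is illustrative.

```r
# With the default penalty k = 2, AIC prefers keeping one extra parameter
# whenever the likelihood-ratio statistic exceeds 2. Under a chi-squared(1)
# approximation this corresponds to a per-variable significance level of:
aic_alpha <- pchisq(2, df = 1, lower.tail = FALSE)
aic_alpha                  # about 0.157

# Rough chance that at least one of 11 pure-noise predictors clears that
# threshold, if the 11 tests were independent (an idealisation):
1 - (1 - aic_alpha)^11     # about 0.85
```

So each candidate is effectively tested at a far more lenient level than 0.05, with no adjustment for the number of candidates, which is consistent with the behaviour reported above.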