Question about variable selection

6 messages · Liaw, Andy, Wensui Liu, John Fox +1 more

#
That depends on whether the IV could have significant interactions with
other IVs not considered in the bivariate analysis.  E.g.,
Call:
lm(formula = y ~ iv[, 1])

Residuals:
     Min       1Q   Median       3Q      Max 
-4.06259 -1.06048 -0.02377  1.05901  4.04315 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.01908    0.41482   7.278 2.09e-07 ***
iv[, 1]      0.01417    0.29332   0.048    0.962    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.074 on 23 degrees of freedom
Multiple R-Squared: 0.0001014,  Adjusted R-squared: -0.04337 
F-statistic: 0.002333 on 1 and 23 DF,  p-value: 0.9619

Call:
lm(formula = y ~ iv[, 1] * iv[, 2])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22390 -0.08894 -0.01279  0.13525  0.17608 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      3.019083   0.026330 114.665   <2e-16 ***
iv[, 1]          0.014167   0.018618   0.761    0.455    
iv[, 2]         -0.005486   0.018618  -0.295    0.771    
iv[, 1]:iv[, 2]  0.992865   0.013165  75.418   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.1316 on 21 degrees of freedom
Multiple R-Squared: 0.9963,     Adjusted R-squared: 0.9958 
F-statistic:  1896 on 3 and 21 DF,  p-value: < 2.2e-16 
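
The data behind this output weren't posted; the following sketch, assuming
two independent standard-normal predictors and a small error term, reproduces
the same pattern (iv[, 1] looks useless on its own but is essential once its
interaction with iv[, 2] enters):

set.seed(1)
iv <- matrix(rnorm(50), ncol = 2)                   # two independent predictors
y  <- 3 + iv[, 1] * iv[, 2] + rnorm(25, sd = 0.13)  # y driven only by the product
summary(lm(y ~ iv[, 1]))                            # marginal model: slope near 0
summary(lm(y ~ iv[, 1] * iv[, 2]))                  # interaction term dominates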




Andy

From: Wensui Liu
#
Dear Wensui and Andy,

When the explanatory variables are correlated it's perfectly possible for
the marginal relationship between an X and Y to be zero and a partial
relationship nonzero (even in the absence of interactions) -- this is simply
a reflection of the more general point that partial and marginal
relationships can differ.

Regards,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
--------------------------------
#
Dear Wensui,

I don't think that it's possible to answer these questions mechanically,
especially if you're interested in the "true" relationship between the
response and a set of explanatory variables. If, however, you have a pure
prediction problem, then variable selection is a more reasonable approach,
as long as it's done carefully (in my opinion). 

I don't see how resampling and repeatedly examining the marginal
relationship between Y and an X is relevant to the question of whether there
is a partial relationship in the absence of a marginal relationship. (This
is close to what Wittgenstein once called buying two copies of the same
newspaper to see whether what was said in the first one is true.) After all,
as I said (and as you understand), partial and marginal relationships can
differ -- so evidence about the marginal relationship is not necessarily
relevant to inference about the partial relationship. (As well,
bootstrapping a linear least-squares regression likely isn't going to give
you much additional information anyway.)
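
To put numbers on that point, a small made-up sketch (not from the original
exchange): bootstrapping the marginal slope of Y on X1 gives an interval that
straddles zero, while the partial coefficient of X1, adjusting for a
correlated X2, is clearly nonzero, so the resampled marginal evidence says
nothing about the partial relationship.

set.seed(2)
n  <- 500
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)                # X1 correlated with X2
y  <- -0.3 * x1 + 0.75 * x2 + rnorm(n)   # chosen so cov(Y, X1) = 0 in the population

boot.slope <- replicate(2000, {
  i <- sample(n, replace = TRUE)
  coef(lm(y[i] ~ x1[i]))[2]              # marginal slope in each resample
})
quantile(boot.slope, c(0.025, 0.975))    # interval straddles zero
summary(lm(y ~ x1 + x2))                 # partial slope of x1 is clearly nonzero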

Regards,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
--------------------------------
#
Dear Wensui,

What you are asking about is called in psychology a "suppressor"
variable: a predictor variable unrelated to the criterion but
correlated with the other predictors (X1 in the example below).
Although it has a zero marginal relationship with the DV, it does "really"
help to predict the DV by removing extraneous variance from the other
IVs. (I am not going to touch the Wittgenstein issue of truth here.)
Should it be included in the predictor set? Yes. Is there any easy
way to find all possible suppressors? No.


Consider the following:

#demonstration of "suppressor effects"
library(mvtnorm)
# population correlations: cor(X1, X2) = .5, cor(X2, Y) = .5, cor(X1, Y) = 0
sigma <- matrix(c(1, .5, 0,
                  .5, 1, .5,
                  0, .5, 1), ncol = 3)
my.data <- data.frame(rmvnorm(1000, sigma = sigma))
names(my.data) <- c("X1", "X2", "Y")
round(cor(my.data), 2)
summary(lm(Y ~ X1 + X2, data = my.data))

which produces
       X1   X2     Y
X1  1.00 0.45 -0.04
X2  0.45 1.00  0.51
Y  -0.04 0.51  1.00

Call:
lm(formula = Y ~ X1 + X2, data = my.data)

Residuals:
      Min       1Q   Median       3Q      Max
-2.09350 -0.58069  0.02280  0.53436  3.02017

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.02807    0.02557   1.098    0.273   
X1          -0.32849    0.02813 -11.680   <2e-16 ***
X2           0.65666    0.02861  22.951   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8081 on 997 degrees of freedom
Multiple R-Squared: 0.3465,	Adjusted R-squared: 0.3452
F-statistic: 264.4 on 2 and 997 DF,  p-value: < 2.2e-16
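
A check not shown above makes the suppression visible from the other side:
regressing Y on X1 alone gives a slope near zero (consistent with the -0.04
correlation), even though its partial coefficient is -0.33 and highly
significant.

summary(lm(Y ~ X1, data = my.data))      # marginal slope of X1: near zero, n.s.
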
At 3:22 PM -0500 2/18/06, John Fox wrote:
.... (discussion of interaction from Andy Liaw)