
summary vs anova

3 messages · Brent Pedersen, David Winsemius, Peter Dalgaard

#
Hi, I'm sure this is simple, but I haven't been able to find it in TFM.
Say I have some data in R like this (pasted here:
http://pastebin.com/raw.php?i=sjS9Zkup):

  > head(df)
    gender age smokes disease    Y
  1 female  65   ever control 0.18
  2 female  77  never control 0.12
  3   male  40         state1 0.11
  4 female  67   ever control 0.20
  5   male  63   ever  state1 0.16
  6 female  26  never  state1 0.13

where unique(disease) == c("control", "state1", "state2")
and unique(smokes) == c("ever", "never", "", "current")

I then fit a linear model like:

    > model = lm(Y ~ smokes + disease + age + gender, data=df)

And I want to understand the difference between:

    > print(summary(model))
    Call:
    lm(formula = Y ~ smokes + disease + age + gender, data = df)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.22311 -0.08108 -0.03483  0.05604  0.46507

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)    0.1206825  0.0521368   2.315   0.0211 *
    smokescurrent  0.0150641  0.0444466   0.339   0.7348
    smokesever     0.0498764  0.0326254   1.529   0.1271
    smokesnever    0.0394109  0.0349142   1.129   0.2597
    diseasestate1  0.0018739  0.0176817   0.106   0.9157
    diseasestate2 -0.0009858  0.0178651  -0.055   0.9560
    age            0.0002841  0.0006290   0.452   0.6518
    gendermale     0.1164889  0.0128748   9.048   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.1257 on 397 degrees of freedom
    Multiple R-squared: 0.1933, Adjusted R-squared: 0.1791
    F-statistic: 13.59 on 7 and 397 DF,  p-value: 8.975e-16


and:

  > anova(model)
  Analysis of Variance Table

  Response: Y
             Df Sum Sq Mean Sq F value  Pr(>F)
  smokes      3 0.1536 0.05120  3.2397 0.02215 *
  disease     2 0.0129 0.00647  0.4096 0.66420
  age         1 0.0431 0.04310  2.7270 0.09946 .
  gender      1 1.2937 1.29373 81.8634 < 2e-16 ***
  Residuals 397 6.2740 0.01580
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I understand (hopefully correctly) that anova() tests by adding each covariate
to the model in the order it is specified in the formula.

More specific questions are:

1) How do the p-values for smokes* in summary(model) relate to the
   Pr(>F) for smokes in anova()?
2) What do the p-values for each of those smokes* mean exactly?
3) The summary above shows the values for diseasestate1 and diseasestate2;
   how can I get the p-value for diseasecontrol (or, e.g., genderfemale)?

thanks.
#
On Dec 19, 2011, at 9:09 AM, Brent Pedersen wrote:

            
One of the reasons this is not in TFM is that the answers to these
questions should be available in any first-course regression textbook.
All of them are general statistics questions, which you are asked to
post in forums or lists that expect such questions ... and not to
r-help.
-------------------

David Winsemius, MD
West Hartford, CT
#
On Dec 19, 2011, at 15:09 , Brent Pedersen wrote:

            
It's not _that_ simple. You likely need TFtextbook rather than TFM. Most (but not all) textbooks go into at least some detail on coding categorical variables as dummy variables.

Yes, anova() adds the terms sequentially, in formula order. Note, however, that a categorical variable adds more than one dummy covariate (one for each level except the reference level).
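
To see how one factor turns into several dummy covariates, it can help to inspect the design matrix directly. The sketch below uses a small made-up `smokes` vector (an assumption for illustration, not the poster's data) with the same four levels, including the empty string.

```r
# Toy factor with the same levels as in the question (made-up values)
smokes <- factor(c("", "current", "ever", "never", "ever", "never"))

# Levels are sorted, so the empty string "" comes first and is the reference
levels(smokes)

# Design matrix: an intercept plus one 0/1 column per non-reference level
X <- model.matrix(~ smokes)
colnames(X)
X
```

The three non-intercept columns correspond to the three smokes* rows in the summary() table, which is why anova() reports 3 df for smokes.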
If the last Pr(>F) corresponds to a single-df term, then F = t^2 for that term (only), and the p value is the same. In your output gender is last and has 1 df: t = 9.048 and F = 81.8634, which is 9.048^2 up to rounding. If the last Pr(>F) is for a k-df term, it corresponds to simultaneously testing that the corresponding k regression coefficients are _all_ zero; the joint p value cannot in general be calculated from tests on individual coefficients. However, they at least test related hypotheses.
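
A minimal sketch of both cases, using simulated data (the data frame `d`, its effect sizes, and the seed are assumptions for illustration only):

```r
# Simulated stand-in data; not the poster's data set
set.seed(1)
n <- 200
d <- data.frame(
  smokes = factor(sample(c("current", "ever", "never"), n, replace = TRUE)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = round(runif(n, 20, 80))
)
d$Y <- 0.1 + 0.1 * (d$gender == "male") + rnorm(n, sd = 0.12)

# Case 1: the last term (gender) has 1 df, so its F equals the squared t
fit      <- lm(Y ~ smokes + age + gender, data = d)
t_gender <- coef(summary(fit))["gendermale", "t value"]
F_gender <- anova(fit)["gender", "F value"]
all.equal(t_gender^2, F_gender)

# Case 2: a k-df term (smokes, 2 df here) listed last is a joint test,
# identical to comparing the models with and without that term
fit2     <- lm(Y ~ age + gender + smokes, data = d)
reduced  <- lm(Y ~ age + gender, data = d)
F_seq    <- anova(fit2)["smokes", "F value"]
F_nested <- anova(reduced, fit2)$F[2]
all.equal(F_seq, F_nested)
```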

P values higher up the list in anova() test hypotheses in the models obtained after removing all subsequent terms, so they are not in general comparable to the t tests in summary().
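
The sequential nature can be checked numerically. In this sketch (simulated data; all names and values are assumptions), the sum of squares that anova() reports for the first term equals the drop in residual sum of squares between the intercept-only model and the model containing only that term.

```r
# Simulated stand-in data; not the poster's data set
set.seed(42)
n <- 120
d <- data.frame(
  smokes = factor(sample(c("current", "ever", "never"), n, replace = TRUE)),
  age    = round(runif(n, 20, 80))
)
d$Y <- 0.1 + 0.001 * d$age + rnorm(n, sd = 0.1)

fit <- lm(Y ~ smokes + age, data = d)
m0  <- lm(Y ~ 1, data = d)        # nothing but the intercept
m1  <- lm(Y ~ smokes, data = d)   # smokes only, age not yet added

# Sequential SS for smokes ignores age entirely
ss_seq    <- anova(fit)["smokes", "Sum Sq"]
ss_nested <- deviance(m0) - deviance(m1)   # deviance() is the RSS for lm
all.equal(ss_seq, ss_nested)

# With the terms reordered, smokes is tested after adjusting for age
anova(lm(Y ~ age + smokes, data = d))
```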

If you use drop1(...., test="F") instead of anova(), then you avoid the sequential aspect and all 1-df tests correspond to t-tests in the summary table.
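
As a rough illustration of the drop1() approach on simulated data (the data frame `d` and its values are assumptions, not the poster's data), each 1-df row of the drop1() table reproduces the matching t test from summary():

```r
# Simulated stand-in data; not the poster's data set
set.seed(7)
n <- 150
d <- data.frame(
  smokes = factor(sample(c("current", "ever", "never"), n, replace = TRUE)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = round(runif(n, 20, 80))
)
d$Y <- 0.1 + 0.1 * (d$gender == "male") + rnorm(n, sd = 0.12)

fit <- lm(Y ~ smokes + age + gender, data = d)
d1  <- drop1(fit, test = "F")   # each term tested given all the others
cf  <- coef(summary(fit))

# 1-df terms: F equals t^2 and the p values agree
all.equal(d1["age", "F value"],    cf["age", "t value"]^2)
all.equal(d1["age", "Pr(>F)"],     cf["age", "Pr(>|t|)"])
all.equal(d1["gender", "F value"], cf["gendermale", "t value"]^2)

# The 2-df smokes row is a joint test with no single matching t test
d1["smokes", ]
```

Unlike anova(), the drop1() table does not change when the terms in the formula are reordered.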
In the default parametrization (treatment contrasts), each of those p values corresponds to a comparison between the stated level and the reference (first) level of the factor; for smokes in your output, the reference is the empty-string level "". In different contrast parametrizations, the interpretation will differ; the only complete advice is that you need to understand the relation between the factor levels and the rows of the design matrix.
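You can't get a p value for diseasecontrol or genderfemale: each is the reference level of its factor, so the test would correspond to a comparison of that level with itself.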
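
What can be changed is which level serves as the reference. The sketch below (simulated data; `d`, `gender2`, and the effect sizes are hypothetical) uses relevel() to make "male" the reference, so the table shows a genderfemale-style row instead. It is the same comparison with the sign flipped, not a new test.

```r
# Simulated stand-in data; not the poster's data set
set.seed(3)
n <- 100
d <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = round(runif(n, 20, 80))
)
d$Y <- 0.1 + 0.1 * (d$gender == "male") + rnorm(n, sd = 0.12)

fit <- lm(Y ~ age + gender, data = d)           # reference level: female

d$gender2 <- relevel(d$gender, ref = "male")    # reference level: male
fit2 <- lm(Y ~ age + gender2, data = d)

male_row   <- coef(summary(fit))["gendermale", ]
female_row <- coef(summary(fit2))["gender2female", ]

male_row     # male compared with female
female_row   # female compared with male: negated estimate, same p value
```

For a factor with more than two levels, such as disease, releveling in the same way yields comparisons of the other levels against the newly chosen reference.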