summary vs anova
3 messages · Brent Pedersen, David Winsemius, Peter Dalgaard
On Dec 19, 2011, at 9:09 AM, Brent Pedersen wrote:
Hi, I'm sure this is simple, but I haven't been able to find this in TFM, say I have some data in R like this (pasted here: http://pastebin.com/raw.php?i=sjS9Zkup):
One of the reasons this is not in TFM is that the answers to these questions should be available in any textbook for a first course on regression.
head(df)
  gender age smokes disease    Y
1 female  65   ever control 0.18
2 female  77  never control 0.12
3   male  40         state1 0.11
4 female  67   ever control 0.20
5   male  63   ever  state1 0.16
6 female  26  never  state1 0.13
where unique(disease) == c("control", "state1", "state2")
and unique(smokes) == c("ever", "never", "", "current")
I then fit a linear model like:
model = lm(Y ~ smokes + disease + age + gender, data=df)
And I want to understand the difference between:
print(summary(model))
Call:
lm(formula = Y ~ smokes + disease + age + gender, data = df)
Residuals:
     Min       1Q   Median       3Q      Max
-0.22311 -0.08108 -0.03483  0.05604  0.46507
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.1206825  0.0521368   2.315   0.0211 *
smokescurrent  0.0150641  0.0444466   0.339   0.7348
smokesever     0.0498764  0.0326254   1.529   0.1271
smokesnever    0.0394109  0.0349142   1.129   0.2597
diseasestate1  0.0018739  0.0176817   0.106   0.9157
diseasestate2 -0.0009858  0.0178651  -0.055   0.9560
age            0.0002841  0.0006290   0.452   0.6518
gendermale     0.1164889  0.0128748   9.048   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1257 on 397 degrees of freedom
Multiple R-squared: 0.1933, Adjusted R-squared: 0.1791
F-statistic: 13.59 on 7 and 397 DF, p-value: 8.975e-16
and:
anova(model)
Analysis of Variance Table
Response: Y
           Df Sum Sq Mean Sq F value  Pr(>F)
smokes      3 0.1536 0.05120  3.2397 0.02215 *
disease     2 0.0129 0.00647  0.4096 0.66420
age         1 0.0431 0.04310  2.7270 0.09946 .
gender      1 1.2937 1.29373 81.8634 < 2e-16 ***
Residuals 397 6.2740 0.01580
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I understand (hopefully correctly) that anova() tests by adding each covariate to the model in the order it is specified in the formula.
More specific questions are:
All of which are general statistics questions, which you are asked to post in forums or lists that expect such questions ... and not to r-help.
1) How do the p-values for smokes* in summary(model) relate to the Pr(>F) for smokes in anova
2) what do the p-values for each of those smokes* mean exactly?
3) the summary above shows the values for diseasestate1 and diseasestate2 how can I get the p-value for diseasecontrol? (or, e.g. genderfemale)
^^^^^^^^^^^^^^^^^^^^^^^
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
------------------- David Winsemius, MD West Hartford, CT
On Dec 19, 2011, at 15:09 , Brent Pedersen wrote:
Hi, I'm sure this is simple, but I haven't been able to find this in TFM,
It's not _that_ simple. You likely need TFtextbook rather than TFM. Most (but not all) will go into at least some detail of coding categorical variables using dummy variables.
[snip] I understand (hopefully correctly) that anova() tests by adding each covariate to the model in order it is specified in the formula.
Yes. Note, however, that categorical variables cause more than one dummy covariate to be added.
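To make the sequential behaviour concrete, here is a minimal sketch. It uses simulated stand-in data (the variable names mimic the post, but the values are made up and are not the pastebin data). It compares the rows of anova() on one fit with an explicit chain of nested models; the sums of squares and F values should agree, and each three-level factor should show up with 2 df because it contributes two dummy columns.

```r
# Simulated stand-in data (hypothetical; NOT the poster's pastebin data).
set.seed(1)
n <- 200
d <- data.frame(
  smokes  = factor(sample(c("current", "ever", "never"), n, replace = TRUE)),
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE)),
  age     = round(runif(n, 20, 80))
)
d$Y <- 0.1 + 0.002 * d$age + rnorm(n, sd = 0.1)

# Sequential (Type I) table from a single fit.
full    <- lm(Y ~ smokes + disease + age, data = d)
seq_tab <- anova(full)

# The same tests built by hand: add one term at a time.
m0 <- lm(Y ~ 1, data = d)
m1 <- lm(Y ~ smokes, data = d)
m2 <- lm(Y ~ smokes + disease, data = d)
nested_tab <- anova(m0, m1, m2, full)

print(seq_tab)
print(nested_tab)

# Each 3-level factor adds (3 - 1) = 2 dummy columns, hence Df = 2.
seq_tab[c("smokes", "disease", "age"), "Df"]
```

Reordering the terms in the formula changes which nested models are compared, so the rows above the last one will in general change too.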
More specific questions are: 1) How do the p-values for smokes* in summary(model) relate to the Pr(>F) for smokes in anova
If the last Pr(>F) corresponds to a single-df term, then F=t^2 for that term (only), and the p value is the same. If the last Pr(>F) is for a k-df term, it corresponds to simultaneously testing that the corresponding k regression coefficients are _all_ zero; the joint p value cannot in general be calculated from tests on individual coefficients. However, they at least test related hypotheses.
The p values higher up the list in anova() test hypotheses in models obtained after removal of subsequent factors, so they are not in general comparable to the t tests in summary(). If you use drop1(...., test="F") instead of anova(), then you avoid the sequential aspect and all 1-df tests correspond to t-tests in the summary table.
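These relationships can be checked numerically. The sketch below again uses simulated data (hypothetical values, with names borrowed from the post); it is only an illustration, not a reanalysis of the original data.

```r
# Simulated stand-in data (hypothetical; NOT the poster's pastebin data).
set.seed(2)
n <- 200
d <- data.frame(
  smokes = factor(sample(c("current", "ever", "never"), n, replace = TRUE)),
  age    = round(runif(n, 20, 80)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
d$Y <- 0.1 + 0.1 * (d$gender == "male") + rnorm(n, sd = 0.1)

fit   <- lm(Y ~ smokes + age + gender, data = d)
coefs <- coef(summary(fit))   # matrix with "t value" and "Pr(>|t|)" columns

# gender is the last term and has 1 df: its sequential F equals t^2.
a <- anova(fit)
c(F = a["gender", "F value"], t_squared = coefs["gendermale", "t value"]^2)

# age is not last, so its sequential F generally differs from t^2 ...
c(F_seq = a["age", "F value"], t_squared = coefs["age", "t value"]^2)

# ... whereas drop1() tests each term given all the others.
d1 <- drop1(fit, test = "F")
c(F_drop1 = d1["age", "F value"], t_squared = coefs["age", "t value"]^2)

# The 2-df row for smokes is a joint test of both smokes coefficients;
# it has no single counterpart among the t tests.
d1["smokes", c("Df", "F value", "Pr(>F)")]
```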
2) what do the p-values for each of those smokes* mean exactly?
In the default parametrization, they correspond to comparisons between the stated level and the reference (first) level of the factor. In different contrast parametrizations, the interpretation will differ; the only complete advice is that you need to understand the relation between the factor levels and the rows of the design matrix.
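One way to see the relation between factor levels and design-matrix rows is to print the design matrix for a small factor. The sketch below uses a made-up vector with the level names from the post. Note that in the posted output the reference level of smokes appears to be the empty string "", since it sorts first and has no coefficient of its own; the example reproduces that.

```r
# Hypothetical factor with the same level names as in the thread,
# including the empty string "" that sorts first.
smokes <- factor(c("", "current", "ever", "never", "ever", "never", "current"))
levels(smokes)   # first level ("") is the reference under the default coding

# Default treatment contrasts: one 0/1 dummy column per non-reference level,
# so each coefficient is "this level minus the reference level".
X <- model.matrix(~ smokes)
X

# Sum-to-zero contrasts: same fitted values, but the coefficients are now
# deviations from the average of the level means, so their p-values
# answer different questions.
X_sum <- model.matrix(~ smokes, contrasts.arg = list(smokes = "contr.sum"))
X_sum
```

If the empty string really encodes missing values, recoding it to NA before fitting would change the reference level to "current"; whether that is appropriate depends on the data.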
3) the summary above shows the values for diseasestate1 and diseasestate2 how can I get the p-value for diseasecontrol? (or, e.g. genderfemale)
You can't. It would correspond to a comparison of that level with itself.
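What can be done instead is to change which level serves as the reference, so that control is compared with another level. A minimal sketch with simulated data (hypothetical values; the column name disease1 is introduced here only for illustration):

```r
# Simulated stand-in data (hypothetical; NOT the poster's pastebin data).
set.seed(3)
n <- 150
d <- data.frame(
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE))
)
d$Y <- 0.15 + rnorm(n, sd = 0.1)

# Default coding: control is the reference, so it has no row of its own.
fit_control <- lm(Y ~ disease, data = d)
rownames(coef(summary(fit_control)))

# Make state1 the reference: control now gets a row (control vs. state1),
# which is just the state1-vs-control comparison with the sign flipped.
d$disease1 <- relevel(d$disease, ref = "state1")
fit_state1 <- lm(Y ~ disease1, data = d)
coef(summary(fit_state1))

# In this one-factor model the intercept of fit_control is the control-group
# mean; its p-value tests "mean = 0", not a comparison between groups.
coef(summary(fit_control))["(Intercept)", ]
```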
thanks.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Peter Dalgaard, Professor Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com