The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/
now includes files:
plot.lm.RData: Image for file for plot6.lm, a version of plot.lm in
which
David Firth's Cook's distance vs leverage/(1-leverage) plot is plot 6.
The tick labels are in units of leverage, and the contour labels are
in units of absolute values of the standardized residual.
plot6.lm.Rd file: A matching help file
Comments will be welcome.
Another issue, discussed recently on r-help, is that when the model
formula is long, the default sub.caption=deparse(x$call) is broken
into multiple text elements and overwrites. The only clean and
simple way that I can see to handle is to set a default that tests
whether the formula is broken into multiple text elements, and if it is
then omit it. Users can then use their own imaginative skills, and
such suggestions as have been made on r-help, to construct
whatever form of labeling best suits their case, their imaginative
skills and their coding skills.
John Maindonald.
On 25 Apr 2005, at 8:00 PM, David Firth wrote:
From: David Firth <d.firth@warwick.ac.uk>
Date: 24 April 2005 10:23:51 PM
To: John Maindonald <john.maindonald@anu.edu.au>
Cc: r-devel@stat.math.ethz.ch
Subject: Re: [Rd] Enhanced version of plot.lm()
On 24 Apr 2005, at 05:37, John Maindonald wrote:
I'd not like to lose the signs of the residuals. Also, as
plots 1-3 focus on residuals, there is less of a mental
leap in moving to residuals vs leverage; residuals vs
leverage/(1-leverage) would also be in the same spirit.
Yes, I know what you mean. Mental leaps are a matter of
taste...pitfalls, etc, come to mind.
Maybe, one way or another, both plots (residuals vs
a function of leverage, and the plot from Hinkley et al)
should go in. The easiest way to do this is to add a
further which=6. I will do this if the consensus is that
this is the right way to go. In any case, I'll add the
Hinkley et al reference (author of the contribution that
includes p.74?) to the draft help page.
Sorry, I should have given the full reference, which (in BibTeX format
from CIS) is
@inproceedings{Firt:gene:1991,
author = {Firth, D.},
title = {Generalized Linear Models},
year = {1991},
booktitle = {Statistical Theory and Modelling. In Honour of Sir
David Cox, FRS},
editor = {Hinkley, D. V. and Reid, N. and Snell, E. J.},
publisher = {Chapman \& Hall Ltd},
pages = {55--82},
keywords = {Analysis of deviance; Likelihood}
}
David
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
"JMd" == John Maindonald <john.maindonald@anu.edu.au>
on Tue, 26 Apr 2005 15:44:26 +1000 writes:
JMd> The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/
JMd> now includes files:
JMd> plot.lm.RData: Image for file for plot6.lm, a version of plot.lm in
JMd> which
JMd> David Firth's Cook's distance vs leverage/(1-leverage) plot is plot 6.
JMd> The tick labels are in units of leverage, and the contour labels are
JMd> in units of absolute values of the standardized residual.
JMd> plot6.lm.Rd file: A matching help file
JMd> Comments will be welcome.
Thank you John!
The *.Rd has the new references and a new example but
is not quite complete: the \usage{} has only 4 captions,
\arguments{ .. \item{which} ..} only mentions '1:5' --- but
never mind.
One of the new examples is
## Replace Cook's distance plot by Residual-Leverage plot
plot(lm.SR, which=c(1:3, 5))
and -- conceptually I'd really like to change the default from
'which = 1:4' to the above
'which=c(1:3, 5))'
This would be non-compatible though for all those that have
always used the current default 1:4.
OTOH, "MASS" or Peter Dalgaard's book don't mention plot(<lm fit> )
or at least don't show it's result.
What do others think?
How problematic would a change be in the default plots that
plot.lm() produces?
JMd> Another issue, discussed recently on r-help, is that when the model
JMd> formula is long, the default sub.caption=deparse(x$call) is broken
JMd> into multiple text elements and overwrites.
good point!
JMd> The only clean and simple way that I can see to handle
JMd> is to set a default that tests whether the formula is
JMd> broken into multiple text elements, and if it is then
JMd> omit it. Users can then use their own imaginative
JMd> skills, and such suggestions as have been made on
JMd> r-help, to construct whatever form of labeling best
JMd> suits their case, their imaginative skills and their
JMd> coding skills.
Hmm, yes, but I think we (R programmers) could try a bit harder
to provide a reasonable default, e.g., something along
cap <- deparse(x$call, width.cutoff = 500)[1]
if((nc <- nchar(cap)) > 53)
cap <- paste(substr(cap, 1, 50), "....", substr(cap, nc-2, nc))
{untested; some of the details will differ;
and the '53', '50' could depend on par("..") measures}
JMd> John Maindonald.
JMd> On 25 Apr 2005, at 8:00 PM, David Firth wrote:
>> From: David Firth <d.firth@warwick.ac.uk>
>> Date: 24 April 2005 10:23:51 PM
>> To: John Maindonald <john.maindonald@anu.edu.au>
>> Cc: r-devel@stat.math.ethz.ch
>> Subject: Re: [Rd] Enhanced version of plot.lm()
>>
>>
>> On 24 Apr 2005, at 05:37, John Maindonald wrote:
>>
>>> I'd not like to lose the signs of the residuals. Also, as
>>> plots 1-3 focus on residuals, there is less of a mental
>>> leap in moving to residuals vs leverage; residuals vs
>>> leverage/(1-leverage) would also be in the same spirit.
>>
>> Yes, I know what you mean. Mental leaps are a matter of
>> taste...pitfalls, etc, come to mind.
>>
>>>
>>> Maybe, one way or another, both plots (residuals vs
>>> a function of leverage, and the plot from Hinkley et al)
>>> should go in. The easiest way to do this is to add a
>>> further which=6. I will do this if the consensus is that
>>> this is the right way to go. In any case, I'll add the
>>> Hinkley et al reference (author of the contribution that
>>> includes p.74?) to the draft help page.
>>
>> Sorry, I should have given the full reference, which (in BibTeX format
>> from CIS) is
>>
>> @inproceedings{Firt:gene:1991,
>> author = {Firth, D.},
>> title = {Generalized Linear Models},
>> year = {1991},
>> booktitle = {Statistical Theory and Modelling. In Honour of Sir
>> David Cox, FRS},
>> editor = {Hinkley, D. V. and Reid, N. and Snell, E. J.},
>> publisher = {Chapman \& Hall Ltd},
>> pages = {55--82},
>> keywords = {Analysis of deviance; Likelihood}
>> }
>>
>> David
>>
JMd> John Maindonald email: john.maindonald@anu.edu.au
JMd> phone : +61 2 (6125)3473 fax : +61 2(6125)5549
JMd> Centre for Bioinformation Science, Room 1194,
JMd> John Dedman Mathematical Sciences Building (Building 27)
JMd> Australian National University, Canberra ACT 0200.
JMd> ______________________________________________
JMd> R-devel@stat.math.ethz.ch mailing list
JMd> https://stat.ethz.ch/mailman/listinfo/r-devel
"JMd" == John Maindonald <john.maindonald@anu.edu.au>
on Tue, 26 Apr 2005 15:44:26 +1000 writes:
JMd> The web page
JMd> http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/ now
JMd> includes files: plot.lm.RData: Image for file for
JMd> plot6.lm, a version of plot.lm in which David Firth's
JMd> Cook's distance vs leverage/(1-leverage) plot is plot
JMd> 6. The tick labels are in units of leverage, and the
JMd> contour labels are in units of absolute values of the
JMd> standardized residual.
JMd> plot6.lm.Rd file: A matching help file
JMd> Comments will be welcome.
JMd> Another issue, discussed recently on r-help, is that
JMd> when the model formula is long, the default
JMd> sub.caption=deparse(x$call) is broken into multiple
JMd> text elements and overwrites. The only clean and
JMd> simple way that I can see to handle is to set a default
JMd> that tests whether the formula is broken into multiple
JMd> text elements, and if it is then omit it. Users can
JMd> then use their own imaginative skills, and such
JMd> suggestions as have been made on r-help, to construct
JMd> whatever form of labeling best suits their case, their
JMd> imaginative skills and their coding skills.
JMd> John Maindonald.
JMd> On 25 Apr 2005, at 8:00 PM, David Firth wrote:
>> From: David Firth <d.firth@warwick.ac.uk> Date: 24 April
>> 2005 10:23:51 PM To: John Maindonald
>> <john.maindonald@anu.edu.au> Cc:
>> r-devel@stat.math.ethz.ch Subject: Re: [Rd] Enhanced
>> version of plot.lm()
>>
>>
>> On 24 Apr 2005, at 05:37, John Maindonald wrote:
>>
>>> I'd not like to lose the signs of the residuals. Also,
>>> as plots 1-3 focus on residuals, there is less of a
>>> mental leap in moving to residuals vs leverage;
>>> residuals vs leverage/(1-leverage) would also be in the
>>> same spirit.
>> Yes, I know what you mean. Mental leaps are a matter of
>> taste...pitfalls, etc, come to mind.
>>
>>> Maybe, one way or another, both plots (residuals vs a
>>> function of leverage, and the plot from Hinkley et al)
>>> should go in. The easiest way to do this is to add a
>>> further which=6. I will do this if the consensus is
>>> that this is the right way to go. In any case, I'll add
>>> the Hinkley et al reference (author of the contribution
>>> that includes p.74?) to the draft help page.
>> Sorry, I should have given the full reference, which (in
>> BibTeX format from CIS) is
>>
>> @inproceedings{Firt:gene:1991, author = {Firth, D.},
>> title = {Generalized Linear Models}, year = {1991},
>> booktitle = {Statistical Theory and Modelling. In Honour
>> of Sir David Cox, FRS}, editor = {Hinkley, D. V. and
>> Reid, N. and Snell, E. J.}, publisher = {Chapman \& Hall
>> Ltd}, pages = {55--82}, keywords = {Analysis of deviance;
>> Likelihood} }
>>
>> David
>>
JMd> John Maindonald email: john.maindonald@anu.edu.au phone
JMd> : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for
JMd> Bioinformation Science, Room 1194, John Dedman
JMd> Mathematical Sciences Building (Building 27) Australian
JMd> National University, Canberra ACT 0200.
JMd> ______________________________________________
JMd> R-devel@stat.math.ethz.ch mailing list
JMd> https://stat.ethz.ch/mailman/listinfo/r-devel
Martin Maechler <maechler@stat.math.ethz.ch> writes:
This would be non-compatible though for all those that have
always used the current default 1:4.
OTOH, "MASS" or Peter Dalgaard's book don't mention plot(<lm fit> )
or at least don't show it's result.
Ummm, check page 183...
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
Martin Maechler <maechler@stat.math.ethz.ch> writes:
This would be non-compatible though for all those that have
always used the current default 1:4.
OTOH, "MASS" or Peter Dalgaard's book don't mention plot(<lm fit> )
or at least don't show it's result.
Ummm, check page 183...
OTOH MASS does not because of S-PLUS/R differences.
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
"MM" == Martin Maechler <maechler@stat.math.ethz.ch>
on Tue, 26 Apr 2005 12:13:38 +0200 writes:
"JMd" == John Maindonald <john.maindonald@anu.edu.au>
on Tue, 26 Apr 2005 15:44:26 +1000 writes:
JMd> The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/
JMd> now includes files:
JMd> plot.lm.RData: Image for file for plot6.lm, a version of plot.lm in
JMd> which
JMd> David Firth's Cook's distance vs leverage/(1-leverage) plot is plot 6.
JMd> The tick labels are in units of leverage, and the contour labels are
JMd> in units of absolute values of the standardized residual.
JMd> plot6.lm.Rd file: A matching help file
JMd> Comments will be welcome.
MM> Thank you John!
MM> The *.Rd has the new references and a new example but
MM> is not quite complete: the \usage{} has only 4 captions,
MM> \arguments{ .. \item{which} ..} only mentions '1:5' --- but
MM> never mind.
MM> One of the new examples is
MM> ## Replace Cook's distance plot by Residual-Leverage plot
MM> plot(lm.SR, which=c(1:3, 5))
MM> and -- conceptually I'd really like to change the default from
MM> 'which = 1:4' to the above
MM> 'which=c(1:3, 5))'
MM> This would be non-compatible though for all those that have
MM> always used the current default 1:4.
MM> OTOH, "MASS" or Peter Dalgaard's book don't mention plot(<lm fit> )
MM> or at least don't show it's result.
MM> What do others think?
MM> How problematic would a change be in the default plots that
MM> plot.lm() produces?
JMd> Another issue, discussed recently on r-help, is that when the model
JMd> formula is long, the default sub.caption=deparse(x$call) is broken
JMd> into multiple text elements and overwrites.
MM> good point!
JMd> The only clean and simple way that I can see to handle
JMd> is to set a default that tests whether the formula is
JMd> broken into multiple text elements, and if it is then
JMd> omit it. Users can then use their own imaginative
JMd> skills, and such suggestions as have been made on
JMd> r-help, to construct whatever form of labeling best
JMd> suits their case, their imaginative skills and their
JMd> coding skills.
MM> Hmm, yes, but I think we (R programmers) could try a bit harder
MM> to provide a reasonable default, e.g., something along
MM> cap <- deparse(x$call, width.cutoff = 500)[1]
MM> if((nc <- nchar(cap)) > 53)
MM> cap <- paste(substr(cap, 1, 50), "....", substr(cap, nc-2, nc))
MM> {untested; some of the details will differ;
MM> and the '53', '50' could depend on par("..") measures}
In the mean time, I came to quite a nice way of doing this:
if(is.null(sub.caption)) { ## construct a default:
cal <- x$call
if (!is.na(m.f <- match("formula", names(cal)))) {
cal <- cal[c(1, m.f)]
names(cal)[2] <- "" # drop " formula = "
}
cc <- deparse(cal, 80)
nc <- nchar(cc[1])
abbr <- length(cc) > 1 || nc > 75
sub.caption <-
if(abbr) paste(substr(cc[1], 1, min(75,nc)), "...") else cc[1]
}
I'm about to commit the current proposal(s) to R-devel,
**INCLUDING** changing the default from
'which = 1:4' to 'which = c(1:3,5)
and ellicit feedback starting from there.
One thing I think I would like is to use color for the Cook's
contours in the new 4th plot.
Martin
<.............. lots deleted ..........>
Martin Maechler <maechler@stat.math.ethz.ch> writes:
I'm about to commit the current proposal(s) to R-devel,
**INCLUDING** changing the default from
'which = 1:4' to 'which = c(1:3,5)
and ellicit feedback starting from there.
One thing I think I would like is to use color for the Cook's
contours in the new 4th plot.
Hmm. First try running example(plot.lm) with the modified function and
tell me which observation has the largest Cook's D. With the suggested
new 4th plot it is very hard to tell whether obs #49 is potentially or
actually influential. Plots #1 and #3 are very close to conveying the
same information though...
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
"PD" == Peter Dalgaard <p.dalgaard@biostat.ku.dk>
on 27 Apr 2005 16:54:02 +0200 writes:
PD> Martin Maechler <maechler@stat.math.ethz.ch> writes:
>> I'm about to commit the current proposal(s) to R-devel,
>> **INCLUDING** changing the default from
>> 'which = 1:4' to 'which = c(1:3,5)
>>
>> and ellicit feedback starting from there.
>>
>> One thing I think I would like is to use color for the Cook's
>> contours in the new 4th plot.
PD> Hmm. First try running example(plot.lm) with the modified function and
PD> tell me which observation has the largest Cook's D. With the suggested
PD> new 4th plot it is very hard to tell whether obs #49 is potentially or
PD> actually influential. Plots #1 and #3 are very close to conveying the
PD> same information though...
I shouldn't be teaching here, and I know that I'm getting into fighted
territory (regression diagnostics; robustness; "The" Truth, etc,etc)
but I believe there is no unique way to define "actually influential"
(hence I don't believe that it's extremely useful to know
exactly which Cook's D is largest).
Partly because there are many statistics that can be derived from a
multiple regression fit all of which are influenced in some way.
AFAIK, all observation-influence measures g(i) are functions of
(r_i, h_{ii}) and the latter are the quantities that "regression
users" should really know {without consulting a text book} and
that are generalizable {e.g. to "linear smoothers" such as
gam()s (for "non-estimated" smoothing parameter)}.
Martin
On 28 Apr 2005, at 1:30 AM, Martin Maechler wrote:
"PD" == Peter Dalgaard <p.dalgaard@biostat.ku.dk>
on 27 Apr 2005 16:54:02 +0200 writes:
PD> Martin Maechler <maechler@stat.math.ethz.ch> writes:
I'm about to commit the current proposal(s) to R-devel,
**INCLUDING** changing the default from
'which = 1:4' to 'which = c(1:3,5)
and ellicit feedback starting from there.
One thing I think I would like is to use color for the Cook's
contours in the new 4th plot.
PD> Hmm. First try running example(plot.lm) with the modified
function and
PD> tell me which observation has the largest Cook's D. With the
suggested
PD> new 4th plot it is very hard to tell whether obs #49 is
potentially or
PD> actually influential. Plots #1 and #3 are very close to
conveying the
PD> same information though...
I shouldn't be teaching here, and I know that I'm getting into fighted
territory (regression diagnostics; robustness; "The" Truth, etc,etc)
but I believe there is no unique way to define "actually influential"
(hence I don't believe that it's extremely useful to know
exactly which Cook's D is largest).
Partly because there are many statistics that can be derived from a
multiple regression fit all of which are influenced in some way.
AFAIK, all observation-influence measures g(i) are functions of
(r_i, h_{ii}) and the latter are the quantities that "regression
users" should really know {without consulting a text book} and
that are generalizable {e.g. to "linear smoothers" such as
gam()s (for "non-estimated" smoothing parameter)}.
Martin
I agree with Martin. I like the idea of using color (red?) for
the new Cook's contours. People who want (fairly) precise
comparisons of the Cook's statistics can still use the present
plot #4, perhaps as a follow-up to the new plot #5.
It would be possible to label the Cookwise most extreme
points with the actual values (to perhaps 2sig figures, i.e.,
labeling on both sides of such points), but this would add
what I consider is unnecessary clutter to the graph.
John.
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
Dear John et al.,
Curiously, Georges Monette (at York University in Toronto) and I were just
talking last week about influence-statistic contours, and I wrote a couple
of functions to show these for Cook's D and for covratio as functions of
hat-values and studentized residuals. These differ a bit from the ones
previously discussed here in that they show rule-of-thumb cut-offs for D and
covratio, along with Bonferroni critical values for studentized residuals.
I've attached a file with these functions, even though they're not that
polished.
More generally, I wonder whether it's not best to supply plots like these as
separate functions rather than as a do-it-all plot method for lm objects.
Regards,
John
--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
-----Original Message-----
From: r-devel-bounces@stat.math.ethz.ch
[mailto:r-devel-bounces@stat.math.ethz.ch] On Behalf Of John
Maindonald
Sent: Wednesday, April 27, 2005 7:54 PM
To: Martin Maechler
Cc: David Firth; Werner Stahel; r-devel@stat.math.ethz.ch;
Peter Dalgaard
Subject: Re: [Rd] Enhanced version of plot.lm()
On 28 Apr 2005, at 1:30 AM, Martin Maechler wrote:
"PD" == Peter Dalgaard <p.dalgaard@biostat.ku.dk>
on 27 Apr 2005 16:54:02 +0200 writes:
PD> Martin Maechler <maechler@stat.math.ethz.ch> writes:
I'm about to commit the current proposal(s) to R-devel,
**INCLUDING** changing the default from 'which = 1:4' to 'which =
c(1:3,5)
and ellicit feedback starting from there.
One thing I think I would like is to use color for the Cook's
contours in the new 4th plot.
PD> Hmm. First try running example(plot.lm) with the modified
function and
PD> tell me which observation has the largest Cook's D.
With the
suggested
PD> new 4th plot it is very hard to tell whether obs #49 is
potentially or
PD> actually influential. Plots #1 and #3 are very close to
conveying the
PD> same information though...
I shouldn't be teaching here, and I know that I'm getting
but I believe there is no unique way to define "actually
influential"
(hence I don't believe that it's extremely useful to know exactly
which Cook's D is largest).
Partly because there are many statistics that can be derived from a
multiple regression fit all of which are influenced in some way.
AFAIK, all observation-influence measures g(i) are
functions of (r_i,
h_{ii}) and the latter are the quantities that "regression users"
should really know {without consulting a text book} and that are
generalizable {e.g. to "linear smoothers" such as gam()s (for
"non-estimated" smoothing parameter)}.
Martin
I agree with Martin. I like the idea of using color (red?)
for the new Cook's contours. People who want (fairly)
precise comparisons of the Cook's statistics can still use
the present plot #4, perhaps as a follow-up to the new plot #5.
It would be possible to label the Cookwise most extreme
points with the actual values (to perhaps 2sig figures, i.e.,
labeling on both sides of such points), but this would add
what I consider is unnecessary clutter to the graph.
John.
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194, John Dedman
Mathematical Sciences Building (Building 27) Australian
National University, Canberra ACT 0200.
NB also the mention of a possible addition to stats: vif()
Dear John -
I think users can cope with six plots offered by one function,
with four of them given by default, and the two remaining
plots alternative ways of presenting the information in the
final default plot. The idea of plot.lm() was to provide a
set of plots that would serve most basic purposes.
It may be reasonable to have a suite of plots for
examining residuals and influence. I'd suggest
trying to follow the syntax and labeling conventions
as for plot.lm(), unless these seem inappropriate.
While on such matters, there is a function vif() in DAAG,
and a more comprehensive function vif() in car. One of
these, probably yours if you are willing, should go into
stats. There's one addition that I'd make; allow a model
matrix as parameter, as an optional alternative to giving
the model object.
Regards
John M.
On 28 Apr 2005, at 10:39 PM, John Fox wrote:
Dear John et al.,
Curiously, Georges Monette (at York University in Toronto) and I were
just
talking last week about influence-statistic contours, and I wrote a
couple
of functions to show these for Cook's D and for covratio as functions
of
hat-values and studentized residuals. These differ a bit from the ones
previously discussed here in that they show rule-of-thumb cut-offs for
D and
covratio, along with Bonferroni critical values for studentized
residuals.
I've attached a file with these functions, even though they're not that
polished.
More generally, I wonder whether it's not best to supply plots like
these as
separate functions rather than as a do-it-all plot method for lm
objects.
Regards,
John
--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
-----Original Message-----
From: r-devel-bounces@stat.math.ethz.ch
[mailto:r-devel-bounces@stat.math.ethz.ch] On Behalf Of John
Maindonald
Sent: Wednesday, April 27, 2005 7:54 PM
To: Martin Maechler
Cc: David Firth; Werner Stahel; r-devel@stat.math.ethz.ch;
Peter Dalgaard
Subject: Re: [Rd] Enhanced version of plot.lm()
On 28 Apr 2005, at 1:30 AM, Martin Maechler wrote:
"PD" == Peter Dalgaard <p.dalgaard@biostat.ku.dk>
on 27 Apr 2005 16:54:02 +0200 writes:
PD> Martin Maechler <maechler@stat.math.ethz.ch> writes:
I'm about to commit the current proposal(s) to R-devel,
**INCLUDING** changing the default from 'which = 1:4' to 'which =
c(1:3,5)
and ellicit feedback starting from there.
One thing I think I would like is to use color for the Cook's
contours in the new 4th plot.
PD> Hmm. First try running example(plot.lm) with the modified
function and
PD> tell me which observation has the largest Cook's D.
With the
suggested
PD> new 4th plot it is very hard to tell whether obs #49 is
potentially or
PD> actually influential. Plots #1 and #3 are very close to
conveying the
PD> same information though...
I shouldn't be teaching here, and I know that I'm getting
but I believe there is no unique way to define "actually
influential"
(hence I don't believe that it's extremely useful to know exactly
which Cook's D is largest).
Partly because there are many statistics that can be derived from a
multiple regression fit all of which are influenced in some way.
AFAIK, all observation-influence measures g(i) are
functions of (r_i,
h_{ii}) and the latter are the quantities that "regression users"
should really know {without consulting a text book} and that are
generalizable {e.g. to "linear smoothers" such as gam()s (for
"non-estimated" smoothing parameter)}.
Martin
I agree with Martin. I like the idea of using color (red?)
for the new Cook's contours. People who want (fairly)
precise comparisons of the Cook's statistics can still use
the present plot #4, perhaps as a follow-up to the new plot #5.
It would be possible to label the Cookwise most extreme
points with the actual values (to perhaps 2sig figures, i.e.,
labeling on both sides of such points), but this would add
what I consider is unnecessary clutter to the graph.
John.
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194, John Dedman
Mathematical Sciences Building (Building 27) Australian
National University, Canberra ACT 0200.
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
-----Original Message-----
From: John Maindonald [mailto:john.maindonald@anu.edu.au]
Sent: Thursday, April 28, 2005 6:47 PM
To: John Fox
Cc: 'Werner Stahel'; 'Peter Dalgaard';
<r-devel@stat.math.ethz.ch>; 'David Firth'; 'Martin Maechler'
Subject: Re: [Rd] Enhanced version of plot.lm()
NB also the mention of a possible addition to stats: vif()
Dear John -
I think users can cope with six plots offered by one
function, with four of them given by default, and the two
remaining plots alternative ways of presenting the
information in the final default plot. The idea of plot.lm()
was to provide a set of plots that would serve most basic purposes.
I rather like added-variable plots for examining influence and leverage on
coefficients.
It may be reasonable to have a suite of plots for examining
residuals and influence. I'd suggest trying to follow the
syntax and labeling conventions as for plot.lm(), unless
these seem inappropriate.
I don't have strong feelings about this -- I certainly don't think that the
suggestion is inappropriate.
While on such matters, there is a function vif() in DAAG, and
a more comprehensive function vif() in car. One of these,
probably yours if you are willing, should go into stats.
I'd have no objection to that.
There's one addition that I'd make; allow a model matrix as
parameter, as an optional alternative to giving the model object.
That seems reasonable -- for linear models, anyway. The current approach
works (at least arguably) for generalized linear models as well. My only
hesitation is that having just the model matrix doesn't insure that the
model is a linear model. With this caveat, I should be able to handle model
matrices by adding a matrix method to vif (and perhaps printing a warning).
I'll probably do that when I next revise the car package.
Thanks for the suggestion.
John
Regards
John M.
On 28 Apr 2005, at 10:39 PM, John Fox wrote:
Dear John et al.,
Curiously, Georges Monette (at York University in Toronto)
and I were
just talking last week about influence-statistic contours,
and I wrote
a couple of functions to show these for Cook's D and for
covratio as
functions of hat-values and studentized residuals. These
differ a bit
from the ones previously discussed here in that they show
rule-of-thumb cut-offs for D and covratio, along with Bonferroni
critical values for studentized residuals.
I've attached a file with these functions, even though they're not
that polished.
More generally, I wonder whether it's not best to supply plots like
these as separate functions rather than as a do-it-all plot
method for
lm objects.
Regards,
John
--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
-----Original Message-----
From: r-devel-bounces@stat.math.ethz.ch
[mailto:r-devel-bounces@stat.math.ethz.ch] On Behalf Of John
Maindonald
Sent: Wednesday, April 27, 2005 7:54 PM
To: Martin Maechler
Cc: David Firth; Werner Stahel; r-devel@stat.math.ethz.ch; Peter
Dalgaard
Subject: Re: [Rd] Enhanced version of plot.lm()
On 28 Apr 2005, at 1:30 AM, Martin Maechler wrote:
"PD" == Peter Dalgaard <p.dalgaard@biostat.ku.dk>
on 27 Apr 2005 16:54:02 +0200 writes:
PD> Martin Maechler <maechler@stat.math.ethz.ch> writes:
I'm about to commit the current proposal(s) to R-devel,
**INCLUDING** changing the default from 'which = 1:4'
to 'which =
c(1:3,5)
and ellicit feedback starting from there.
One thing I think I would like is to use color for the Cook's
contours in the new 4th plot.
PD> Hmm. First try running example(plot.lm) with the modified
function and
PD> tell me which observation has the largest Cook's D.
With the
suggested
PD> new 4th plot it is very hard to tell whether obs #49 is
potentially or
PD> actually influential. Plots #1 and #3 are very close to
conveying the
PD> same information though...
I shouldn't be teaching here, and I know that I'm getting
but I believe there is no unique way to define "actually
influential"
(hence I don't believe that it's extremely useful to know exactly
which Cook's D is largest).
Partly because there are many statistics that can be
derived from a
multiple regression fit all of which are influenced in some way.
AFAIK, all observation-influence measures g(i) are
functions of (r_i,
h_{ii}) and the latter are the quantities that "regression users"
should really know {without consulting a text book} and that are
generalizable {e.g. to "linear smoothers" such as gam()s (for
"non-estimated" smoothing parameter)}.
Martin
I agree with Martin. I like the idea of using color
(red?) for the
new Cook's contours. People who want (fairly) precise
comparisons of
the Cook's statistics can still use the present plot #4,
perhaps as a
follow-up to the new plot #5.
It would be possible to label the Cookwise most extreme
points with
the actual values (to perhaps 2sig figures, i.e., labeling on both
sides of such points), but this would add what I consider is
unnecessary clutter to the graph.
John.
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194, John Dedman
Mathematical Sciences Building (Building 27) Australian National
University, Canberra ACT 0200.
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194, John Dedman
Mathematical Sciences Building (Building 27) Australian
National University, Canberra ACT 0200.
-----Original Message-----
From: John Maindonald [mailto:john.maindonald@anu.edu.au]
Sent: Thursday, April 28, 2005 6:47 PM
To: John Fox
Cc: 'Werner Stahel'; 'Peter Dalgaard';
<r-devel@stat.math.ethz.ch>; 'David Firth'; 'Martin Maechler'
Subject: Re: [Rd] Enhanced version of plot.lm()
NB also the mention of a possible addition to stats: vif()
Dear John -
I think users can cope with six plots offered by one
function, with four of them given by default, and the two
remaining plots alternative ways of presenting the
information in the final default plot. The idea of plot.lm()
was to provide a set of plots that would serve most basic purposes.
I rather like added-variable plots for examining influence and
leverage on
coefficients.
I think that plots of this type are compulsory. termplot() is
a pretty good start.
John M.
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
Dear John,
I agree that component+residual (partial residual) plots, as produced by
termplot(), should be examined, but these are distinct from added-variable
(partial regression) plots.
Regards,
John
--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
-----Original Message-----
From: John Maindonald [mailto:john.maindonald@anu.edu.au]
Sent: Friday, April 29, 2005 2:51 AM
To: John Fox
Cc: r-devel@stat.math.ethz.ch
Subject: Re: [Rd] Enhanced version of plot.lm()
On 29 Apr 2005, at 10:08 AM, John Fox wrote:
Dear John
-----Original Message-----
From: John Maindonald [mailto:john.maindonald@anu.edu.au]
Sent: Thursday, April 28, 2005 6:47 PM
To: John Fox
Cc: 'Werner Stahel'; 'Peter Dalgaard';
<r-devel@stat.math.ethz.ch>;
'David Firth'; 'Martin Maechler'
Subject: Re: [Rd] Enhanced version of plot.lm()
NB also the mention of a possible addition to stats: vif()
Dear John -
I think users can cope with six plots offered by one
function, with
four of them given by default, and the two remaining plots
alternative ways of presenting the information in the
final default
plot. The idea of plot.lm() was to provide a set of plots
that would
serve most basic purposes.
I rather like added-variable plots for examining influence and
leverage on coefficients.
I think that plots of this type are compulsory. termplot()
is a pretty good start.
John M.
John Maindonald email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194, John Dedman
Mathematical Sciences Building (Building 27) Australian
National University, Canberra ACT 0200.