Skip to content

Enhanced version of plot.lm()

13 messages · Martin Maechler, Peter Dalgaard, Brian Ripley +2 more

#
The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/
now includes files:
plot.lm.RData: Image for file for plot6.lm, a version of plot.lm in 
which
   David Firth's Cook's distance vs leverage/(1-leverage) plot is plot 6.
   The tick labels are in units of leverage, and the contour labels are
   in units of absolute values of the standardized residual.

plot6.lm.Rd file: A matching help file

Comments will be welcome.

Another issue, discussed recently on r-help, is that when the model
formula is long, the default sub.caption=deparse(x$call) is broken
into multiple text elements and overwrites.  The only clean and
simple way that I can see to handle is to set a default that tests
whether the formula is broken into multiple text elements, and if it is
then omit it.  Users can then use their own imaginative skills, and
such suggestions as have been made on r-help, to construct
whatever form of labeling best suits their case, their imaginative
skills and their coding skills.

John Maindonald.
On 25 Apr 2005, at 8:00 PM, David Firth wrote:

            
John Maindonald             email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
#
JMd> The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/
    JMd> now includes files:
    JMd> plot.lm.RData: Image for file for plot6.lm, a version of plot.lm in 
    JMd> which
    JMd> David Firth's Cook's distance vs leverage/(1-leverage) plot is plot 6.
    JMd> The tick labels are in units of leverage, and the contour labels are
    JMd> in units of absolute values of the standardized residual.

    JMd> plot6.lm.Rd file: A matching help file

    JMd> Comments will be welcome.

Thank you John!

The *.Rd has the new references and a new example but
is not quite complete: the \usage{} has only 4 captions,
\arguments{ .. \item{which} ..}  only mentions '1:5' --- but
never mind.

One of the new examples is

    ## Replace Cook's distance plot by Residual-Leverage plot
    plot(lm.SR, which=c(1:3, 5))

and -- conceptually I'd really like to change the default from
'which = 1:4' to the above
'which=c(1:3, 5))' 

This would be non-compatible though for all those that have
always used the current default 1:4. 
OTOH, "MASS" or Peter Dalgaard's book don't mention  plot(<lm fit> )
or at least don't show it's result.

What do others think?
How problematic would a change be in the default plots that
plot.lm() produces?


    JMd> Another issue, discussed recently on r-help, is that when the model
    JMd> formula is long, the default sub.caption=deparse(x$call) is broken
    JMd> into multiple text elements and overwrites.  
good point!

    JMd>  The only clean and simple way that I can see to handle
    JMd> is to set a default that tests whether the formula is
    JMd> broken into multiple text elements, and if it is then
    JMd> omit it.  Users can then use their own imaginative
    JMd> skills, and such suggestions as have been made on
    JMd> r-help, to construct whatever form of labeling best
    JMd> suits their case, their imaginative skills and their
    JMd> coding skills.

Hmm, yes, but I think we (R programmers) could try a bit harder
to provide a reasonable default, e.g., something along
 
 cap <- deparse(x$call, width.cutoff = 500)[1]
 if((nc <- nchar(cap)) > 53)	    
     cap <- paste(substr(cap, 1, 50), "....", substr(cap, nc-2, nc))

{untested;  some of the details will differ;
 and the '53', '50' could depend on par("..") measures}


    JMd> John Maindonald.
JMd> On 25 Apr 2005, at 8:00 PM, David Firth wrote:
>> From: David Firth <d.firth@warwick.ac.uk>
    >> Date: 24 April 2005 10:23:51 PM
    >> To: John Maindonald <john.maindonald@anu.edu.au>
    >> Cc: r-devel@stat.math.ethz.ch
    >> Subject: Re: [Rd] Enhanced version of plot.lm()
    >> 
    >>
>> On 24 Apr 2005, at 05:37, John Maindonald wrote:
>> 
    >>> I'd not like to lose the signs of the residuals. Also, as
    >>> plots 1-3 focus on residuals, there is less of a mental
    >>> leap in moving to residuals vs leverage; residuals vs
    >>> leverage/(1-leverage) would also be in the same spirit.
    >> 
    >> Yes, I know what you mean.  Mental leaps are a matter of 
    >> taste...pitfalls, etc, come to mind.
    >> 
    >>> 
    >>> Maybe, one way or another, both plots (residuals vs
    >>> a function of leverage, and the plot from Hinkley et al)
    >>> should go in.  The easiest way to do this is to add a
    >>> further which=6.  I will do this if the consensus is that
    >>> this is the right way to go.  In any case, I'll add the
    >>> Hinkley et al reference (author of the contribution that
    >>> includes p.74?) to the draft help page.
    >> 
    >> Sorry, I should have given the full reference, which (in BibTeX format 
    >> from CIS) is
    >> 
    >> @inproceedings{Firt:gene:1991,
    >> author = {Firth, D.},
    >> title = {Generalized Linear Models},
    >> year = {1991},
    >> booktitle = {Statistical Theory and Modelling. In Honour of Sir 
    >> David Cox, FRS},
    >> editor = {Hinkley, D. V. and Reid, N. and Snell, E. J.},
    >> publisher = {Chapman \& Hall Ltd},
    >> pages = {55--82},
    >> keywords = {Analysis of deviance; Likelihood}
    >> }
    >> 
    >> David
    >> 
    JMd> John Maindonald             email: john.maindonald@anu.edu.au
    JMd> phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
    JMd> Centre for Bioinformation Science, Room 1194,
    JMd> John Dedman Mathematical Sciences Building (Building 27)
    JMd> Australian National University, Canberra ACT 0200.

    JMd> ______________________________________________
    JMd> R-devel@stat.math.ethz.ch mailing list
    JMd> https://stat.ethz.ch/mailman/listinfo/r-devel
JMd> The web page
    JMd> http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/ now
    JMd> includes files: plot.lm.RData: Image for file for
    JMd> plot6.lm, a version of plot.lm in which David Firth's
    JMd> Cook's distance vs leverage/(1-leverage) plot is plot
    JMd> 6.  The tick labels are in units of leverage, and the
    JMd> contour labels are in units of absolute values of the
    JMd> standardized residual.

    JMd> plot6.lm.Rd file: A matching help file

    JMd> Comments will be welcome.

    JMd> Another issue, discussed recently on r-help, is that
    JMd> when the model formula is long, the default
    JMd> sub.caption=deparse(x$call) is broken into multiple
    JMd> text elements and overwrites.  The only clean and
    JMd> simple way that I can see to handle is to set a default
    JMd> that tests whether the formula is broken into multiple
    JMd> text elements, and if it is then omit it.  Users can
    JMd> then use their own imaginative skills, and such
    JMd> suggestions as have been made on r-help, to construct
    JMd> whatever form of labeling best suits their case, their
    JMd> imaginative skills and their coding skills.

    JMd> John Maindonald.
JMd> On 25 Apr 2005, at 8:00 PM, David Firth wrote:
>> From: David Firth <d.firth@warwick.ac.uk> Date: 24 April
    >> 2005 10:23:51 PM To: John Maindonald
    >> <john.maindonald@anu.edu.au> Cc:
    >> r-devel@stat.math.ethz.ch Subject: Re: [Rd] Enhanced
    >> version of plot.lm()
    >> 
    >>
>> On 24 Apr 2005, at 05:37, John Maindonald wrote:
>> 
    >>> I'd not like to lose the signs of the residuals. Also,
    >>> as plots 1-3 focus on residuals, there is less of a
    >>> mental leap in moving to residuals vs leverage;
    >>> residuals vs leverage/(1-leverage) would also be in the
    >>> same spirit.
    >>  Yes, I know what you mean.  Mental leaps are a matter of
    >> taste...pitfalls, etc, come to mind.
    >> 
    >>>  Maybe, one way or another, both plots (residuals vs a
    >>> function of leverage, and the plot from Hinkley et al)
    >>> should go in.  The easiest way to do this is to add a
    >>> further which=6.  I will do this if the consensus is
    >>> that this is the right way to go.  In any case, I'll add
    >>> the Hinkley et al reference (author of the contribution
    >>> that includes p.74?) to the draft help page.
    >>  Sorry, I should have given the full reference, which (in
    >> BibTeX format from CIS) is
    >> 
    >> @inproceedings{Firt:gene:1991, author = {Firth, D.},
    >> title = {Generalized Linear Models}, year = {1991},
    >> booktitle = {Statistical Theory and Modelling. In Honour
    >> of Sir David Cox, FRS}, editor = {Hinkley, D. V. and
    >> Reid, N. and Snell, E. J.}, publisher = {Chapman \& Hall
    >> Ltd}, pages = {55--82}, keywords = {Analysis of deviance;
    >> Likelihood} }
    >> 
    >> David
    >> 
    JMd> John Maindonald email: john.maindonald@anu.edu.au phone
    JMd> : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for
    JMd> Bioinformation Science, Room 1194, John Dedman
    JMd> Mathematical Sciences Building (Building 27) Australian
    JMd> National University, Canberra ACT 0200.

    JMd> ______________________________________________
    JMd> R-devel@stat.math.ethz.ch mailing list
    JMd> https://stat.ethz.ch/mailman/listinfo/r-devel
#
Martin Maechler <maechler@stat.math.ethz.ch> writes:
Ummm, check page 183...
#
On Tue, 26 Apr 2005, Peter Dalgaard wrote:

            
OTOH MASS does not because of S-PLUS/R differences.
1 day later
#
JMd> The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/
    JMd> now includes files:
    JMd> plot.lm.RData: Image for file for plot6.lm, a version of plot.lm in 
    JMd> which
    JMd> David Firth's Cook's distance vs leverage/(1-leverage) plot is plot 6.
    JMd> The tick labels are in units of leverage, and the contour labels are
    JMd> in units of absolute values of the standardized residual.

    JMd> plot6.lm.Rd file: A matching help file

    JMd> Comments will be welcome.

    MM> Thank you John!

    MM> The *.Rd has the new references and a new example but
    MM> is not quite complete: the \usage{} has only 4 captions,
    MM> \arguments{ .. \item{which} ..}  only mentions '1:5' --- but
    MM> never mind.

    MM> One of the new examples is

    MM> ## Replace Cook's distance plot by Residual-Leverage plot
    MM> plot(lm.SR, which=c(1:3, 5))

    MM> and -- conceptually I'd really like to change the default from
    MM> 'which = 1:4' to the above
    MM> 'which=c(1:3, 5))' 

    MM> This would be non-compatible though for all those that have
    MM> always used the current default 1:4. 
    MM> OTOH, "MASS" or Peter Dalgaard's book don't mention  plot(<lm fit> )
    MM> or at least don't show it's result.

    MM> What do others think?
    MM> How problematic would a change be in the default plots that
    MM> plot.lm() produces?


    JMd> Another issue, discussed recently on r-help, is that when the model
    JMd> formula is long, the default sub.caption=deparse(x$call) is broken
    JMd> into multiple text elements and overwrites.  
    MM> good point!

    JMd> The only clean and simple way that I can see to handle
    JMd> is to set a default that tests whether the formula is
    JMd> broken into multiple text elements, and if it is then
    JMd> omit it.  Users can then use their own imaginative
    JMd> skills, and such suggestions as have been made on
    JMd> r-help, to construct whatever form of labeling best
    JMd> suits their case, their imaginative skills and their
    JMd> coding skills.

    MM> Hmm, yes, but I think we (R programmers) could try a bit harder
    MM> to provide a reasonable default, e.g., something along
 
    MM> cap <- deparse(x$call, width.cutoff = 500)[1]
    MM> if((nc <- nchar(cap)) > 53)	    
    MM>   cap <- paste(substr(cap, 1, 50), "....", substr(cap, nc-2, nc))

    MM> {untested;  some of the details will differ;
    MM> and the '53', '50' could depend on par("..") measures}

In the mean time, I came to quite a nice way of doing this:

    if(is.null(sub.caption)) { ## construct a default:
        cal <- x$call
        if (!is.na(m.f <- match("formula", names(cal)))) {
            cal <- cal[c(1, m.f)]
            names(cal)[2] <- "" # drop  " formula = "
        }
        cc <- deparse(cal, 80)
        nc <- nchar(cc[1])
        abbr <- length(cc) > 1 || nc > 75
        sub.caption <-
            if(abbr) paste(substr(cc[1], 1, min(75,nc)), "...") else cc[1]
    }


I'm about to commit the current proposal(s) to R-devel,
**INCLUDING** changing the default from 
	      'which = 1:4' to 'which = c(1:3,5)

and ellicit feedback starting from there.

One thing I think I would like is to use color for the Cook's
contours in the new 4th plot.

Martin


	<.............. lots deleted ..........>
#
Martin Maechler <maechler@stat.math.ethz.ch> writes:
Hmm. First try running example(plot.lm) with the modified function and
tell me which observation has the largest Cook's D. With the suggested
new 4th plot it is very hard to tell whether obs #49 is potentially or
actually influential. Plots #1 and #3 are very close to conveying the
same information though...
#
PD> Martin Maechler <maechler@stat.math.ethz.ch> writes:
    >> I'm about to commit the current proposal(s) to R-devel,
    >> **INCLUDING** changing the default from 
    >> 'which = 1:4' to 'which = c(1:3,5)
    >> 
    >> and ellicit feedback starting from there.
    >> 
    >> One thing I think I would like is to use color for the Cook's
    >> contours in the new 4th plot.

    PD> Hmm. First try running example(plot.lm) with the modified function and
    PD> tell me which observation has the largest Cook's D. With the suggested
    PD> new 4th plot it is very hard to tell whether obs #49 is potentially or
    PD> actually influential. Plots #1 and #3 are very close to conveying the
    PD> same information though...

I shouldn't be teaching here, and I know that I'm getting into fighted
territory (regression diagnostics; robustness; "The" Truth, etc,etc)
but I believe there is no unique way to define "actually influential"
(hence I don't believe that it's extremely useful to know
exactly which Cook's D is largest).

Partly because there are many statistics that can be derived from a
multiple regression fit all of which are influenced in some way. 
AFAIK, all observation-influence measures g(i) are functions of
(r_i, h_{ii}) and the latter are the quantities that "regression
users" should really know {without consulting a text book} and
that are generalizable {e.g. to "linear smoothers" such as
gam()s (for "non-estimated" smoothing parameter)}.

Martin
#
On 28 Apr 2005, at 1:30 AM, Martin Maechler wrote:

            
I agree with Martin.  I like the idea of using color (red?) for
the new Cook's contours.  People who want (fairly) precise
comparisons of the Cook's statistics can still use the present
plot #4, perhaps as a follow-up to the new plot #5.
It would be possible to label the Cookwise most extreme
points with the actual values (to perhaps 2sig figures, i.e.,
labeling on both sides of such points), but this would add
what I consider is unnecessary clutter to the graph.

John.

John Maindonald             email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
#
Dear John et al.,

Curiously, Georges Monette (at York University in Toronto) and I were just
talking last week about influence-statistic contours, and I wrote a couple
of functions to show these for Cook's D and for covratio as functions of
hat-values and studentized residuals. These differ a bit from the ones
previously discussed here in that they show rule-of-thumb cut-offs for D and
covratio, along with Bonferroni critical values for studentized residuals. 

I've attached a file with these functions, even though they're not that
polished.

More generally, I wonder whether it's not best to supply plots like these as
separate functions rather than as a do-it-all plot method for lm objects.

Regards,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
--------------------------------
#
NB also the mention of a possible addition to stats: vif()

Dear John -
I think users can cope with six plots offered by one function,
with four of them given by default, and the two remaining
plots alternative ways of presenting the information in the
final default plot.  The idea of plot.lm() was to provide a
set of plots that would serve most basic purposes.

It may be reasonable to have a suite of plots for
examining residuals and influence.  I'd suggest
trying to follow the syntax and labeling conventions
as for plot.lm(), unless these seem inappropriate.

While on such matters, there is a function vif() in DAAG,
and a more comprehensive function vif() in car.  One of
these, probably yours if you are willing, should go into
stats.  There's one addition that I'd make; allow a model
matrix as parameter, as an optional alternative to giving
the model object.
Regards
John M.
On 28 Apr 2005, at 10:39 PM, John Fox wrote:

            
John Maindonald             email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
#
Dear John
I rather like added-variable plots for examining influence and leverage on
coefficients.
I don't have strong feelings about this -- I certainly don't think that the
suggestion is inappropriate.
I'd have no objection to that.
That seems reasonable -- for linear models, anyway. The current approach
works (at least arguably) for generalized linear models as well. My only
hesitation is that having just the model matrix doesn't insure that the
model is a linear model. With this caveat, I should be able to handle model
matrices by adding a matrix method to vif (and perhaps printing a warning).
I'll probably do that when I next revise the car package.

Thanks for the suggestion.
 John
#
On 29 Apr 2005, at 10:08 AM, John Fox wrote:

            
I think that plots of this type are compulsory.  termplot() is
a pretty good start.

John M.
John Maindonald             email: john.maindonald@anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
#
Dear John,

I agree that component+residual (partial residual) plots, as produced by
termplot(), should be examined, but these are distinct from added-variable
(partial regression) plots.

Regards,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
--------------------------------