outlier
9 messages · kan Liu, Spencer Graves, Duncan Murdoch +1 more
On Tue, 17 Jun 2003, kan Liu wrote:
I want to calculate the R-squared between two variables. Can you advise me on how to identify and remove outliers before performing the R-squared calculation?
Easy: you don't. It makes no sense to consider R^2 after arbitrary outlier removal: if I remove all but two points I get R^2 = 1! R^2 is normally used to measure the success of a multiple regression, but as you mention two variables, did you just mean the Pearson product-moment correlation? It makes more sense to use a robust measure of correlation, as in cov.rob (package lqs), or even the Spearman or Kendall measures (cor.test in package ctest). If you intended to do this for a multiple regression, you need to do some sort of robust regression and use a robust measure of fit.
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
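[A minimal sketch of the comparison Ripley suggests, on simulated data. In current R, cov.rob lives in package MASS (the old lqs package was merged into it); the data and the planted outlier below are made up for illustration.]

```r
## Sketch: classical vs robust correlation in the presence of one outlier.
## cov.rob is in package MASS in current R (it was in lqs when this thread
## was written); the data below are simulated, for illustration only.
library(MASS)

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.3)
x[1] <- 10                                   # plant a single gross outlier
y[1] <- -10

cor(x, y)                                    # Pearson: distorted by one point
cov.rob(cbind(x, y), cor = TRUE)$cor[1, 2]   # robust: close to the truth
cor(x, y, method = "spearman")               # rank-based alternative
```

Note that the robust estimate compensates for the outlier rather than requiring anyone to decide which points to delete.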
It is also wise to make scatterplots, as shown by the famous example of 4 scatterplots with the same R^2, where the first shows the standard ellipsoid pattern implied by the assumptions while the other three indicate very clearly that the assumptions are incorrect. See Anscombe (1973), "Graphs in Statistical Analysis", The American Statistician, 27: 17-22, reproduced in, e.g., du Toit, Steyn and Stumpf (1986), Graphical Exploratory Data Analysis (Springer). hth. spencer graves
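[The quartet Graves describes ships with R as the built-in `anscombe` data frame, so the point is easy to check directly:]

```r
## Sketch: Anscombe's quartet is the built-in 'anscombe' data frame in R.
## All four x-y pairs have essentially the same correlation, even though
## their scatterplots look completely different.
data(anscombe)
r <- sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
round(r, 3)        # all approximately 0.816
```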
Hi, many thanks for your advice; I appreciate it very much. Maybe I can make the question clearer: I want to evaluate the correlation between two variables: one is the actual output of a system, the other is the predicted value of that output from a neural network. When I made scatterplots in Excel, I could get the linear equation and the corresponding R-squared. At the bottom of the page http://www.statsoftinc.com/textbook/stathome.html, it is mentioned that outliers can sometimes bias the correlation coefficient. So I thought it might be worth removing outliers before calculating R-squared in R. That seems to be a bad idea according to your comments. Now can you comment on how to evaluate the performance of the neural network model in predicting the actual outputs? Kan
On Tue, 17 Jun 2003, kan Liu wrote:
Hi, many thanks for your advice; I appreciate it very much. Maybe I can make the question clearer: I want to evaluate the correlation between two variables: one is the actual output of a system, the other is the predicted value of that output from a neural network. When I made scatterplots in Excel, I could get the linear equation and the corresponding R-squared. At the bottom of the page http://www.statsoftinc.com/textbook/stathome.html, it is mentioned that outliers can sometimes bias the correlation coefficient. So I thought it might be worth removing outliers before calculating R-squared in R. That seems to be a bad idea according to your comments.
Yes. That's the whole point of robust methods: compensate rather than reject.
Now can you comment on how to evaluate the performance of the neural network model in predicting the actual outputs?
If you are interested in correlation coefficients, use cov.rob. However, this is predicted vs actual, and you probably do want to penalize bad predictions, not reject them. It's up to you to choose a suitable loss function for your application. In particular, if the predicted values were always 1e-45 times the actual values minus 1e310, the correlation would be one and the predictions would be derisory.
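[Ripley's warning is easy to demonstrate: correlation is invariant to linear rescaling, so it can be perfect while the predictions are useless. A toy sketch, with made-up numbers (kept finite, unlike the 1e310 hyperbole above):]

```r
## Sketch: correlation ignores scale and offset, so it can be 1 while the
## predictions themselves are useless; a loss such as RMSE is not fooled.
## Toy numbers, for illustration only.
actual <- c(1, 2, 3, 4, 5)
pred   <- 0.001 * actual - 100    # exactly linear in 'actual', wildly wrong
cor(actual, pred)                 # 1: the linear distortion is invisible
sqrt(mean((actual - pred)^2))     # RMSE makes the failure obvious
```

This is why a loss function chosen for the application (RMSE, mean absolute error, or similar) is a better report of predictive performance than a correlation coefficient.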
Which R environment variable can be used to point to x1.R so that it can be sourced from any directory? Kan
On Wed, 18 Jun 2003 11:09:41 +0100 (BST), you wrote:
Which R environment variable can be used to point to x1.R so that it can be sourced from any directory?
I don't think there is such a thing. source() looks in the current working directory; it doesn't use environment variables to do a wider search.
Of course, you can always make up your own variable, and then do something like

source(paste(Sys.getenv('MYDIR'), '/x1.R', sep = ''))

and that will work from anywhere if MYDIR is defined properly.
Duncan Murdoch
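[A self-contained sketch of Murdoch's suggestion, using file.path instead of paste for the path construction. MYDIR is an arbitrary variable name; Sys.setenv and the temporary file are used here only so the example runs on its own, where normally MYDIR would be set in the shell before starting R.]

```r
## Sketch: source a file located via a user-defined environment variable.
## Sys.setenv and tempdir() make the example self-contained; in practice
## MYDIR would be exported in the shell before R starts.
Sys.setenv(MYDIR = tempdir())
writeLines("x1_result <- 42", file.path(Sys.getenv("MYDIR"), "x1.R"))

source(file.path(Sys.getenv("MYDIR"), "x1.R"))  # works from any directory
x1_result
```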
I wrote a .R file (see below) to calculate a robust measure of correlation using cov.rob. I got different correlation coefficients (0.70, 0.79, 0.63, ...) when I ran the file at different times. Can you tell me what this means, or what is wrong in my use of cov.rob?
-------------
library(lqs)
a <- c(5.41,4.67,5.88,2.38,4.79,5.30,1.94,3.40,5.05,3.31,5.88,4.92,5.08,4.58,4.59,4.77,5.25,3.77,2.88,5.30,5.32,2.56,4.29,5.54,4.53,3.51,4.93,2.49,2.85,5.04,2.51,2.60,3.58,2.11,1.70,5.20,5.08,4.48,3.96,4.87,4.98,2.56,1.69,4.28,1.70,2.91,5.37,2.16,3.04,1.69,1.88,5.36,1.70,3.81,1.70,5.88,3.52)
p <- c(5.30,4.78,4.79,0.62,4.32,2.33,0.64,3.14,3.06,4.73,5.72,2.21,4.81,1.74,4.93,4.74,5.81,3.88,3.03,4.72,5.79,3.43,4.07,5.93,2.26,3.70,5.32,4.56,1.52,2.54,0.26,2.79,3.67,4.44,1.46,4.26,4.49,5.29,3.26,3.87,3.12,3.97,3.49,0.45,0.76,4.49,5.29,1.94,4.69,2.80,2.75,5.16,0.74,5.81,1.46,5.24,4.00)
ap <- cbind(a, p)
cov.rob(ap, cor = TRUE)
Please do read the help page, which explains that this is a random algorithm. In your example you can try cov.rob(ap, cor = TRUE, nsamp = "exact").
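[To see why nsamp = "exact" settles the run-to-run variation: the default cov.rob search draws random subsets, while "exact" enumerates every subset and is therefore deterministic (feasible only for small n). A sketch on simulated data, using MASS, where cov.rob lives in current R:]

```r
## Sketch: the default cov.rob search is stochastic, so repeated runs can
## differ; nsamp = "exact" enumerates all subsets (feasible only for small
## n) and gives the same answer every time. Simulated data, MASS package.
library(MASS)
set.seed(1)
ap <- cbind(a = rnorm(20), p = rnorm(20))

r1 <- cov.rob(ap, cor = TRUE, nsamp = "exact")$cor[1, 2]
r2 <- cov.rob(ap, cor = TRUE, nsamp = "exact")$cor[1, 2]
identical(r1, r2)   # TRUE: no randomness left in the fit
```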