outlier
9 messages · kan Liu, Spencer Graves, Duncan Murdoch +1 more
On Tue, 17 Jun 2003, kan Liu wrote:
I want to calculate the R-squared between two variables. Can you advise me on how to identify and remove outliers before performing the R-squared calculation?
Easy: you don't. It makes no sense to consider R^2 after arbitrary outlier removal: if I remove all but two points I get R^2 = 1! R^2 is normally used to measure the success of a multiple regression, but as you mention two variables, did you just mean the Pearson product-moment correlation? It makes more sense to use a robust measure of correlation, as in cov.rob (package lqs), or even the Spearman or Kendall measures (cor.test in package ctest). If you intended to do this for a multiple regression, you need to do some sort of robust regression and use a robust measure of fit.
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
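[A minimal sketch of the comparison Ripley suggests, on simulated data. In current R, cov.rob lives in package MASS (the old lqs package was merged into it); the data and the planted outlier below are made up for illustration.]

```r
## Sketch: classical vs robust correlation in the presence of one outlier.
## cov.rob is in package MASS in current R (it was in lqs when this thread
## was written); the data below are simulated, for illustration only.
library(MASS)

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.3)
x[1] <- 10                                   # plant a single gross outlier
y[1] <- -10

cor(x, y)                                    # Pearson: distorted by one point
cov.rob(cbind(x, y), cor = TRUE)$cor[1, 2]   # robust: close to the truth
cor(x, y, method = "spearman")               # rank-based alternative
```

Note that the robust estimate compensates for the outlier rather than requiring anyone to decide which points to delete.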
It is also wise to make scatterplots, as shown by the famous example of 4 scatterplots with the same R^2, where the first shows the standard ellipsoid pattern implied by the assumptions while the other three indicate very clearly that the assumptions are incorrect. See Anscombe (1973), "Graphs in Statistical Analysis", The American Statistician, 27: 17-22, reproduced in, e.g., du Toit, Steyn and Stumpf (1986), Graphical Exploratory Data Analysis (Springer). hth. spencer graves
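[The quartet Graves describes ships with R as the built-in `anscombe` data frame, so the point is easy to check directly:]

```r
## Sketch: Anscombe's quartet is the built-in 'anscombe' data frame in R.
## All four x-y pairs have essentially the same correlation, even though
## their scatterplots look completely different.
data(anscombe)
r <- sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
round(r, 3)        # all approximately 0.816
```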
Hi, many thanks for your advice; I appreciate it very much. Maybe I can make the question clearer: I want to evaluate the correlation between two variables: one is the actual output of a system, the other is the predicted value of that output from a neural network. When I made scatterplots in Excel, I could get the linear equation and the corresponding R-squared. At the bottom of the page http://www.statsoftinc.com/textbook/stathome.html, it is mentioned that outliers can sometimes bias the correlation coefficient. So I thought it might be worth removing outliers before calculating R-squared in R. That seems to be a bad idea according to your comments. Now can you comment on how to evaluate the performance of the neural network model in predicting the actual outputs? Kan
On Tue, 17 Jun 2003, kan Liu wrote:
Hi, many thanks for your advice; I appreciate it very much. Maybe I can make the question clearer: I want to evaluate the correlation between two variables: one is the actual output of a system, the other is the predicted value of that output from a neural network. When I made scatterplots in Excel, I could get the linear equation and the corresponding R-squared. At the bottom of the page http://www.statsoftinc.com/textbook/stathome.html, it is mentioned that outliers can sometimes bias the correlation coefficient. So I thought it might be worth removing outliers before calculating R-squared in R. That seems to be a bad idea according to your comments.
Yes. That's the whole point of robust methods: compensate rather than reject.
Now can you comment on how to evaluate the performance of the neural network model in predicting the actual outputs?
If you are interested in correlation coefficients, use cov.rob. However, this is predicted vs actual, and you probably do want to penalize bad predictions, not reject them. It's up to you to choose a suitable loss function for your application. In particular, if the predicted values were always 1e-45 times the actual values minus 1e310, the correlation would be one and the predictions would be derisory.
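[Ripley's warning is easy to demonstrate: correlation is invariant to linear rescaling, so it can be perfect while the predictions are useless. A toy sketch, with made-up numbers (kept finite, unlike the 1e310 hyperbole above):]

```r
## Sketch: correlation ignores scale and offset, so it can be 1 while the
## predictions themselves are useless; a loss such as RMSE is not fooled.
## Toy numbers, for illustration only.
actual <- c(1, 2, 3, 4, 5)
pred   <- 0.001 * actual - 100    # exactly linear in 'actual', wildly wrong
cor(actual, pred)                 # 1: the linear distortion is invisible
sqrt(mean((actual - pred)^2))     # RMSE makes the failure obvious
```

This is why a loss function chosen for the application (RMSE, mean absolute error, or similar) is a better report of predictive performance than a correlation coefficient.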
Which R environment variable can be used to point to x1.R so that it can be sourced from any directory? Kan
On Wed, 18 Jun 2003 11:09:41 +0100 (BST), you wrote:
Which R environment variable can be used to point to x1.R so that it can be sourced from any directory?
I don't think there is such a thing. source() looks in the current working directory; it doesn't use environment variables to do a wider search.
Of course, you can always make up your own variable, and then do something like

source(paste(Sys.getenv('MYDIR'), '/x1.R', sep = ''))

and that will work from anywhere if MYDIR is defined properly.
Duncan Murdoch
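[A self-contained sketch of Murdoch's suggestion, using file.path instead of paste for the path construction. MYDIR is an arbitrary variable name; Sys.setenv and the temporary file are used here only so the example runs on its own, where normally MYDIR would be set in the shell before starting R.]

```r
## Sketch: source a file located via a user-defined environment variable.
## Sys.setenv and tempdir() make the example self-contained; in practice
## MYDIR would be exported in the shell before R starts.
Sys.setenv(MYDIR = tempdir())
writeLines("x1_result <- 42", file.path(Sys.getenv("MYDIR"), "x1.R"))

source(file.path(Sys.getenv("MYDIR"), "x1.R"))  # works from any directory
x1_result
```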
I wrote a .R file (see below) to calculate a robust measure of correlation using cov.rob. I got different correlation coefficients (0.70, 0.79, 0.63, ...) when I ran the file at different times. Can you tell me what this means, or what is wrong in my use of cov.rob?
-------------
library(lqs)
a <- c(5.41,4.67,5.88,2.38,4.79,5.30,1.94,3.40,5.05,3.31,5.88,4.92,5.08,4.58,4.59,4.77,5.25,3.77,2.88,5.30,5.32,2.56,4.29,5.54,4.53,3.51,4.93,2.49,2.85,5.04,2.51,2.60,3.58,2.11,1.70,5.20,5.08,4.48,3.96,4.87,4.98,2.56,1.69,4.28,1.70,2.91,5.37,2.16,3.04,1.69,1.88,5.36,1.70,3.81,1.70,5.88,3.52)
p <- c(5.30,4.78,4.79,0.62,4.32,2.33,0.64,3.14,3.06,4.73,5.72,2.21,4.81,1.74,4.93,4.74,5.81,3.88,3.03,4.72,5.79,3.43,4.07,5.93,2.26,3.70,5.32,4.56,1.52,2.54,0.26,2.79,3.67,4.44,1.46,4.26,4.49,5.29,3.26,3.87,3.12,3.97,3.49,0.45,0.76,4.49,5.29,1.94,4.69,2.80,2.75,5.16,0.74,5.81,1.46,5.24,4.00)
ap <- cbind(a, p)
cov.rob(ap, cor = TRUE)
Please do read the help page, which explains that this is a random algorithm. In your example you can try cov.rob(ap, cor = TRUE, nsamp = "exact").
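[To see why nsamp = "exact" settles the run-to-run variation: the default cov.rob search draws random subsets, while "exact" enumerates every subset and is therefore deterministic (feasible only for small n). A sketch on simulated data, using MASS, where cov.rob lives in current R:]

```r
## Sketch: the default cov.rob search is stochastic, so repeated runs can
## differ; nsamp = "exact" enumerates all subsets (feasible only for small
## n) and gives the same answer every time. Simulated data, MASS package.
library(MASS)
set.seed(1)
ap <- cbind(a = rnorm(20), p = rnorm(20))

r1 <- cov.rob(ap, cor = TRUE, nsamp = "exact")$cor[1, 2]
r2 <- cov.rob(ap, cor = TRUE, nsamp = "exact")$cor[1, 2]
identical(r1, r2)   # TRUE: no randomness left in the fit
```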