Hello,

I need to analyse a data matrix with dimensions 30x100. Before analysing the data, however, I need to remove outliers. I have read quite a lot about outlier removal already, and the most common technique seems to be Principal Component Analysis (PCA). However, I think this technique is quite subjective. When is an outlier an outlier? I uploaded an example PCA plot here: http://s14.postimage.org/oknyya1ld/pca.png

Should we already treat the green and red dots as outliers, or only the blue one, which lies outside the 95% confidence interval? It seems very arbitrary how people remove outliers using PCA.

I also thought about fitting a linear model to my data and looking at the distribution of the residuals. The problem with linear models, however, is that one can never be sure that the model used is the one that describes the data best. Under model A, for instance, we might treat sample 1 as an outlier, but under a different model B, sample 1 might not be an outlier at all.

I had a brief look at k-means clustering as well, but I don't think it is the right approach. Again, how does one decide which cluster is an outlier? It is also known that different cluster analyses can lead to totally different results. So which one to choose?

Is there any other way to remove outliers from data non-subjectively? I would really appreciate any ideas/comments you might have on this topic.

Cheers

-- View this message in context: http://r.789695.n4.nabble.com/Outlier-removal-techniques-tp4372652p4372652.html Sent from the R help mailing list archive at Nabble.com.
Outlier removal techniques
5 messages · mails, Rich Shepard, Frank E Harrell Jr +2 more
On Thu, 9 Feb 2012, mails wrote:
I need to analyse a data matrix with dimensions of 30x100. Before analysing the data there is, however, a need to remove outliers from the data. I read quite a lot about outlier removal already and I think the most common technique for that seems to be Principal Component Analysis (PCA). However, I think that this technique is quite subjective. When is an outlier an outlier? I uploaded an example PCA plot here:

Those more expert than I will certainly provide answers. What I do with new data is create box-and-whisker plots (I use the lattice package), which flag as outliers those data lying more than 1.5 times the interquartile range beyond the first or third quartile.

No one but you can answer your question on when an outlier is an outlier. It depends on your data set and the context of the data. For example, a water chemistry value that far exceeds a regulatory threshold might be meaningful in the context of a one-off excursion (in which case it's not an outlier but a real data point), or it might result from a handling, instrumentation, or analytical error (in which case toss it as an outlier).

Rich
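The box-and-whisker rule mentioned above can be sketched in a few lines. This is only an illustration, written in Python rather than R, and it assumes the usual convention of fences at 1.5 times the interquartile range beyond the quartiles. The function name `boxplot_outliers` and the readings are made up for the example. Quartile conventions differ between packages, so the fences here may not match a given plotting function exactly.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    # "inclusive" is one of several quartile conventions; plotting
    # packages may use a different one and give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings, not data from this thread.
readings = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.4, 12.5]
flagged = boxplot_outliers(readings)
print(flagged)
```

The rule only flags values for inspection. Whether a flagged value is an error or a real excursion still depends on the context of the data.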
I wonder why it is still standard practice in some circles to search for "outliers" as opposed to using robust/resistant methods.
Here is a great paper with a scientific approach to "outliers":
@Article{fin06cal,
  author = {Finney, David J.},
  title = {Calibration guidelines challenge outlier practices},
  journal = {The American Statistician},
  year = 2006,
  volume = 60,
  pages = {309--313},
  annote = {anticoagulant therapy;bias;causation;ethics;objectivity;outliers;guidelines for treatment of outliers;overview of types of outliers;letter to the editor and reply 61:187 May 2007}
}
Frank
______________________________________________ R-help@ mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
----- Frank Harrell, Department of Biostatistics, Vanderbilt University
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Frank Harrell
Sent: Thursday, February 09, 2012 9:19 AM
To: r-help at r-project.org
Subject: Re: [R] Outlier removal techniques
I wonder why it is still standard practice in some circles to search for "outliers" as opposed to using robust/resistant methods.
I would echo what Frank says. I would also add that in the absence of demonstrated measurement/recording errors, there is good reason to "explain" the extreme values as well as the typical values. If a model can't deal with extreme values, then it may be good enough for some purposes, but it is not a "complete" explanation and may fail at the worst time. I would highly recommend the book "The Black Swan" by Nassim Nicholas Taleb (NOT the ballet story).

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
-----Original Message-----
I wonder why it is still standard practice in some circles to search for "outliers" as opposed to using robust/resistant methods.
At the risk of extending an old debate and driving us off list topic, here are four possible reasons:
i) Identifying outliers is important when you want to find possible mistakes in measurement or data entry. So irrespective of whether you use robust methods, you probably want to ask questions like 'why has that result been entered as almost exactly 1000 times the value I expected?' (typically a unit error, btw). And although graphical outlier checking is the obvious way to do that, eyeballs see oddity in chance; an outlier test can help you distinguish oddity from chance and save some (arguably) unnecessary follow-up.
ii) Because supervised outlier rejection at around the 99% level performs, for simple problems, about as well as Huber's estimator with c set to 1.5, and is a lot easier to explain to, er, people who don't understand iterative numerical methods.
iii) Because it's written into some international Standards for statistical processing of data (i.e., it's standard practice because it's Standard practice).
iv) Because you can't do robust analysis in Excel*
Not that all these are necessarily _good_ reasons ... ;-)
However, I do NOT understand why schools in the UK teach physics students that outliers should automatically and always be thrown out; that's a much larger leap.
*You can actually; with R or several add-ins. But that is off topic.
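For readers who have not met the Huber estimator with c = 1.5 mentioned in point ii), here is a rough sketch of one common form of it: a location estimate that pulls extreme values back to within c robust standard deviations of the current estimate and then averages. It is written in Python purely for illustration. The function name `huber_location`, the choice to hold the scale fixed at the normalised MAD, and the data (including the 1000-fold entry echoing point i)) are assumptions of this example, not anything specified in the thread. R users would more likely use an existing implementation such as huber() in the MASS package.

```python
import statistics


def huber_location(values, c=1.5, tol=1e-8, max_iter=100):
    """Huber M-estimate of location, with scale fixed at the normalised MAD."""
    mu = statistics.median(values)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    scale = 1.4826 * statistics.median([abs(v - mu) for v in values])
    if scale == 0:
        return mu  # over half the values are identical; nothing to iterate on
    for _ in range(max_iter):
        lo, hi = mu - c * scale, mu + c * scale
        # Winsorise: clip each value to [lo, hi], then average the result.
        new_mu = statistics.fmean(min(max(v, lo), hi) for v in values)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Hypothetical measurements with one value entered about 1000 times too large.
values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10000.0]
estimate = huber_location(values)
plain_mean = statistics.fmean(values)
print(estimate, plain_mean)
```

In this made-up case the plain mean is dominated by the single bad entry, while the Huber estimate stays close to the bulk of the data without anyone having to decide whether to delete the value.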