Outlier removal techniques

5 messages · mails, Rich Shepard, Frank E Harrell Jr +2 more

#
Hello,


I need to analyse a data matrix with dimensions of 30x100.
Before analysing the data there is, however, a need to remove outliers from
the data.
I read quite a lot about outlier removal already and I think the most common
technique for that seems to be Principal Component Analysis (PCA). However,
I think that this technique is quite subjective. When is an outlier an
outlier?
I uploaded an example PCA plot here: 

http://s14.postimage.org/oknyya1ld/pca.png

Should we treat the green and red dots as outliers already, or only the blue
one, which lies outside the 95% confidence interval? It seems very arbitrary
how people remove outliers using PCA.

I also thought about fitting a linear model to my data and looking at the
distribution of the residuals.
However, the problem with using linear models is that one can never actually
be sure that the model used is the one that describes the data best. Under
model A, for instance, we might treat sample 1 as an outlier, but under a
different model B sample 1 might not be an outlier at all.

I had a brief look at k-means clustering as well, but I think it's not the
right thing to go for.
Again, how does one decide which cluster is an outlier? It is also known that
different cluster analyses can lead to totally different results. So which one
to choose?


Is there any other way to non-subjectively remove outliers from data?
I would really appreciate any ideas/comments you might have on that topic.


Cheers

--
View this message in context: http://r.789695.n4.nabble.com/Outlier-removal-techniques-tp4372652p4372652.html
Sent from the R help mailing list archive at Nabble.com.
#
On Thu, 9 Feb 2012, mails wrote:

            
Those more expert than I will certainly provide answers. What I do with
new data is create box-and-whisker plots (I use the lattice package), which
flag as outliers those data lying more than 1.5 times the interquartile range
below the first quartile or above the third quartile.
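
For readers who want to see the boxplot rule as arithmetic, here is a minimal sketch in Python (standard library only). It assumes the usual convention of fences at 1.5 times the interquartile range beyond the quartiles. The function name `tukey_fences` and the data are invented for illustration, and quartile conventions differ between packages, so the fences may not match a given plotting function exactly.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" is one of several quartile conventions; others give
    # slightly different fences on small samples.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical measurements; 9.8 is an invented extreme value.
data = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.8]

lower, upper = tukey_fences(data)
flagged = [x for x in data if x < lower or x > upper]
print(f"fences: ({lower:.2f}, {upper:.2f}); flagged: {flagged}")
```

The rule only flags points for a closer look. Whether a flagged value is an error or a real observation still depends on the context of the data.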

   No one but you can answer your question on when an outlier is an outlier.
It depends on your data set and the context of the data. For example, a
water chemistry value that far exceeds a regulatory threshold might be
meaningful in the context of a one-off excursion (in which case it's not an
outlier but a real data point) or it might result from a handling,
instrumentation, or analytical error (in which case toss it as an outlier).

Rich
#
I wonder why it is still standard practice in some circles to search for
"outliers" as opposed to using robust/resistant methods.  
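
To illustrate what "resistant" means here, a small Python sketch (standard library only) compares the mean and standard deviation with the median and the scaled median absolute deviation (MAD) when one value is grossly wrong. The data are invented, and the helper name `scaled_mad` and the 1.4826 consistency factor for normal data are choices made for this example, not anything prescribed in the thread.

```python
import statistics

def scaled_mad(values, scale=1.4826):
    """Median absolute deviation, scaled to be comparable to the SD for normal data."""
    med = statistics.median(values)
    return scale * statistics.median([abs(x - med) for x in values])

# Hypothetical readings near 10; the last one simulates a decimal-point slip.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
contaminated = clean[:-1] + [103.0]

for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        f"{label:13s} mean={statistics.fmean(data):7.3f} "
        f"sd={statistics.stdev(data):7.3f} "
        f"median={statistics.median(data):7.3f} "
        f"mad={scaled_mad(data):6.3f}"
    )
```

The mean and standard deviation move a long way because of the single bad value, while the median and MAD barely change, so no point has to be declared an outlier before summarising the data.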

Here is a great paper with a scientific approach to "outliers":

@Article{fin06cal,
  author = 		 {Finney, David J.},
  title = 		 {Calibration guidelines challenge outlier practices},
  journal = 	 {The American Statistician},
  year = 		 2006,
  volume =		 60,
  pages =		 {309-313},
  annote =		 {anticoagulant
therapy;bias;causation;ethics;objectivity;outliers;guidelines for
treatment of outliers;overview of types of outliers;letter to the editor and
reply 61:187 May 2007}
}

Frank

-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
#
I would echo what Frank says.  I would also add that in the absence of demonstrated measurement/recording errors, there is good reason to "explain" the extreme values as well as the  typical values.  If a model can't deal with extreme values, then it may be good enough for some purposes, but it is not a "complete" explanation and may fail at the worst time.  I would highly recommend the book "The Black Swan" by Nassim Nicholas Taleb (NOT the ballet story).


Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
#
At the risk of extending an old debate and driving us off list topic, here are four possible reasons:
i) Identifying outliers is important when you want to find possible mistakes in measurement or data entry - so irrespective of whether you use robust methods, you probably want to ask questions like 'why has that result been entered as almost exactly 1000 times the value I expected?' (typically a unit error, btw). And although graphical outlier checking is the obvious way to do that, eyeballs see oddity in chance; an outlier test can help you distinguish oddity from chance and save some (arguably) unnecessary follow-up.

ii) Because supervised outlier rejection at around the 99% level performs - for simple problems - about as well as Huber's estimator with c set to 1.5 and is a lot easier to explain to, er, people who don't understand iterative numerical methods.

iii) Because it's written into some international Standards for statistical processing of data (i.e., it's standard practice because it's Standard practice).

iv) because you can't do robust analysis in Excel* 

Not that all these are necessarily _good_ reasons ... ;-)
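
As a rough illustration of reason ii), the Python sketch below (standard library only) puts a Huber-type location estimate with c = 1.5 next to a plain mean taken after rejecting points at roughly the two-sided 99% level. Several things here are assumptions made for the example: the scale is fixed at the scaled MAD, the rejection step is an automatic stand-in for the supervised rejection described above, the cut-off 2.576 is the normal 99% quantile, and the data and function names are invented.

```python
import statistics

def robust_scale(values):
    """Scaled MAD, used here as a fixed scale estimate (an assumption of this sketch)."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(x - med) for x in values])

def huber_location(values, c=1.5, tol=1e-9, max_iter=100):
    """Huber M-estimate of location by iterative reweighting, scale held fixed."""
    mu = statistics.median(values)
    scale = robust_scale(values)
    if scale == 0:
        return mu
    for _ in range(max_iter):
        # Weight 1 within c*scale of the current estimate, c*scale/|residual| beyond it.
        weights = [
            1.0 if abs(x - mu) <= c * scale else c * scale / abs(x - mu)
            for x in values
        ]
        new_mu = sum(w * x for w, x in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

def mean_after_rejection(values, z=2.576):
    """Mean of the points within z robust SDs of the median (about the 99% level)."""
    med = statistics.median(values)
    scale = robust_scale(values)
    kept = [x for x in values if abs(x - med) <= z * scale]
    return statistics.fmean(kept)

# Hypothetical readings near 10 with one invented gross error.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 103.0]

raw_mean = statistics.fmean(data)
huber = huber_location(data)
rejected = mean_after_rejection(data)
print(f"raw mean={raw_mean:.3f} huber={huber:.3f} after rejection={rejected:.3f}")
```

On this toy data both approaches land close to 10 while the raw mean does not. That shows the comparison for one simple case only and says nothing about harder problems.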

However, I do NOT understand why schools in the UK teach physics students that outliers should automatically and always be thrown out; that's a much larger leap.

*You can actually; with R or several add-ins. But that is off topic.