some thoughts on outlier detection, need help!
I'm not certain what you are asking. PLEASE do read the posting guide, http://www.R-project.org/posting-guide.html. If you formulate your question in terms of a simple example, showing where you got stuck as the posting guide suggests, it might help others understand your question and inspire suggestions.

TINSTAFL = There is no such thing as a free lunch (Heinlein, The Moon is a Harsh Mistress)

spencer graves
Weiwei Shi wrote:
Dear listers:

I have an idea for outlier detection, and I need to implement it in R first. I hope I can get some input from all the gurus here. I chose a distance-based approach:

step 1: Calculate the distance between every pair of rows of a data frame. To account for the different scales of the variables, I chose the Mahalanobis distance, using the variance as the scaler.

step 2: Let k be the number of points in one "cluster". k is decided by answering the following question: how many neighbors does a point need in order not to be an outlier? For each point, take its smallest (k-1) distances from step 1. Among those (k-1) distances, take the maximum for that point.

step 3: Get the distribution of those maxima over all the points. The multivariate problem thus becomes a univariate one, and an outlier among those maxima marks the corresponding point as an outlier.

My questions are:

1. I don't know whether using Mahalanobis is proper, since most clustering algorithms implemented in R (like pam or clara) use Euclidean or Manhattan distance.

2. Is there a way to get the Mahalanobis distance matrix for every pair of rows of a data frame or matrix?

3. My approach does allow a point to belong to more than one k-cluster. Is there a similar algorithm in R, or a published one?

Thanks for any suggestions,
weiwei
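For what it's worth, steps 1 to 3 above can be sketched in base R roughly as follows. This is only a minimal sketch under stated assumptions, not a tested method: the function names pairwise_mahalanobis and kth_neighbor_distance are made up for illustration, the full sample covariance is assumed for the scaling (pass S = diag(apply(x, 2, var)) to scale by the variances only), and the boxplot-style cutoff in the last step is just one possible univariate rule, which the original post does not specify.

```r
# Step 1: pairwise Mahalanobis distances between all rows of a numeric matrix.
# Assumption: S (by default the sample covariance) is positive definite.
pairwise_mahalanobis <- function(x, S = cov(x)) {
  x <- as.matrix(x)
  # With S = R'R (Cholesky factor R), the transformed rows z = x R^{-1}
  # have Euclidean distances equal to the Mahalanobis distances of x.
  R <- chol(S)
  z <- x %*% solve(R)
  as.matrix(dist(z))
}

# Step 2: for each point, the largest of its (k - 1) smallest distances
# to the other points.
kth_neighbor_distance <- function(d, k) {
  stopifnot(k >= 2, k <= nrow(d))
  # After sorting a row, position 1 is the point's zero distance to itself,
  # so position k holds the distance to its (k - 1)-th nearest neighbor.
  apply(d, 1, function(row) sort(row)[k])
}

# Toy data (made up): 100 bivariate normal points plus one distant point.
set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2), c(8, 8))

d <- pairwise_mahalanobis(x)
score <- kth_neighbor_distance(d, k = 5)

# Step 3: treat the scores as a univariate sample. The cutoff below
# (upper quartile + 1.5 * IQR) is an assumption, not part of the proposal.
cutoff <- quantile(score, 0.75) + 1.5 * IQR(score)
flagged <- which(score > cutoff)
print(flagged)
```

Note that mahalanobis() in the stats package returns squared distances from a single center, so it does not give the pairwise matrix directly; the whitening step plus dist() is one way around that. The full n-by-n matrix also becomes expensive in memory for large data frames.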
Spencer Graves, PhD
Senior Development Engineer
PDF Solutions, Inc.
333 West San Carlos Street Suite 700
San Jose, CA 95110, USA
spencer.graves at pdf.com
www.pdf.com
Tel: 408-938-4420
Fax: 408-280-7915