Dear R-experts,
Searching the help archives I found a recommendation to do multivariate
outlier identification by computing mahalanobis distances from a robustly
estimated covariance matrix and comparing the resulting distances to a
chi^2 distribution with p (the number of variables) degrees of freedom. I
understand that, compared to euclidean distances, this has the advantage of
being scale-invariant.
However, it seems that such mahalanobis distances are not invariant to
redundancies: adding a highly collinear variable changes the mahalanobis distances
(see code below). Isn't the comparison to chi^2 also assuming that all
variables are independent?
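(For reference, the recipe as I understand it would look something like
this on simulated data; the 97.5% chi^2 cutoff is just one common choice:)
library(MASS)
set.seed(1)
X <- matrix(rnorm(300*3), ncol=3)
cr <- cov.rob(X, method="mcd")            # robust center and covariance estimate
md2 <- mahalanobis(X, cr$center, cr$cov)  # squared mahalanobis distances
which(md2 > qchisq(0.975, df=ncol(X)))    # flag candidate outliers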
Can anyone recommend a procedure to calculate distances and identify
multivariate outliers which is invariant to the degree of collinearity?
Thanks for any advice
Jens Oehlschlägel
# Example code
library(MASS)
# generate bivariate normal test data
set.seed(1)  # for reproducibility
n <- 500
x <- matrix(rnorm(n*2), ncol=2)
# scale columns, otherwise euclidean distances depend on measurement units
x <- scale(x)
cr <- cov.rob(x, method="mcd")
center <- cr$center
# calculate squared euclidean and mahalanobis
d <- rowSums(t(t(x)-center)^2)
m <- as.vector(mahalanobis(x, center, cr$cov))
# euclidean and mahalanobis basically coincide; mahalanobis is slightly
# biased by robust covariance underestimation
eqscplot(x=d, y=m); abline(0,1)
# Now I add a highly redundant column in the hope that the distances
# between cases will not change
x2 <- cbind(x, x[,1]+rnorm(n, sd=0.01))
# scale columns, otherwise euclidean distances depend on measurement units
x2 <- scale(x2)
cr2 <- cov.rob(x2, method="mcd")
center2 <- cr2$center
d2 <- rowSums(t(t(x2)-center2)^2)
m2 <- as.vector(mahalanobis(x2, center2, cr2$cov))
# though equally scaled, euclidean and mahalanobis diverge
eqscplot(x=d2, y=m2); abline(0,1)
# mahalanobis distances are obviously not redundancy invariant
eqscplot(x=m, y=m2); abline(0,1)
# especially if rank order of distances is considered
eqscplot(x=rank(m), y=rank(m2)); abline(0,1)
cor(m, m2)
cor(m, m2, method="spearman")
# euclidean distances look better but are also not redundancy invariant
eqscplot(x=d, y=d2); abline(0,1)
eqscplot(x=rank(d), y=rank(d2)); abline(0,1)
cor(d, d2)
cor(d, d2, method="spearman")
Your extra column is not redundant: it adds an extra column of
information, and outliers in that column after removing the effects of the
other columns are still multivariate outliers.
Effectively you have added one more dimension to the sphered point cloud,
and mahalanobis distance is Euclidean distance after sphering.
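(A quick numerical check of that last statement, sketched with the
classical estimates for simplicity:)
set.seed(42)
y <- matrix(rnorm(1000), ncol=2)
S <- cov(y)
mu <- colMeans(y)
R <- chol(S)                        # S = t(R) %*% R
z <- sweep(y, 2, mu) %*% solve(R)   # sphered data: cov(z) is the identity up to rounding
all.equal(rowSums(z^2), mahalanobis(y, mu, S))  # TRUE: squared euclidean after sphering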
On Wed, 21 Jan 2004, "Jens Oehlschlägel" wrote:
> [...]
> Isn't the comparison to chi^2 also assuming that all variables are
> independent?
No. It assumes that *after sphering* all variables are independent, which
is true by definition for a joint normal distribution.
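(A small simulation sketch of this point: even for a strongly correlated
bivariate normal, the squared mahalanobis distances still follow,
approximately, a chi^2 distribution with p = 2 degrees of freedom:)
library(MASS)
set.seed(1)
Sigma <- matrix(c(1, 0.95, 0.95, 1), 2, 2)   # strongly correlated pair
y <- mvrnorm(5000, mu=c(0, 0), Sigma=Sigma)
m <- mahalanobis(y, colMeans(y), cov(y))
qqplot(qchisq(ppoints(5000), df=2), m)       # points hug the chi^2 reference line
abline(0, 1)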
> Can anyone recommend a procedure to calculate distances and identify
> multivariate outliers which is invariant to the degree of collinearity?
I don't think that makes any sense, given what is usually meant by
`multivariate outliers', outliers in any direction in the point cloud.
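(To illustrate on simulated data: a point can be unremarkable in every
single coordinate and still be a clear multivariate outlier because it
lies against the correlation structure. A sketch:)
library(MASS)
set.seed(2)
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
y <- mvrnorm(500, mu=c(0, 0), Sigma=Sigma)
y <- rbind(y, c(2, -2))                  # only about 2 sd in each coordinate
m <- mahalanobis(y, colMeans(y), cov(y))
which.max(m)                             # 501: the added point is the most extreme
m[501] > qchisq(0.999, df=2)             # TRUE: far beyond the chi^2 cutoff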
[...]
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595