Dear R-experts,
Searching the help archives I found a recommendation to do multivariate
outlier identification by computing mahalanobis distances from a robustly
estimated covariance matrix and comparing the resulting distances to a
chi^2 distribution with p (the number of variables) degrees of freedom. I
understand that, compared to euclidean distances, this has the advantage of
being scale-invariant.
However, it seems that such mahalanobis distances are not invariant to
redundancies: adding a highly collinear variable changes the mahalanobis distances
(see code below). Isn't the comparison to chi^2 also assuming that all
variables are independent?
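(For reference, the recipe as I understand it would look something like
this on simulated data; the 97.5% chi^2 cutoff is just one common choice:)
library(MASS)
set.seed(1)
X <- matrix(rnorm(300*3), ncol=3)
cr <- cov.rob(X, method="mcd")            # robust center and covariance estimate
md2 <- mahalanobis(X, cr$center, cr$cov)  # squared mahalanobis distances
which(md2 > qchisq(0.975, df=ncol(X)))    # flag candidate outliers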
Can anyone recommend a procedure to calculate distances and identify
multivariate outliers which is invariant to the degree of collinearity?
Thanks for any advice
Jens Oehlschlägel
# Example code
library(MASS)
# generate bivariate normal test data
set.seed(1)  # for reproducibility
n <- 500
x <- matrix(rnorm(n*2), ncol=2)
# scale columns, otherwise euclidean distances depend on measurement units
x <- scale(x)
cr <- cov.rob(x, method="mcd")
center <- cr$center
# calculate squared euclidean and mahalanobis
d <- rowSums(t(t(x)-center)^2)
m <- as.vector(mahalanobis(x, center, cr$cov))
# euclidean and mahalanobis basically coincide; mahalanobis is slightly
# biased by robust covariance underestimation
eqscplot(x=d, y=m); abline(0,1)
# Now I add a highly redundant column in the hope that the distances
# between cases will not change
x2 <- cbind(x, x[,1]+rnorm(n, sd=0.01))
# scale columns, otherwise euclidean distances depend on measurement units
x2 <- scale(x2)
cr2 <- cov.rob(x2, method="mcd")
center2 <- cr2$center
d2 <- rowSums(t(t(x2)-center2)^2)
m2 <- as.vector(mahalanobis(x2, center2, cr2$cov))
# though equally scaled, euclidean and mahalanobis diverge
eqscplot(x=d2, y=m2); abline(0,1)
# mahalanobis distances are obviously not redundancy invariant
eqscplot(x=m, y=m2); abline(0,1)
# especially if rank order of distances is considered
eqscplot(x=rank(m), y=rank(m2)); abline(0,1)
cor(m, m2)
cor(m, m2, method="spearman")
# euclidean distances look better but are also not redundancy invariant
eqscplot(x=d, y=d2); abline(0,1)
eqscplot(x=rank(d), y=rank(d2)); abline(0,1)
cor(d, d2)
cor(d, d2, method="spearman")
Your extra column is not redundant: it adds an extra column of
information, and outliers in that column after removing the effects of the
other columns are still multivariate outliers.
Effectively you have added one more dimension to the sphered point cloud,
and mahalanobis distance is Euclidean distance after sphering.
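(A quick numerical check of that last statement, sketched with the
classical estimates for simplicity:)
set.seed(42)
y <- matrix(rnorm(1000), ncol=2)
S <- cov(y)
mu <- colMeans(y)
R <- chol(S)                        # S = t(R) %*% R
z <- sweep(y, 2, mu) %*% solve(R)   # sphered data: cov(z) is the identity up to rounding
all.equal(rowSums(z^2), mahalanobis(y, mu, S))  # TRUE: squared euclidean after sphering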
On Wed, 21 Jan 2004, "Jens Oehlschlägel" wrote:
> [...]
> Isn't the comparison to chi^2 also assuming that all variables are
> independent?
No. It assumes that *after sphering* all variables are independent, which
is true by definition for a joint normal distribution.
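(A small simulation sketch of this point: even for a strongly correlated
bivariate normal, the squared mahalanobis distances still follow,
approximately, a chi^2 distribution with p = 2 degrees of freedom:)
library(MASS)
set.seed(1)
Sigma <- matrix(c(1, 0.95, 0.95, 1), 2, 2)   # strongly correlated pair
y <- mvrnorm(5000, mu=c(0, 0), Sigma=Sigma)
m <- mahalanobis(y, colMeans(y), cov(y))
qqplot(qchisq(ppoints(5000), df=2), m)       # points hug the chi^2 reference line
abline(0, 1)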
> Can anyone recommend a procedure to calculate distances and identify
> multivariate outliers which is invariant to the degree of collinearity?
I don't think that makes any sense, given what is usually meant by
`multivariate outliers', outliers in any direction in the point cloud.
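(To illustrate on simulated data: a point can be unremarkable in every
single coordinate and still be a clear multivariate outlier because it
lies against the correlation structure. A sketch:)
library(MASS)
set.seed(2)
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
y <- mvrnorm(500, mu=c(0, 0), Sigma=Sigma)
y <- rbind(y, c(2, -2))                  # only about 2 sd in each coordinate
m <- mahalanobis(y, colMeans(y), cov(y))
which.max(m)                             # 501: the added point is the most extreme
m[501] > qchisq(0.999, df=2)             # TRUE: far beyond the chi^2 cutoff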
[...]
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595