Clustering quality measure
Hi,
"Jonck" == Jonck van der Kogel <jonck at vanderkogel.net>
on Tue, 17 Jun 2003 17:23:33 +0200 writes:
Jonck> Hi all, I am running a series of experiments where
Jonck> after manipulating my data I run several clustering
Jonck> algorithms (agnes, diana and a clustering method of
Jonck> my own) on the data. I wanted to determine which
Jonck> clustering method did the best job, so therefore I
Jonck> had defined my own quality measure using two
Jonck> criteria: compactness of the data within the clusters
Jonck> themselves and the amount of separation between the
Jonck> clusters. Anyway, my quality measure does not work,
Jonck> since according to my quality measure the quality
Jonck> gets increasingly better as more clusters are formed
Jonck> until every data instance is a cluster by itself.
Jonck> Therefore I was wondering if any of you are aware of
Jonck> any libraries or functions within R that determine
Jonck> quality measures of clusterings, I am very much
Jonck> intrigued by the definition of quality measures that
Jonck> do work. Thanks very much, Jonck
This sounds a bit like the ratio of within-cluster variation to between-cluster variation. Similar measures arise as negative log-likelihoods in certain clustering methods based on the normal distribution. Of course they get better with more clusters, because there are more degrees of freedom for the fit.

A common strategy is to penalize the negative log-likelihood by an increasing function of the number of degrees of freedom. This is implemented as BIC (Bayesian Information Criterion) for various normal mixture models in the package mclust, where it is used to decide on the best model (number of clusters, covariance matrix parametrization). In principle, you could compute the BIC, given a certain covariance matrix parametrization, for every partition from an arbitrary clustering.

Note, however, that this, like every quality measure for clustering, implies a particular concept of what a cluster is. If you define a cluster as "looking like a mixture component in a normal mixture", then this is OK, but very likely you will then get the "best" clustering using a method which performs estimation in a normal mixture model. If you have a different concept of a cluster and you formalize it via a quality criterion, you will get the best clustering by optimizing *this* quality criterion (apart, maybe, from numerical problems).

The important point is that no quality criterion for clustering provides an independent, objective decision about what the best clustering is. The choice of an adequate quality criterion is as difficult and subjective as the choice of the best clustering method.

Best,
Christian
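
PS: To make the penalization idea concrete, here is a minimal sketch in plain Python (not R, and not how mclust computes its BIC). It assumes one-dimensional data, a hypothetical stand-in clustering method that cuts the sorted data at the largest gaps, and a classification log-likelihood for normal clusters with a common variance. All function names are made up for illustration, and the score is written so that smaller is better.

```python
import math


def within_ss(clusters):
    """Sum of squared deviations of each point from its own cluster mean."""
    total = 0.0
    for pts in clusters:
        mean = sum(pts) / len(pts)
        total += sum((x - mean) ** 2 for x in pts)
    return total


def cut_at_largest_gaps(data, k):
    """Hypothetical stand-in for a clustering method (1-D only):
    sort the data and cut at the k-1 largest gaps between neighbours."""
    xs = sorted(data)
    # Index i stands for a cut between xs[i-1] and xs[i].
    by_gap = sorted(range(1, len(xs)),
                    key=lambda i: xs[i] - xs[i - 1], reverse=True)
    bounds = [0] + sorted(by_gap[:k - 1]) + [len(xs)]
    return [xs[a:b] for a, b in zip(bounds, bounds[1:])]


def penalized_score(clusters):
    """BIC-like score (smaller is better) for a hard partition.

    Assumption: classification log-likelihood of normal clusters with
    one common variance, estimated as within_ss / n.
    """
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    w = within_ss(clusters)
    if w == 0.0:
        raise ValueError("zero within-cluster variance: likelihood is degenerate")
    loglik = sum(len(c) * math.log(len(c) / n) for c in clusters)
    loglik -= 0.5 * n * (math.log(2 * math.pi * w / n) + 1)
    n_params = 2 * k  # k means, k-1 proportions, 1 common variance
    return -2 * loglik + n_params * math.log(n)


# Toy data: three well-separated groups of five points each.
data = [1.0, 1.2, 0.8, 1.1, 0.9,
        5.0, 5.2, 4.8, 5.1, 4.9,
        9.0, 9.2, 8.8, 9.1, 8.9]

# Unpenalized measure for k = 1, ..., n clusters.
raw = [within_ss(cut_at_largest_gaps(data, k))
       for k in range(1, len(data) + 1)]

# Penalized measure for a few candidate numbers of clusters.
scores = {k: penalized_score(cut_at_largest_gaps(data, k))
          for k in range(1, 7)}
best_k = min(scores, key=scores.get)

print("raw within-cluster SS:", [round(w, 3) for w in raw])
print("penalized scores:", {k: round(s, 2) for k, s in scores.items()})
print("best k by penalized score:", best_k)
```

The raw within-cluster sum of squares keeps shrinking and reaches zero when every point is a cluster by itself, which is the behaviour described in the question. On this toy data the penalized score is smallest at three clusters, but that only reflects the normal-cluster concept built into the score.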
***********************************************************************
Christian Hennig
Seminar fuer Statistik, ETH-Zentrum (LEO), CH-8092 Zuerich (currently)
and Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at stat.math.ethz.ch, http://stat.ethz.ch/~hennig/
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag.de