
Clustering quality measure

Hi,
This sounds a bit like the ratio of within-cluster variation to
between-cluster variation. Similar measures arise as negative
log-likelihoods in certain clustering methods based on the normal
distribution. Of course they get better with more clusters, because
there are more degrees of freedom for the fit. A common strategy is to
penalize the negative log-likelihood by an increasing function of the
number of degrees of freedom.

This is implemented as BIC (Bayesian Information Criterion) for various
normal mixture models in the R package mclust and is used there to decide
on the best model (number of clusters, covariance matrix parametrization).

In principle, you could compute the BIC, given a certain covariance matrix
parametrization, for every partition from an arbitrary clustering.
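
To make this concrete, here is a minimal sketch in plain Python, not
mclust's own code. It assumes the simplest parametrization (spherical
components sharing one variance), estimates the mixture parameters from
the given partition, and returns BIC = -2 * loglik + p * log(n), so lower
is better. As far as I know, mclust reports the opposite sign
(2 * loglik - p * log(n), larger is better). The function name
partition_bic and the toy data are made up for illustration.

```python
import math

def partition_bic(points, labels):
    """BIC of a normal mixture with spherical, equal-variance components,
    with parameters estimated from a given hard partition.
    Sign convention here: lower is better.
    Assumes the pooled within-cluster variance is strictly positive."""
    n = len(points)
    d = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)

    # Group the points by cluster label.
    members = {c: [p for p, l in zip(points, labels) if l == c]
               for c in clusters}
    # Mixing proportions and component means taken from the partition.
    props = {c: len(m) / n for c, m in members.items()}
    means = {c: [sum(col) / len(m) for col in zip(*m)]
             for c, m in members.items()}

    # Pooled within-cluster variance (one value shared by all components).
    wss = sum(sum((x - mu) ** 2 for x, mu in zip(p, means[l]))
              for p, l in zip(points, labels))
    var = wss / (n * d)

    # Mixture log-likelihood, using log-sum-exp for numerical stability.
    loglik = 0.0
    for p in points:
        terms = []
        for c in clusters:
            sq = sum((x - mu) ** 2 for x, mu in zip(p, means[c]))
            terms.append(math.log(props[c])
                         - 0.5 * d * math.log(2 * math.pi * var)
                         - sq / (2 * var))
        top = max(terms)
        loglik += top + math.log(sum(math.exp(t - top) for t in terms))

    # Free parameters: (k - 1) proportions, k * d means, 1 variance.
    n_params = (k - 1) + k * d + 1
    return -2 * loglik + n_params * math.log(n)

# Toy data: two well-separated groups of four points each.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
good = [0, 0, 0, 0, 1, 1, 1, 1]   # partition matching the two groups
bad = [0, 1, 0, 1, 0, 1, 0, 1]    # partition cutting across the groups
one = [0] * 8                     # everything in a single cluster

bic_good = partition_bic(pts, good)
bic_bad = partition_bic(pts, bad)
bic_one = partition_bic(pts, one)
```

On this toy data the partition matching the two groups gets the lowest
value of the three. A different covariance parametrization would change
both the variance estimate and the parameter count.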

Note however that this, like every quality measure for clustering, implies
a particular concept of what a cluster is. If you define a cluster as
"looking like a mixture component in a normal mixture", then this is OK,
but very likely you will then get the "best" clustering using a method which
performs estimation in a normal mixture model.

If you have a different concept of a cluster and you formalize it via a
quality criterion, you will get the best clustering by optimizing *this*
quality criterion (maybe apart from possible numerical problems).

The important point is that no quality criterion for clustering provides an
independent objective decision of what the best clustering is. The choice
of an adequate quality criterion is as difficult and subjective as the
choice of the best clustering method.

Best,
Christian