Clustering quality measure - R-help

Tue, Jun 17, 2003 8:23 AM

Hi all,
I am running a series of experiments in which, after manipulating my data, I
run several clustering algorithms (agnes, diana and a clustering method
of my own) on the data. I wanted to determine which clustering method
did the best job, so I defined my own quality measure
using two criteria: the compactness of the data within the clusters
and the amount of separation between the clusters. However,
my quality measure does not work: according to it,
the quality keeps improving as more clusters are formed, until
every data instance is a cluster by itself.
Therefore I was wondering whether any of you are aware of any libraries or
functions within R that compute quality measures for clusterings. I am
very interested in the definition of quality measures that do work.
Thanks very much, Jonck

Martin Maechler

Wed, Jun 18, 2003 1:34 AM

Well, "do work" is saying a lot.

But there's silhouette() in the `cluster' package {where agnes()
and diana() reside}. You can plot silhouettes of almost any
clustering {i.e. grouping} as a diagnostic, and the "Average
Silhouette Width" has been proposed as a "goodness of fit" measure
for clusterings, and even as a way to determine how many clusters you
should choose.

One of its several drawbacks is that it's not defined for the
"only 1 cluster" situation, i.e., you cannot use it to compare
one vs two clusters.

--> ?silhouette

and look at and try the "Examples".
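
A minimal sketch of how the average silhouette width could be compared
across numbers of clusters for agnes() and diana(). The toy data and the
helper name avg_sil() are assumptions made for illustration; they are not
from the posts above.

```r
library(cluster)  # recommended package: agnes(), diana(), silhouette()

set.seed(1)
# Hypothetical toy data: three well-separated 2-D Gaussian blobs
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))
d <- dist(x)

ag <- agnes(d)  # agglomerative hierarchical clustering
di <- diana(d)  # divisive hierarchical clustering

# Average silhouette width of the k-cluster partition cut from a tree
avg_sil <- function(tree, k) {
  grp <- cutree(as.hclust(tree), k = k)
  summary(silhouette(grp, d))$avg.width
}

ks <- 2:8  # the silhouette is not defined for a single cluster
sil_agnes <- sapply(ks, function(k) avg_sil(ag, k))
sil_diana <- sapply(ks, function(k) avg_sil(di, k))

# Number of clusters with the largest average silhouette width
best_k_agnes <- ks[which.max(sil_agnes)]
best_k_diana <- ks[which.max(sil_diana)]
```

The same avg_sil() idea applies to any grouping vector, so a partition from
a home-grown method can be scored by passing its integer cluster labels to
silhouette() directly.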

Regards,
Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><

Christian Hennig

Wed, Jun 18, 2003 2:20 AM

Hi,

This sounds a bit like the ratio of within-cluster variation to
between-cluster variation. Similar measures arise as negative
log-likelihoods in certain normal-distribution-based clustering methods.
Of course they get better with more clusters, because there are more
degrees of freedom for the fit. A common strategy is to penalize the
negative log-likelihood by an increasing function of the number of degrees
of freedom.
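
To illustrate why an unpenalized compactness measure keeps improving, here
is a small sketch using only base R. The toy data and the helper within_ss()
are hypothetical; the point is only that, for the nested partitions cut from
one tree, the total within-cluster sum of squares never increases with the
number of clusters and reaches zero when every point is its own cluster.

```r
set.seed(1)
# Hypothetical toy data: two 2-D Gaussian blobs, 60 points in total
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))
n <- nrow(x)

# Total within-cluster sum of squares for a grouping vector
within_ss <- function(x, grp) {
  sum(sapply(split(as.data.frame(x), grp), function(g) {
    # squared deviations of each cluster's points from the cluster mean
    sum(scale(as.matrix(g), center = TRUE, scale = FALSE)^2)
  }))
}

tree <- hclust(dist(x), method = "average")
ks <- 1:n
wss <- sapply(ks, function(k) within_ss(x, cutree(tree, k = k)))

# wss is non-increasing in k, and the last entry (k = n) is zero
plot(ks, wss, type = "b", xlab = "number of clusters", ylab = "within SS")
```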

This is implemented as BIC (Bayesian Information Criterion) for various
normal mixture models in library mclust and is used there to decide about
the best model (number of clusters, covariance matrix parametrization).

In principle, you could compute the BIC, given a certain covariance matrix
parametrization, for every partition from an arbitrary clustering.
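
As a rough sketch of that idea, the function below computes a BIC-like score
for an arbitrary hard partition, assuming a spherical covariance matrix with
one common variance for all clusters. Note the assumptions: partition_bic()
is a hypothetical helper, not part of mclust; it uses the classification
likelihood of the hard partition rather than the mixture likelihood that
mclust maximizes; and it follows the sign convention in which larger values
are better.

```r
# BIC-like score of a hard partition under a spherical, equal-variance
# Gaussian model (assumption: classification likelihood, larger is better)
partition_bic <- function(x, grp) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)
  k <- length(unique(grp))
  # Cluster means, expanded back to one row per observation
  centers <- apply(x, 2, function(col) ave(col, grp))
  sigma2 <- sum((x - centers)^2) / (n * d)  # pooled ML variance estimate
  nj <- as.numeric(table(grp))              # cluster sizes
  loglik <- sum(nj * log(nj / n)) -
    n * d / 2 * (log(2 * pi * sigma2) + 1)
  n_par <- (k - 1) + k * d + 1              # proportions, means, variance
  2 * loglik - n_par * log(n)
}

set.seed(1)
# Hypothetical toy data: three well-separated 2-D Gaussian blobs
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))
tree <- hclust(dist(x), method = "average")

ks <- 1:10
bic <- sapply(ks, function(k) partition_bic(x, cutree(tree, k = k)))
best_k <- ks[which.max(bic)]
```

The score is undefined when the pooled variance is zero (for example, when
every point is its own cluster), so the range of k has to stop well short of
the number of observations.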

Note, however, that this, like every quality measure for clustering, implies
a particular concept of what a cluster is. If you define a cluster as
"looking like a mixture component in a normal mixture", then this is OK,
but very likely you will then get the "best" clustering using a method which
performs estimation in a normal mixture model.

If you have a different concept of a cluster and you formalize it via a
quality criterion, you will get the best clustering by optimizing *this*
quality criterion (maybe apart from possible numerical problems).

The important point is that no quality criterion for clustering provides an
independent objective decision of what the best clustering is. The choice
of an adequate quality criterion is as difficult and subjective as the
choice of the best clustering method.

Best,
Christian

***********************************************************************
Christian Hennig
Seminar fuer Statistik, ETH-Zentrum (LEO), CH-8092 Zuerich (currently)
and Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at stat.math.ethz.ch, http://stat.ethz.ch/~hennig/
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag.de