
AW: [R] estimating number of clusters ("Null or more")

Dear Valery,
On Thu, 24 Apr 2003, Khamenia, Valery wrote:

            
I agree totally.
The problem is that you have to formalize what a cluster is, and this is
not a well-defined notion. It has different meanings in different
applications. My interpretation of the normal mixture/BIC approach is that
it should work well if *your* concept of a cluster is that it looks
normal-shaped (and the clusters do not need to be separated too strongly).
Normal mixtures (sometimes with lots of components) are reasonable
approximations to a wide class of distributions, so the validity of the
approach is a question of your cluster concept rather than of the
distribution of the data. (However, if your concept of "homogeneity" does
not look normal, BIC may often choose more than one component for data
that are homogeneous *in your sense*.)
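
To make the mixture/BIC idea concrete, here is a minimal sketch in plain
Python (standard library only) for one-dimensional data. It is an
illustration under my own assumptions, not the implementation of any
particular R package: the function names, the crude starting values, and
the sign convention BIC = -2 logL + p log n (smaller is better) are all
choices made for this example.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def fit_mixture_loglik(data, k, iters=100):
    """Run EM for a 1-D normal mixture with k components.

    Returns the log-likelihood evaluated in the last E-step.
    Starting values are a crude, deterministic split of the sorted data
    (an assumption of this sketch, not a recommended initialisation).
    """
    xs = sorted(data)
    n = len(xs)
    mean = sum(xs) / n
    overall_sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    mus = [sum(c) / len(c) for c in chunks]
    sigmas = [overall_sd] * k
    weights = [1.0 / k] * k
    loglik = 0.0
    for _ in range(iters):
        # E-step: posterior membership probabilities, computed on the log scale.
        resp = []
        loglik = 0.0
        for x in xs:
            logs = [math.log(weights[j]) + normal_logpdf(x, mus[j], sigmas[j])
                    for j in range(k)]
            m = max(logs)
            total = m + math.log(sum(math.exp(v - m) for v in logs))
            loglik += total
            resp.append([math.exp(v - total) for v in logs])
        # M-step: weighted means, standard deviations and mixing proportions.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)  # guard against empty components
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3 * overall_sd)  # avoid degenerate spikes
    return loglik


def bic(data, k):
    """BIC = -2 logL + p log n, with p = 3k - 1 free parameters (smaller is better)."""
    p = 3 * k - 1
    return -2.0 * fit_mixture_loglik(data, k) + p * math.log(len(data))


def choose_k(data, max_k=3):
    """Number of components (1..max_k) with the smallest BIC."""
    return min(range(1, max_k + 1), key=lambda k: bic(data, k))


# Example: two clearly separated normal groups.
rng = random.Random(0)
separated = [rng.gauss(0, 1) for _ in range(150)] + [rng.gauss(10, 1) for _ in range(150)]
print("BIC, 1 component: ", bic(separated, 1))
print("BIC, 2 components:", bic(separated, 2))
```

Feeding the same functions data that is homogeneous in some non-normal
sense (uniformly distributed values, say) is a quick way to check how the
number of components chosen by BIC relates to your own cluster concept.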

Some material about my own point of view is given in "What clusters are
generated by Normal mixtures?" on
http://www.math.uni-hamburg.de/home/hennig/ -> Papers/publications
with associated R-software (fixed point clusters) on the same website.
This means: Do not use N(0,1) as the null distribution for homogeneous data
if your data have variance 5 and the test statistic is not scale
equivariant (as with k-nearest neighbors and others). More generally, you
have to think about which features of your data should enter into your
homogeneous null model (which makes the procedure a parametric bootstrap
with no guarantee that the p-values are valid).
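
A small sketch of such a parametric bootstrap, again in plain Python and
under assumptions of my own: the statistic (mean nearest-neighbor distance
in one dimension, small values taken as evidence of clustering), the normal
null model, and all function names are illustrative choices, not a
recommended test. The point is only that the null model takes its mean and
standard deviation from the data instead of using N(0,1).

```python
import math
import random


def mean_nn_distance(xs):
    """Mean distance from each point to its nearest neighbor (1-D data)."""
    s = sorted(xs)
    n = len(s)
    total = 0.0
    for i in range(n):
        left = s[i] - s[i - 1] if i > 0 else math.inf
        right = s[i + 1] - s[i] if i < n - 1 else math.inf
        total += min(left, right)
    return total / n


def bootstrap_p_value(data, null_sd=None, reps=500, rng=None):
    """Parametric bootstrap p-value against a homogeneous normal null.

    The null mean is estimated from the data. The null standard deviation is
    also estimated from the data unless null_sd is given explicitly
    (null_sd=1.0 mimics the mistake of using a unit-variance null).
    Small values of the statistic count as evidence against homogeneity.
    """
    rng = rng or random.Random(0)
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    sim_sd = sd if null_sd is None else null_sd
    observed = mean_nn_distance(data)
    at_most_observed = 0
    for _ in range(reps):
        sim = [rng.gauss(mean, sim_sd) for _ in range(n)]
        if mean_nn_distance(sim) <= observed:
            at_most_observed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_most_observed + 1) / (reps + 1)


# Homogeneous data with variance 5, tested against two different null models.
rng = random.Random(1)
data = [rng.gauss(0, math.sqrt(5)) for _ in range(100)]
print("null scaled to the data:", bootstrap_p_value(data, reps=200))
print("unit-variance null:     ", bootstrap_p_value(data, null_sd=1.0, reps=200))
```

With the unit-variance null, the simulated nearest-neighbor distances are
systematically smaller than those of the data, so the comparison reflects
the mismatch in scale and not the presence or absence of clusters.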

Best,
Christian