estimating number of clusters ("Null or more")

2 messages · Khamenia, Valery, Christian Hennig

#
Hi all,

  once more about the old subj :-)

  My data come from too many different distribution families, and for
  every particular experiment I just need to decide whether the data are
  "quite homogeneous" or contain two or more clusters. I've revisited the
  following libraries:
         amap, clust, cclust, mclust, multiv, normix, survey.

  I didn't find any ready-to-use, general-purpose criterion for answering
  that question, even for one-dimensional data.

  However, "clustIndex" in "cclust" might serve as a rough criterion.
  But nothing is ready to use, as far as I understand. Or maybe I am wrong?!

  Q: are there any libraries in R with ready-to-use functions for estimating
     the number of clusters...
       - ... with a criterion based on entropy?
       - ... with a criterion based on the ecdf?

Please Cc to:

   vkhamenia at biovision.de

kind thanks.
---------------------------------------------------------------------------
Valery A.Khamenya
Bioinformatics Department
BioVisioN AG, Hannover
#
Hi,

there are at least two methods to estimate the number of clusters in R:
In library(cluster), you can use the information that comes with the
silhouette plot. This is a bit difficult to figure out from the help pages
(it got better in the recent version, I think), but you can work it out by
reading the help pages of pam, pam.object and partition.object.
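
A minimal sketch of the silhouette route, assuming the cluster package is installed. The toy data and the range of k are illustrative assumptions, not part of the original question. Note that the average silhouette width is only defined for k >= 2, so on its own it cannot say "one cluster".

```r
library(cluster)

set.seed(1)
# Toy one-dimensional data: two well-separated groups (illustrative only)
x <- matrix(c(rnorm(50, mean = 0), rnorm(50, mean = 6)), ncol = 1)

# Average silhouette width of the PAM partition for each candidate k
ks <- 2:6
avg_width <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)

# The k with the largest average silhouette width is the suggested estimate
best_k <- ks[which.max(avg_width)]
print(data.frame(k = ks, avg_width = round(avg_width, 3)))
print(best_k)
```

The `silinfo` component is the one described in the pam.object and partition.object help pages mentioned above.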

EMclust of library mclust decides about an optimal number of mixture
components using the BIC.
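
A rough sketch of the BIC route, assuming a recent mclust release in which Mclust() performs this BIC-based selection. The toy data and the candidate range G = 1:5 are assumptions for illustration. Because G = 1 is among the candidates, the fit can return a single component, but only under the Gaussian mixture model: homogeneous data that are strongly non-Gaussian may still be assigned several components.

```r
library(mclust)

set.seed(1)
# Toy one-dimensional data: two Gaussian groups (illustrative only)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 6))

# Fit Gaussian mixtures with 1 to 5 components and keep the best one by BIC
fit <- Mclust(x, G = 1:5)

print(fit$G)          # number of mixture components selected by BIC
print(fit$modelName)  # variance model selected ("E" or "V" in one dimension)
```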

As far as I know, there is no direct answer to the problem of testing
homogeneity vs. clustering in R. There are lots of theoretical difficulties
and there is no "standard routine" to do this, neither in R, nor
elsewhere. I would suggest inventing a null model for your data, modelled as
homogeneous, and estimating the distribution of a suitable clustering
statistic (such as the silhouette avg.width in pam, the BIC, the average
distance of the points to their kth nearest neighbor, or the ratio between
the 25% largest and smallest distances in the dataset) by Monte
Carlo/parametric bootstrap. Perhaps I say this too quickly; it's
non-trivial, and at the very least you have to design the simulation so that
rejection/acceptance is not a consequence of different scaling of the data
and the null model.
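
A minimal sketch of such a Monte Carlo test, under loudly stated assumptions: the null model is a single Gaussian fitted to the data, the statistic is the average silhouette width of the 2-cluster pam partition, and `sil2` and `homogeneity_test` are hypothetical helper names. Simulating with the sample's own size, mean and standard deviation is one simple way to keep data and null model on the same scale. A different null model or statistic may suit real data better.

```r
library(cluster)

# Statistic: average silhouette width of the 2-cluster PAM partition
sil2 <- function(x) pam(matrix(x, ncol = 1), 2)$silinfo$avg.width

# Parametric bootstrap test of "homogeneous" (assumed Gaussian) vs. clustered
homogeneity_test <- function(x, B = 199) {
  obs <- sil2(x)
  # Simulate B data sets from the fitted null: same n, location and scale
  sim <- replicate(B, sil2(rnorm(length(x), mean(x), sd(x))))
  # Monte Carlo p-value: share of null statistics at least as large as observed
  p <- (1 + sum(sim >= obs)) / (B + 1)
  list(statistic = obs, p.value = p)
}

set.seed(1)
# Toy data with two groups (illustrative only)
x <- c(rnorm(40, mean = 0), rnorm(40, mean = 5))
res <- homogeneity_test(x)
print(res)
```

A small p-value says the observed partition is sharper than what the assumed homogeneous model typically produces; it does not say how many clusters there are.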

Hope that helps,
Christian
On Thu, 24 Apr 2003, Khamenia, Valery wrote: