Hi all,
once more about the old subject :-)
My data come from too many different distribution families, and for every
particular experiment I just need to decide whether the data are "quite
homogeneous" or have two or more clusters. I've revisited the following
libraries:
amap, clust, cclust, mclust, multiv, normix, survey.
I didn't find any ready-to-use, general-purpose criterion for answering
that question, not even for one-dimensional data.
In "cclust", "clustIndex" might serve as a rough criterion,
but as far as I understand nothing there is ready to use. Or am I wrong?!
Q: are there any libraries in R with ready-to-use functions for estimating
the number of clusters...
- ... with a criterion based on entropy?
- ... with a criterion based on the ecdf?
Please Cc to:
vkhamenia at biovision.de
kind thanks.
---------------------------------------------------------------------------
Valery A.Khamenya
Bioinformatics Department
BioVisioN AG, Hannover
estimating number of clusters ("Null or more")
2 messages · Khamenia, Valery, Christian Hennig
Hi,

there are at least two methods to estimate the number of clusters in R:

In library(cluster), you can use the information that comes with the silhouette plot. This is a bit difficult to figure out from the help pages (it got better in the recent version, I think); you can work it out by reading the help pages of pam, pam.object and partition.object.

EMclust in library(mclust) chooses an optimal number of mixture components using the BIC.

As far as I know, there is no direct answer in R to the problem of testing homogeneity vs. clustering. There are lots of theoretical difficulties and there is no "standard routine" for this, neither in R nor elsewhere.

I would suggest inventing a null model for your data, modelled as homogeneous, and estimating the distribution of a suitable clustering statistic (such as the average silhouette width in pam, the BIC, the average distance of the points to their kth nearest neighbor, or the ratio between the 25% largest and smallest distances in the dataset) by Monte Carlo/parametric bootstrap. Perhaps I say this too quickly; it's non-trivial, and at the least you have to design the simulation so that rejection/acceptance is not a consequence of different scaling of the data and the null model.

Hope that helps,
Christian
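The Monte Carlo/parametric bootstrap idea from the reply can be sketched in R roughly as follows. This is only an illustration under stated assumptions, not a standard routine: the null model (independent Gaussian marginals fitted to the data), the statistic (best average silhouette width from pam over k = 2..k_max), and the helper names max_sil_width and homogeneity_test are all choices made for this example. Simulating the null data with the fitted mean and standard deviation is one way to keep data and null model on the same scale.

```r
library(cluster)  # provides pam() and its silhouette information

# Best average silhouette width over pam() partitions with k = 2..k_max.
# (Hypothetical helper; the statistic is one of several suggested above.)
max_sil_width <- function(x, k_max = 4) {
  x <- as.matrix(x)
  max(sapply(2:k_max, function(k) pam(x, k)$silinfo$avg.width))
}

# Parametric bootstrap of the statistic under an assumed "homogeneous" null:
# independent Gaussian marginals with the column means and sds of the data.
homogeneity_test <- function(x, n_sim = 200, k_max = 4) {
  x <- as.matrix(x)
  observed <- max_sil_width(x, k_max)
  mu <- colMeans(x)
  sigma <- apply(x, 2, sd)
  null_stats <- replicate(n_sim, {
    # Simulate a data set of the same size and scale from the null model.
    sim <- sapply(seq_len(ncol(x)),
                  function(j) rnorm(nrow(x), mu[j], sigma[j]))
    max_sil_width(sim, k_max)
  })
  # Monte Carlo p-value: share of null statistics at least as large as observed.
  p_value <- (1 + sum(null_stats >= observed)) / (n_sim + 1)
  list(observed = observed, p_value = p_value)
}

# Usage on artificial one-dimensional data with two well-separated groups.
set.seed(1)
two_groups <- c(rnorm(50, mean = 0), rnorm(50, mean = 6))
res <- homogeneity_test(two_groups)
res$p_value  # a small value speaks against homogeneity under this null
```

A Gaussian null is rarely the right notion of "homogeneous" for real data; with many different distribution families, the null model would have to be chosen per experiment, which is exactly the non-trivial part mentioned above.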
On Thu, 24 Apr 2003, Khamenia, Valery wrote:
---------------------------------------------------------------------------
Christian Hennig
Seminar fuer Statistik, ETH-Zentrum (LEO), CH-8092 Zuerich (currently)
and Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at stat.math.ethz.ch, http://stat.ethz.ch/~hennig/
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
I recommend www.boag.de