If you belong to the R contributors group, then thanks a lot
in advance!
The problem is that you have to formalize what a cluster is,
and this is not a well-defined notion.
It has different meanings in different applications.
You are right: if one pursues a full formalization of
the notion, it will most likely fail. But should one really
take this extreme route?
Let's take a small analogy with statistical tests.
Statistical tests never answer "yes" or "no".
One has to interpret the p-values oneself instead.
Thus well-constructed statistics just help us to focus on
particular properties of a given distribution.
Now back to our case. Why not build some statistics (in the
cclust package they are called `indices') to help
focus our attention on properties of the given
distribution?
My interpretation of the normal mixture/BIC
approach is that it should work well if *your* concept of
a cluster is that it looks normal-shaped
(and the clusters do not need to be separated
too strongly).
Fine. I'd like to emphasize here that one should put off,
for as long as possible, any decision about how
many clusters we have, just as with those p-values.
Normal mixtures (sometimes with lots of components) are reasonable
approximations to a wide class of distributions, so the
validity of the approach is rather a question of your
cluster concept than of the distribution of the data.
I do agree that normal mixtures are a very powerful
approximation basis for a wide class of distributions.
But in the context of a data homogeneity criterion they are
rather a weak basis. Indeed, a simple lognormal distribution is
adequately approximated only with more than one component.
That pushes us automatically to the false conclusion that
a lognormal distribution is not homogeneous.
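
To illustrate the lognormal point, here is a minimal sketch in
Python (not R, and not the mclust implementation). It fits 1-D
normal mixtures with a hand-written EM loop and compares BIC for
one and two components on a simulated lognormal sample. The
function names, the quantile-block initialisation, the sample
size and the seed are all assumptions made for this example.

```python
import math
import random

def log_mix_terms(x, weights, means, variances):
    """Log of (weight * normal density) for each component at point x."""
    return [math.log(max(w, 1e-300))
            - 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
            for w, m, v in zip(weights, means, variances)]

def log_sum_exp(terms):
    """Numerically stable log(sum(exp(t))) over a list of terms."""
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def fit_mixture_loglik(data, k, iters=200, var_floor=1e-6):
    """Fit a k-component 1-D normal mixture by EM; return the log-likelihood."""
    xs = sorted(data)
    n = len(xs)
    # Assumed initialisation: one component per equal-sized quantile block.
    blocks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [len(b) / n for b in blocks]
    means = [sum(b) / len(b) for b in blocks]
    variances = [max(sum((x - m) ** 2 for x in b) / len(b), var_floor)
                 for b, m in zip(blocks, means)]
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            terms = log_mix_terms(x, weights, means, variances)
            total = log_sum_exp(terms)
            resp.append([math.exp(t - total) for t in terms])
        # M-step: weighted updates of weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor)
    return sum(log_sum_exp(log_mix_terms(x, weights, means, variances))
               for x in xs)

def bic(loglik, k, n):
    """BIC (lower is better); a k-component mixture has 3k - 1 free parameters."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)

random.seed(0)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(500)]
bic_by_k = {k: bic(fit_mixture_loglik(sample, k), k, len(sample))
            for k in (1, 2)}
best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k, "preferred number of components:", best_k)
```

On such a sample BIC prefers two components over one, although
the data come from a single lognormal population.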
I regard the idea of using entropy as quite adequate
for describing the homogeneity of a data set, and therefore good
enough to serve as a basis for deciding whether there are
clusters or not.
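
One possible reading of an entropy-based statistic, as a minimal
sketch in Python: bin the data, take the Shannon entropy of the
bin frequencies, and divide by log(number of bins), so that
values near 1 mean "homogeneously spread" and lower values mean
"concentrated". The function name, the bin count and the test
data are assumptions for illustration only. This is not one of
the cclust indices.

```python
import math
import random

def normalized_histogram_entropy(data, lo, hi, bins=20):
    """Shannon entropy of the bin frequencies on [lo, hi], scaled to [0, 1]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in data:
        # Clamp so that points on or beyond the edges fall into the end bins.
        idx = min(max(int((x - lo) / width), 0), bins - 1)
        counts[idx] += 1
    n = len(data)
    entropy = -sum(c / n * math.log(c / n) for c in counts if c > 0)
    return entropy / math.log(bins)

random.seed(1)
# Homogeneous case: points spread uniformly over [0, 1].
uniform_data = [random.random() for _ in range(1000)]
# Clustered case: two tight groups around 0.25 and 0.75.
clustered_data = [random.gauss(0.25 if i % 2 == 0 else 0.75, 0.03)
                  for i in range(1000)]

h_uniform = normalized_histogram_entropy(uniform_data, 0.0, 1.0)
h_clustered = normalized_histogram_entropy(clustered_data, 0.0, 1.0)
print("uniform:", round(h_uniform, 3), "clustered:", round(h_clustered, 3))
```

Like a p-value, such a number only directs attention. It depends
on the binning and on the reference range, and it does not by
itself decide how many clusters there are.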
Some material about my own point of view is given in "What
clusters are generated by Normal mixtures?" on
http://www.math.uni-hamburg.de/home/hennig/ -> Papers/publications
with associated R-software (fixed point clusters) on the same
website.
I am reading it now.
This means: Do not use N(0,1) as null distribution for
homogeneous data if your
...
A bit clearer now. Thank you.
Well, could I ask for your own opinion about statistics
(so-called cluster indices) that focus on whether the
data are homogeneously spread or attracted to some
clusters?
In particular, do you believe that entropy-based statistics
would be adequate according to *your* own understanding of
what clusters are?
And it is still an open question for me whether one could
calculate BIC based on the ECDF.
Kind regards,
Valery A. Khamenya