references on cluster analysis
I don't really believe that there is any satisfactory definition of the "true number of clusters", let alone a procedure that would reliably find it.
Murray Jorgensen
Martin Maechler wrote:
Back from my vacation, I haven't seen an R-help answer on this (Christian, where have you been ? ;-)
"GiampS" == Giampiero Salvi <giampi at speech.kth.se> on Sat, 7 Feb 2004 23:40:36 +0100 (CET) writes:
GiampS> Hi all, I'm doing a study on predicting the "true"
GiampS> number of clusters in a hierarchical clustering
GiampS> scheme. My main reference is at the moment
GiampS> Milligan GW and Cooper MC (1985) "An examination of
GiampS> procedures for determining the number of clusters in
GiampS> a data set" Psychometrika vol 50 no 2 pp 159-179
GiampS> and all the references included in that paper.
(not available to me)
GiampS> I'm planning to perform a similar comparison on a
GiampS> number of indexes, but on a much larger data set (in
GiampS> the order of 3000 points), and with a much higher
GiampS> "true" number of clusters (in the order of some
GiampS> hundreds), to see if the properties of the indexes
GiampS> scale accordingly.
GiampS> I was wondering if the set of indexes described in
GiampS> the reference are still "state of the art" (most of
GiampS> them were introduced in the '60s and '70s), or if
GiampS> there are new indexes and methods I could include in
GiampS> my study. I would really appreciate if you could
GiampS> point me to some newer references addressing this problem.
Gordon's 2nd edition,
author = {A. D. Gordon},
title = {Classification, 2nd Edition},
publisher = {Chapman \& Hall/CRC},
year = 1999,
series = {Monographs on Statistics and Applied Probability 82},
edition = {2nd edition}
has a whole chapter (one of the last ones in the book) on this.
R's cluster package has a generic silhouette() function (with 2 methods)
and a plot.silhouette() method --- all are improvements on
Kaufman & Rousseeuw's original code.
A recent research paper using "CLEST" (Fridlyand & Dudoit) and
mentioning "GAP" (Tibshirani), etc., still finds silhouette
among the best "indices" for determining the number of clusters.
A student's (master) thesis here seems to point in the same
direction.
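
As an illustration of using the average silhouette width as an index for the number of clusters, here is a minimal sketch. It is written in plain Python rather than R, and it is not the code behind silhouette() in the cluster package. The toy data, the naive average-linkage routine and the function names (average_linkage, average_silhouette) are assumptions made up for this example. Singleton clusters are given s(i) = 0, the usual convention.

```python
import math

def average_linkage(points, k):
    """Naive agglomerative clustering (average linkage), stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(points)
    for lab, members in enumerate(clusters):
        for m in members:
            labels[m] = lab
    return labels

def average_silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for other_lab, other in groups.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (an assumption for illustration): three well-separated 2x2 squares.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]

# Cut the hierarchy at each candidate k and record the average silhouette width.
scores = {k: average_silhouette(points, average_linkage(points, k))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On a data set of about 3000 points the cubic-time merge loop above would be far too slow. It is only meant to show how the index is computed and maximised over k.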
GiampS> I also read Milligan's chapter in the book
GiampS> "Clustering and Classification" from 1995,
(which book? author?)
GiampS> but didn't find information on this subject that wasn't
GiampS> included in the previous paper.
Regards,
Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><
Dr Murray Jorgensen http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz Fax 7 838 4155
Phone +64 7 838 4773 wk +64 7 849 6486 home Mobile 021 1395 862