
cluster in R

11 messages · Weiwei Shi, Gabor Grothendieck, BBands +1 more

#
hi,

is there some good summary on clustering methods in R? It seems there
are many packages involving it.

And I have two questions on clustering here:

1. Is there a way to evaluate the effectiveness (or separation) of a
clustering (other than by visualization)?

2. Is there a search method (like genetic search) which can help find
the subset of attributes that gives the best separation?

Thanks,
#
Go to the R home page (google for R), click on CRAN in the left pane,
choose a mirror, click on Task Views in the left pane and choose
Cluster.
On 10/17/06, Weiwei Shi <helprhelp at gmail.com> wrote:
#
On 10/17/06, Weiwei Shi <helprhelp at gmail.com> wrote:

            
Gabor provided this very useful link a couple of days back.

http://cran.r-project.org/src/contrib/Views/Cluster.html

    jab
#
hi,
I just happened to find that page, but it seems too brief to me. For
example, in my project neither the number of clusters nor the set of
attributes to use for the samples being clustered is known in advance.
What kind of methods should I start with?

Thanks a lot for the prompt reply.

W.
On 10/17/06, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:

  
    
#
Dear Weiwei,
The function cluster.stats in package fpc computes several cluster 
validation statistics (among them the average silhouette width).
Function clusterboot in the same package (recent version) assesses cluster 
stability. There are several interfaces to clustering methods implemented 
in R which are documented on the help page of kmeansCBI (which gives you 
kind of an overview of available "general purpose" clustering methods in R 
though I may have missed some).
There are also several methods for the visualization of separation (I 
know that you didn't ask for that) for which the function plotcluster is 
an interface.
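
As an illustration of how such a validation statistic can be used, here is a minimal sketch that scans several cluster numbers and compares the average silhouette width. The toy data, the object names (x, ks, avg.sil, best.k) and the choice of hclust with its default linkage are assumptions made for the example, not part of the original message. The silhouette computation uses silhouette() from the recommended package cluster; the fpc part only runs if fpc is installed.

```r
library(cluster)  # recommended package; provides silhouette()

set.seed(1)
# Toy data (an assumption for illustration): three well-separated 2-D groups
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2))
d  <- dist(x)      # Euclidean distances between rows
hc <- hclust(d)    # complete linkage by default

# Average silhouette width for k = 2..6 clusters
ks <- 2:6
avg.sil <- sapply(ks, function(k) {
  cl <- cutree(hc, k)
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg.sil) <- ks
print(round(avg.sil, 3))
best.k <- ks[which.max(avg.sil)]   # k with the largest average silhouette width

# If fpc is installed, cluster.stats() gives this statistic and several others
if (requireNamespace("fpc", quietly = TRUE)) {
  st <- fpc::cluster.stats(d, cutree(hc, best.k))
  print(st$cluster.size)
  print(st$avg.silwidth)
}
```

A larger average silhouette width suggests better separated clusters, but it is only one criterion; clusterboot() in fpc looks at stability instead.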

Best,
Christian


*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
#
Dear Christian:
This is really a good summary. Most of my previous experience was with
classification rather than clustering, and this is really a good start
for me. Thank you!

And also hope someone can provide more info and answers to the other questions.

cheers,

weiwei
On 10/18/06, Christian Hennig <chrish at stats.ucl.ac.uk> wrote:

  
    
#
Dear Chris:

I have a data set with these dimensions (rows, columns):
[1] 142  28

and I want to cluster the rows.
First of all, I followed the examples for cluster.stats():
d.dd <- dist(dd.df) # use Euclidean
d.4 <- cutree(hclust(d.dd), 4) # 4 clusters I tried
cluster.stats(d.dd, d.4) # gives me some results like this:

$cluster.size
[1] 133   5   2   2

$avg.silwidth
[1] 0.9857916

But when I tried a Pearson-based distance, it gave me a very bad
avg.silwidth, even though from visualization I think 4 or 5 clusters
look good for it.

d.dd <- as.dist(cor(t(x), method = "pearson")) # is this correct?
$cluster.size
[1] 86 31  6 19

$avg.silwidth
[1] -0.09543089


Is there something wrong, or should I not use a Pearson-based distance?

By the way, what is $separation? Where can I find a detailed explanation
of the output from cluster.stats?

btw, ?cluster.stats does not work on my Mac machine.
_
platform       i386-apple-darwin8.6.1
arch           i386
os             darwin8.6.1
system         i386, darwin8.6.1
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)

thanks,

weiwei
On 10/18/06, Weiwei Shi <helprhelp at gmail.com> wrote:

  
    
#
Dear Weiwei,
Because I don't have access to a Mac, I can't tell you anything about
this, unfortunately.
I always thought that my package should work on all platforms if it passes
all the standard tests for packages?
(Is there anyone else who could comment on this please?)
cor can give negative values, which doesn't fit the usual definition
of a distance. I don't know what as.dist does in this case, but I think 
that, depending on your application, you should rather use the absolute 
value of the correlation, or 1+cor.
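
To make the point about negative values concrete, here is a small sketch comparing a few ways of turning a correlation matrix into a dissimilarity. The toy data and object names (x, r, d.raw, d.minus, d.abs, d.plus) are assumptions for illustration. As far as I know, as.dist() only takes the lower triangle of the matrix as it is, so negative correlations stay negative. Note also that with 1 + cor the most negatively correlated rows get dissimilarity 0, whereas 1 - cor puts them farthest apart; which transform is appropriate depends on the application.

```r
set.seed(2)
# Toy data (an assumption): 6 rows ("genes") measured on 10 columns
base <- rnorm(10)
x <- rbind(g1 =  base + rnorm(10, sd = 0.1),
           g2 =  base + rnorm(10, sd = 0.1),  # positively correlated with g1
           g3 = -base + rnorm(10, sd = 0.1),  # negatively correlated with g1
           g4 = -base + rnorm(10, sd = 0.1),
           g5 = rnorm(10),
           g6 = rnorm(10))

r <- cor(t(x), method = "pearson")  # row-by-row correlations, in [-1, 1]

d.raw   <- as.dist(r)           # correlations used directly; can be negative
d.minus <- as.dist(1 - r)       # 0 = perfectly correlated, 2 = perfectly anti-correlated
d.abs   <- as.dist(1 - abs(r))  # strong correlation of either sign counts as close
d.plus  <- as.dist(1 + r)       # 0 = perfectly anti-correlated, 2 = perfectly correlated

range(d.raw)                      # includes negative values
as.matrix(d.minus)["g1", "g3"]    # large: anti-correlated rows are far apart
as.matrix(d.plus)["g1", "g3"]     # small: anti-correlated rows are close
```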
This is documented on the cluster.stats help page:

separation: vector of clusterwise minimum distances of a point in the
           cluster to a point of another cluster.
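
The definition above can be reproduced by hand, which may help when reading the output. This is only a sketch: the toy data and the helper name separation_by_hand are hypothetical, and the comparison with cluster.stats() runs only if fpc is installed.

```r
set.seed(3)
# Toy data (an assumption): three 2-D groups of 10 points each
x  <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
            matrix(rnorm(20, mean = 4, sd = 0.5), ncol = 2),
            matrix(rnorm(20, mean = 8, sd = 0.5), ncol = 2))
d  <- dist(x)
cl <- cutree(hclust(d), 3)

# Hypothetical helper: for each cluster, the smallest distance from any of
# its points to any point outside it
separation_by_hand <- function(d, cl) {
  dm <- as.matrix(d)
  sapply(sort(unique(cl)), function(k) min(dm[cl == k, cl != k]))
}
sep <- separation_by_hand(d, cl)
print(sep)

# Optional comparison with fpc, if it is available
if (requireNamespace("fpc", quietly = TRUE)) {
  print(fpc::cluster.stats(d, cl)$separation)
}
```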

Best regards,
Christian


#
Dear Chris:

thanks for the prompt reply!

You are right: the Pearson-based "distance" has negative values, so I
should use cor+1 in my case (since negatively correlated genes should be
considered farthest). Thanks.

As for ?cluster.stats, I double-checked and found that I needed to
restart JGR; only then did the help system pick up the newly loaded
package (fpc in this case).

sorry for the confusion,

weiwei
On 10/18/06, Christian Hennig <chrish at stats.ucl.ac.uk> wrote:

  
    
#
Dear Chris:

I tried cor+1, but the average silhouette width is still below 0.
[1] -0.008750826
[1] -0.09543089
On 10/18/06, Weiwei Shi <helprhelp at gmail.com> wrote:

  
    
#
On Wed, 18 Oct 2006, Weiwei Shi wrote:

            
Well, then it seems that the clustering is not that good.
I don't know your data and there is no theoretical reason why it has to 
be positive. You should read the Kaufman and Rousseeuw book to understand 
the average silhouette width better.
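
For readers without the book at hand, the silhouette width of a point i is s(i) = (b - a) / max(a, b), where a is the mean distance from i to the other members of its own cluster and b is the smallest mean distance from i to the members of any other cluster. The sketch below computes this by hand on toy data to show how a labelling that cuts across the natural groups yields a negative average. The data, the labellings and the helper name sil_one are assumptions for illustration; for real work silhouette() in package cluster does this computation.

```r
# Toy 1-D data (an assumption): two natural groups near 0 and near 10
x  <- c(0, 0.5, 1, 9, 9.5, 10)
dm <- as.matrix(dist(x))

# Hypothetical helper: silhouette width of point i under labelling cl
sil_one <- function(i, cl, dm) {
  own <- cl == cl[i]
  own[i] <- FALSE                         # exclude the point itself
  a <- mean(dm[i, own])                   # mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(dm[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
}

good <- c(1, 1, 1, 2, 2, 2)   # labels matching the two groups
bad  <- c(1, 2, 1, 2, 1, 2)   # labels that cut across the groups

s.good <- sapply(seq_along(x), sil_one, cl = good, dm = dm)
s.bad  <- sapply(seq_along(x), sil_one, cl = bad,  dm = dm)

mean(s.good)   # close to 1: points sit much nearer their own cluster
mean(s.bad)    # below 0: several points are nearer the other cluster
```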

Best wishes,
Christian