Similarity matrix

2 messages · Kaspar Pflugshaupt, Brian Ripley

#
On Wednesday 11 April 2001 10:23, Prof Brian Ripley wrote:

            
I've never done cluster analysis with S-Plus. But let's see:

The statistical manual for S-Plus 5.1/Unix fails to even mention similarity 
matrices.

help(hclust) (in S-Plus 5.1/Unix and 3.4/Unix) says 

  USAGE:                                                            
      
  hclust(dist, method = "compact", sim =)
                                       
  [...]         
                                                         
   sim=                                                  
          structure giving similarities rather than distances. This can
          either be a symmetric matrix or a vector with a "Size"       
          attribute. Missing values are not allowed.

The help text does not explain how the conversion to distances is done, 
though. And the source is not available...
Well, I've taken the time to do it for you (S-Plus 3.4/Unix):

  mat <- matrix(runif(100), nrow=10)
  # 1 - ...: S-Plus seems to mirror the tree's y scale
  # when given a similarity matrix
  print(1 - plclust(hclust( sim=mat ))$yn)

gives the same values as

  print(plclust(hclust( 1-mat ))$yn)

but different values from

  print(plclust(hclust( sqrt(1-mat) ))$yn)

The grouping structure is the same in all three cases, anyway.

So, S-Plus seems to use D=1-S rather than D=sqrt(1-S) internally.

For R, it might be a good idea to let the user choose the conversion method 
via an additional parameter, making D=1-S the default.
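
A minimal sketch of what such an interface could look like in R, where hclust() expects a dist object. The function name hclust_sim() and its `conversion` argument are hypothetical, chosen only for illustration; "complete" is used on the assumption that it corresponds to S-Plus's "compact".

```r
# Hypothetical wrapper: hclust_sim() and `conversion` are illustrative
# names, not part of R or S-Plus.
hclust_sim <- function(sim, conversion = c("linear", "sqrt"),
                       method = "complete") {
  conversion <- match.arg(conversion)
  sim <- as.matrix(sim)
  # Similarities are assumed to lie in [0, 1] and to be symmetric.
  stopifnot(isSymmetric(sim), !anyNA(sim), all(sim >= 0 & sim <= 1))
  d <- switch(conversion,
              linear = 1 - sim,        # D = 1 - S (the proposed default)
              sqrt   = sqrt(1 - sim))  # D = sqrt(1 - S)
  hclust(as.dist(d), method = method)
}

set.seed(1)
s <- matrix(runif(100), nrow = 10)
s <- (s + t(s)) / 2   # symmetrise; as.dist() only reads the lower triangle
diag(s) <- 1          # each object is fully similar to itself

h_lin  <- hclust_sim(s)                       # uses D = 1 - S
h_sqrt <- hclust_sim(s, conversion = "sqrt")  # uses D = sqrt(1 - S)
```

With complete linkage the two trees should differ only in their height scale, which matches what the S-Plus experiment above suggests.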

According to Legendre & Legendre, the choice of similarity coefficient 
_does_ make a difference as to which conversion should be preferred. For some 
"species" of similarity coefficients, the resulting distance is metric 
and Euclidean with one conversion but not with the other; for others it is 
the other way round. 
I don't know if this matters for cluster analysis, but I think that it might, 
especially when clustering with a Euclidean metric.
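
To make the metric/Euclidean distinction concrete, here is a small sketch that tests whether a distance matrix can be embedded in Euclidean space, using the standard criterion from classical scaling (the doubly centred matrix of squared distances must have no negative eigenvalues). The helper is_euclidean() is a hypothetical name, and the similarity matrix is hand-built for illustration; it is not taken from Legendre & Legendre and does not correspond to any particular coefficient.

```r
# is_euclidean() is a hypothetical helper, not a base R function.
is_euclidean <- function(d, tol = 1e-8) {
  d2 <- as.matrix(d)^2
  n  <- nrow(d2)
  j  <- diag(n) - 1 / n            # centring matrix I - (1/n) 11'
  b  <- -0.5 * j %*% d2 %*% j      # doubly centred squared distances
  ev <- eigen((b + t(b)) / 2, symmetric = TRUE, only.values = TRUE)$values
  min(ev) >= -tol * max(abs(ev))   # no (numerically) negative eigenvalue
}

# Hand-built similarities: three mutually dissimilar objects plus a
# fourth that is moderately similar to each of them.
S <- matrix(0, 4, 4)
S[4, 1:3] <- S[1:3, 4] <- 0.45
diag(S) <- 1

is_euclidean(1 - S)        # FALSE: a metric, but not Euclidean
is_euclidean(sqrt(1 - S))  # TRUE
```

For this particular matrix D=1-S satisfies the triangle inequality yet cannot be embedded, while D=sqrt(1-S) can; other matrices may behave differently.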


Cheers (hoping this was to the point :-)

Kaspar Pflugshaupt
#
On Wed, 11 Apr 2001, Kaspar Pflugshaupt wrote:

            
(Unfortunately a few minutes after I had, although Frank Harrell wanted to
know.)
For single- and complete-link clustering only the ordering matters as far
as the clusters are concerned.
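
A quick way to check this in R is to cluster the same data before and after a strictly increasing transform of the distances and compare the merge sequences. The data and the transform f() below are arbitrary choices made for illustration.

```r
set.seed(42)
d   <- dist(matrix(rnorm(40), nrow = 10))  # arbitrary distances
f   <- function(x) x^3 + x                 # any strictly increasing map
d_f <- as.dist(f(as.matrix(d)))            # transformed distances

same_tree <- sapply(c("single", "complete"), function(m) {
  h1 <- hclust(d,   method = m)
  h2 <- hclust(d_f, method = m)
  # Same merge sequence, and heights that are just the transformed heights.
  identical(h1$merge, h2$merge) &&
    isTRUE(all.equal(h2$height, f(h1$height)))
})
print(same_tree)
```

No such guarantee holds for average linkage, since an average does not commute with a nonlinear transform.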
Thanks, it was.

Is it worth adding this to R's hclust?