silhouette.default bugs

3 messages · Murad Nayal, Martin Maechler

#
Hello all,


This might have been fixed in later versions (I am using R 1.7.0). The
r-help archive contains messages reporting similar problems but no
reports of code fixes. I have encountered a couple of problems using the
silhouette function. One occurs when the clustering contains clusters
composed of 1 element (Martin Maechler posted code a few months ago that
fixes a similar problem that occurs when clusters have only 2 elements,
but it does not cover the case with 1 element). The other problem is due
to silhouette's assumption that the clusters are numbered sequentially
starting at 1. One of the clustering programs I use (snob) assigns more
or less arbitrary integer ids to clusters starting from 3 (clusters 1
and 2 have special meaning in snob). The modified code fixing both
problems is included below; changes are commented.

best
Murad

silhouette.default <-
function (x, dist, dmatrix, ...) 
{
    cll <- match.call()
    if (!is.null(cl <- x$clustering)) 
        x <- cl
    n <- length(x)
    if (!all(x == round(x))) 
        stop("`x' must only have integer codes")
    k <- length(clid <- sort(unique(x)))
    if (k <= 1 || k >= n) 
        return(NA)
    if (missing(dist)) {
        if (missing(dmatrix)) 
            stop("Need either a dissimilarity `dist' or diss.matrix `dmatrix'")
        if (is.null(dm <- dim(dmatrix)) || length(dm) != 2 || 
            !all(n == dm)) 
            stop("`dmatrix' is not a dissimilarity matrix compatible to `x'")
    }
    else {
        dist <- as.dist(dist)
        if (n != attr(dist, "Size")) 
            stop("clustering `x' and dissimilarity `dist' are incompatible")
        dmatrix <- as.matrix(dist)
    }
    wds <- matrix(NA, n, 3, dimnames = list(names(x), c("cluster", 
        "neighbor", "sil_width")))
    for (j in 1:k) {
        Nj <- sum(iC <- x == clid[j])
#
# the following line changed from  wds[iC, "cluster"] <- j
#
        wds[iC, "cluster"] <- clid[j]
        a.i <- if (Nj > 1) 
            colSums(dmatrix[iC, iC])/(Nj - 1)
        else 0
#
# the following line changed from 
# diC <- rbind(apply(dmatrix[!iC, iC], 2, function(r) tapply(r, x[!iC], mean)))
#
        diC <- rbind(apply(cbind(dmatrix[!iC, iC]), 2,
            function(r) tapply(r, x[!iC], mean)))
        minC <- max.col(-t(diC))
        wds[iC, "neighbor"] <- clid[-j][minC]
#
# the following line changed from 
# b.i <- diC[cbind(minC, seq(minC))]
#
        b.i <- diC[cbind(minC, seq(along=minC))]
        s.i <- (b.i - a.i)/pmax(b.i, a.i)
        wds[iC, "sil_width"] <- s.i
    }
    attr(wds, "Ordered") <- FALSE
    attr(wds, "call") <- cll
    class(wds) <- "silhouette"
    wds
}
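
The last of the three changes is easy to overlook. A minimal sketch of why it matters, assuming a one-element cluster so that minC has length 1 (the value 3 is made up for illustration): seq() applied to a single number counts up to that number, whereas seq(along = ) returns one index per element.

```r
# minC holds, for each member of the current cluster, the row of diC that
# belongs to the nearest other cluster. A one-element cluster gives a
# minC of length 1; the value 3 here is purely illustrative.
minC <- 3L

wrong <- seq(minC)          # treats 3 as an upper bound: 1 2 3
right <- seq(along = minC)  # one index per element of minC: 1

# Index matrices as they would be passed to diC[...]
cbind(minC, wrong)          # three rows, only the first is meaningful
cbind(minC, right)          # exactly one row, as intended
```

In current versions of R, seq_along(minC) is the usual spelling of seq(along = minC).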
#
    Murad> This might have been fixed in later versions (I am
    Murad> using R1.7.0),

Yes, the bug was fixed "long ago";
according to my ChangeLog (!), on 2003-07-18.

    Murad>  r-help archive contains messages reporting similar
    Murad> problems but no reports of codes fixes. I have
    Murad> encountered a couple of problems using the silhouette
    Murad> function. one occurs when the clustering contains
    Murad> clusters composed of 1 element (Martin Maechler
    Murad> posted code few months ago that fixes a similar
    Murad> problem that occurs when clusters have only 2
    Murad> elements but not the case with 1 element). the other
    Murad> problem is due to silhouette's assumption that the
    Murad> clusters are numbered sequentially starting at 1. 

which is what  ?silhouette  tells you as well.
So this is definitely not a bug,
but a problem of using silhouette() on an object that does
not have the appropriate structure.

    Murad> one of the clustering programs I use (snob) assigns
    Murad> more or less arbitrary integer ids to clusters
    Murad> starting from 3! (clusters 1 and 2 have special
    Murad> meaning in snob). the modified code fixing both
    Murad> problems is included below, changes are commented.

Thank you for the good attempt,
but really you should work from the source, where
silhouette.default also contains comments and, as I said, has
long been fixed for the case of 1-element clusters
{a better fix is not to use cbind() but to use the ominous 
 ", drop = FALSE" when subsetting matrices!}
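
To illustrate the point about subsetting, here is a small sketch with made-up data (four observations, the last one forming a cluster of its own). It is not the code from the cluster package, only a demonstration of how the column subset loses its dimensions and how either cbind() or drop = FALSE restores them.

```r
# A 4 x 4 dissimilarity matrix for four points on a line (illustrative data).
dmatrix <- as.matrix(dist(c(0, 1, 5, 9)))
x  <- c(1, 1, 1, 2)          # cluster 2 has a single member
iC <- x == 2

dropped <- dmatrix[!iC, iC]                # dimensions are dropped: a vector
kept    <- dmatrix[!iC, iC, drop = FALSE]  # stays a 3 x 1 matrix
wrapped <- cbind(dmatrix[!iC, iC])         # cbind() rebuilds a 3 x 1 matrix

# apply() refuses an object without dimensions ...
err <- try(apply(dropped, 2, mean), silent = TRUE)
inherits(err, "try-error")

# ... but works on the matrix kept intact by drop = FALSE.
diC <- rbind(apply(kept, 2, function(r) tapply(r, x[!iC], mean)))
```

Both repairs give a 3 x 1 matrix here; drop = FALSE also keeps the original column name and avoids the detour through a vector.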

I'm still willing to consider your *feature request* (as opposed
to bug fix) of allowing inputs where the grouping vector
contains values other than "1:g".
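
Until such a feature exists, one possible workaround is to relabel the grouping vector before calling silhouette() and to map the codes back afterwards. The sketch below assumes arbitrary integer ids of the kind described for snob; the vector `ids` is hypothetical.

```r
# Hypothetical cluster ids, not starting at 1 and not consecutive.
ids <- c(7, 3, 3, 12, 7, 3)

clid <- sort(unique(ids))   # the distinct ids in increasing order: 3 7 12
x    <- match(ids, clid)    # sequential codes 1..g: 2 1 1 3 2 1

# x can now be used wherever codes 1:g are required;
# clid[x] recovers the original ids.
back <- clid[x]
```

The same lookup, clid[...], translates the "cluster" and "neighbor" columns of a result back to the original ids.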

I'll send you the current source of silhouette.default in a
private mail.

Thanks for your collaboration on improving R!
Regards,

Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><
#
Martin Maechler wrote:
Sorry about that. I have been reluctant to upgrade recently for fear of
disrupting my environment while in the middle of a project. As I
mentioned, I searched the archive and found posts citing this problem
(the Nj=1 case) but no replies stating that it has been fixed.

As for the feature request, that would be great. It is straightforward
to do and will broaden the utility of silhouette. I'll send you the
suggested patch privately.

best regards,
Murad