Hello all,
This might have been fixed in later versions (I am using R 1.7.0). The r-help
archive contains messages reporting similar problems but no reports of
code fixes. I have encountered a couple of problems using the
silhouette function. One occurs when the clustering contains clusters
composed of 1 element (Martin Maechler posted code a few months ago that
fixes a similar problem that occurs when clusters have only 2 elements,
but it does not cover the case with 1 element). The other problem is due to
silhouette's assumption that the clusters are numbered sequentially
starting at 1. One of the clustering programs I use (snob) assigns more
or less arbitrary integer ids to clusters, starting from 3 (clusters 1
and 2 have special meaning in snob). The modified code fixing both
problems is included below; changes are commented.
best
Murad
silhouette.default <-
function (x, dist, dmatrix, ...)
{
    cll <- match.call()
    if (!is.null(cl <- x$clustering))
        x <- cl
    n <- length(x)
    if (!all(x == round(x)))
        stop("`x' must only have integer codes")
    k <- length(clid <- sort(unique(x)))
    if (k <= 1 || k >= n)
        return(NA)
    if (missing(dist)) {
        if (missing(dmatrix))
            stop("Need either a dissimilarity `dist' or diss.matrix `dmatrix'")
        if (is.null(dm <- dim(dmatrix)) || length(dm) != 2 ||
            !all(n == dm))
            stop("`dmatrix' is not a dissimilarity matrix compatible to `x'")
    }
    else {
        dist <- as.dist(dist)
        if (n != attr(dist, "Size"))
            stop("clustering `x' and dissimilarity `dist' are incompatible")
        dmatrix <- as.matrix(dist)
    }
    wds <- matrix(NA, n, 3, dimnames = list(names(x), c("cluster",
        "neighbor", "sil_width")))
    for (j in 1:k) {
        Nj <- sum(iC <- x == clid[j])
        #
        # the following line changed from wds[iC, "cluster"] <- j
        #
        wds[iC, "cluster"] <- clid[j]
        a.i <- if (Nj > 1)
            colSums(dmatrix[iC, iC])/(Nj - 1)
        else 0
        #
        # the following line changed from
        # diC <- rbind(apply(dmatrix[!iC, iC], 2, function(r) tapply(r,
        # x[!iC], mean)))
        #
        diC <- rbind(apply(cbind(dmatrix[!iC, iC]), 2,
            function(r) tapply(r, x[!iC], mean)))
        minC <- max.col(-t(diC))
        wds[iC, "neighbor"] <- clid[-j][minC]
        #
        # the following line changed from
        # b.i <- diC[cbind(minC, seq(minC))]
        #
        b.i <- diC[cbind(minC, seq(along=minC))]
        s.i <- (b.i - a.i)/pmax(b.i, a.i)
        wds[iC, "sil_width"] <- s.i
    }
    attr(wds, "Ordered") <- FALSE
    attr(wds, "call") <- cll
    class(wds) <- "silhouette"
    wds
}
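For illustration, here is a minimal sketch in base R that works out the
silhouette widths by hand for a toy clustering showing both situations:
ids that start at 3 and a cluster with a single element. The data and the
helper name sil_width are made up for this example, and it is not the
code from the cluster package. It follows the same choice as the function
above of taking a(i) = 0 for a one-element cluster, which makes the width
of that point come out as 1.

```r
# Toy data (hypothetical): four points on a line.
pts <- c(0, 1, 2, 10)
x   <- c(3, 3, 3, 7)         # cluster ids start at 3; cluster 7 is a singleton
d   <- as.matrix(dist(pts))  # full 4 x 4 dissimilarity matrix

# Silhouette width of observation i, computed directly from the definition.
sil_width <- function(i, x, d) {
  own <- x == x[i]
  Nj  <- sum(own)
  # a: average dissimilarity to the other members of i's own cluster.
  # d[i, i] is 0, so summing over the whole cluster and dividing by
  # Nj - 1 gives that average; a is taken as 0 for a singleton.
  a <- if (Nj > 1) sum(d[i, own]) / (Nj - 1) else 0
  # b: smallest average dissimilarity from i to any other cluster.
  others <- setdiff(unique(x), x[i])
  b <- min(sapply(others, function(cl) mean(d[i, x == cl])))
  (b - a) / max(a, b)
}

s <- sapply(seq_along(x), sil_width, x = x, d = d)
# widths: 0.85, 8/9, 0.8125 for the three points in cluster 3,
# and 1 for the singleton in cluster 7
```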
Murad Nayal M.D. Ph.D.
Department of Biochemistry and Molecular Biophysics
College of Physicians and Surgeons of Columbia University
630 West 168th Street. New York, NY 10032
Tel: 212-305-6884 Fax: 212-305-6926
"Murad" == Murad Nayal <mn216 at columbia.edu>
on Wed, 21 Jan 2004 15:19:28 -0500 writes:
Murad> This might have been fixed in later versions (I am
Murad> using R1.7.0),
yes, the bug has been fixed "long ago",
from my ChangeLog (!), it was 2003-07-18.
Murad> the r-help archive contains messages reporting similar
Murad> problems but no reports of code fixes. I have
Murad> encountered a couple of problems using the silhouette
Murad> function. One occurs when the clustering contains
Murad> clusters composed of 1 element (Martin Maechler
Murad> posted code a few months ago that fixes a similar
Murad> problem that occurs when clusters have only 2
Murad> elements, but it does not cover the case with 1
Murad> element). The other problem is due to silhouette's
Murad> assumption that the clusters are numbered
Murad> sequentially starting at 1.
which is what ?silhouette tells you as well:
Arguments:
       x: an object of appropriate class; for the 'default' method an
          integer vector with cluster codes in '1:k' or a list with
          such an 'x$clustering' component.
So, definitely not a bug,
but it's your problem of using silhouette() on an object that is
not of appropriate structure.
Murad> One of the clustering programs I use (snob) assigns
Murad> more or less arbitrary integer ids to clusters,
Murad> starting from 3 (clusters 1 and 2 have special
Murad> meaning in snob). The modified code fixing both
Murad> problems is included below; changes are commented.
Thank you for the good attempt,
but really you should work from the source where
silhouette.default does also contain comments, and as said, has
long been fixed for the case of 1-element clusters
{a better fix is not to use cbind() but to use the ominous
", drop = FALSE" when subsetting matrices!}
I'm still willing to consider your *feature request* (as opposed
to bug fix) of allowing inputs where the grouping vector does
contain other than "1:g" .
I'll send you the current source of silhouette.default in a
private mail.
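To illustrate the remark above about ", drop = FALSE", here is a minimal
sketch on a made-up 4 x 4 dissimilarity matrix. The variable names are
only for illustration and this is not the code in the cluster sources.
It compares plain subsetting, wrapping the result in cbind(), and
subsetting with drop = FALSE, first when the current cluster has one
element and then when the remaining observations consist of one element.

```r
# Toy dissimilarity matrix (hypothetical) for points 0, 1, 2, 10 on a line.
dmatrix <- as.matrix(dist(c(0, 1, 2, 10)))

# Case 1: the current cluster is a singleton, so one column is selected.
iC <- c(FALSE, FALSE, FALSE, TRUE)
plain   <- dmatrix[!iC, iC]                # collapses to a length-3 vector
wrapped <- cbind(dmatrix[!iC, iC])         # 3 x 1 matrix
kept    <- dmatrix[!iC, iC, drop = FALSE]  # 3 x 1 matrix

# Case 2: the other cluster is the singleton, so one row is selected.
iC2 <- c(TRUE, TRUE, TRUE, FALSE)
wrapped2 <- cbind(dmatrix[!iC2, iC2])         # 3 x 1: rows and columns swapped
kept2    <- dmatrix[!iC2, iC2, drop = FALSE]  # 1 x 3: orientation preserved
```

In the second case, an apply() over the columns of the cbind() result
would hand tapply() a column of length 3 together with a grouping vector
of length 1, which is presumably where the cbind() approach falls short,
while drop = FALSE keeps the shape right in both cases.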
Thanks for your collaboration on improving R!
Regards,
Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><
"Murad" == Murad Nayal <mn216 at columbia.edu>
on Wed, 21 Jan 2004 15:19:28 -0500 writes:
Murad> This might have been fixed in later versions (I am
Murad> using R1.7.0),
Martin> yes, the bug has been fixed "long ago",
Martin> from my ChangeLog (!), it was 2003-07-18.
Sorry about that. I have been reluctant to upgrade recently for fear of
disrupting my environment while in the middle of a project. As I
mentioned, I searched the archive and found posts citing this problem but
no replies stating that it has been fixed (the Nj=1 case).
Martin> I'm still willing to consider your *feature request* (as opposed
Martin> to bug fix) of allowing inputs where the grouping vector does
Martin> contain other than "1:g" .
That would be great. It is straightforward to do and will broaden the
utility of silhouette. I'll send you the suggested patch privately.
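A minimal sketch of the kind of remapping involved, in base R (the label
vector is made up, and this is only an illustration, not the patch
itself): match() against the sorted unique ids gives codes in 1:k, and
indexing the ids by those codes recovers the original labels.

```r
# Hypothetical snob-style labels: arbitrary integer ids, none equal to 1 or 2.
x <- c(7, 3, 3, 12, 7, 3)

clid  <- sort(unique(x))  # original ids in increasing order: 3 7 12
codes <- match(x, clid)   # sequential codes in 1:k: 2 1 1 3 2 1

# The original ids can be recovered from the codes at any time.
back <- clid[codes]
```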
best regards,
Murad