"FANNY" function in R package "cluster" - R-help

Aamir M

Mon, May 30, 2005 12:57 PM #

Dear All,

I am attempting to use the FANNY fuzzy clustering function in R
(Kaufman & Rousseeuw, 1990), found in the "cluster" package. I have
run into a variety of difficulties; the two most crucial difficulties
are enumerated below.

1. Where is the 'm' parameter in FANNY? 
In _Finding Groups in Data: An Introduction to Cluster Analysis_
(1990) by Kaufman & Rousseeuw, the authors discuss the FANNY
algorithm. On pages 189-190, they attempt to demonstrate an
equivalence between Fuzzy c-Means (FCM) and FANNY. In doing so they,
appear to be assuming that the value of the 'm' parameter in FCM (a
measure of the fuzziness) is fixed at m=2. Although this is how FCM
was originally formulated, it eventually became apparent that m should
be a user-specified parameter and not a fixed value of m=2.  My
question, then, is twofold. Firstly, am I correct in saying that
Kaufman & Rousseeuw have assumed m=2 in their formulation of Fuzzy
c-Means and FANNY? Secondly, is it possible to modify the FANNY
algorithm to allow user-specification of the m (fuzziness) parameter?

2. What do I do with training data?
Is there any way to use FANNY for assigning clustering membership
values to new, test data? In Fuzzy c-Means, new data is compared to
the cluster centers in order to assign clustering membership values to
the test data. However, in FANNY these centers do not exist. Is there,
then, any way to compute the FANNY clustering membership values of a
test data point without affecting the clustering membership values of
the training data? Perhaps there are enough conditions to use the
objective function as a way of computing the membership values of the
test data?

Aamir M
University of Toronto

Martin Maechler

Tue, May 31, 2005 1:40 AM #

Aamir> Dear All, I am attempting to use the FANNY fuzzy
    Aamir> clustering function in R (Kaufman & Rousseeuw, 1990),
    Aamir> found in the "cluster" package. I have run into a
    Aamir> variety of difficulties; the two most crucial
    Aamir> difficulties are enumerated below.

    Aamir> 1. Where is the 'm' parameter in FANNY?  In _Finding
    Aamir> Groups in Data: An Introduction to Cluster Analysis_
    Aamir> (1990) by Kaufman & Rousseeuw, the authors discuss
    Aamir> the FANNY algorithm. On pages 189-190, they attempt
    Aamir> to demonstrate an equivalence between Fuzzy c-Means
    Aamir> (FCM) and FANNY. 

The section is called "*Related* Methods and References"
and they (as most Statisticians) use the word "k-Means"...

The first point in that section is to explain the
NON-equivalence:  fanny() works with arbitrary dissimilarities d[i,j]
whereas all versions of k-means have to assume a euclidean
measurement space.  K&R show that *when* you have a data matrix X,
and then use distances  d[i,j] :=  || x_{.,i} - x_{.,j} ||^2
(i.e. SQUARED Euclidean distances), then fanny() does the same
as fuzzy k-means.  But they even say there, that they'd rather
use the non-squared distance {as fanny() does by default}, i.e.,
exponent 1, not 2,  for good robustness reasons.  If this is
your 'm' below, then fanny() already does what you might want by default.
OTOH, you can always pass squared distances to fanny() ..

    Aamir> In doing so they, appear to be
    Aamir> assuming that the value of the 'm' parameter in FCM
    Aamir> (a measure of the fuzziness) is fixed at m=2.

there is no 'm' in the book there, but they talk about the
exponent "^ 2" used in some places {but "^ 1" in other places},
notably in   5.2 "Why did we choose FANNY?" 

There is no "fuzziness" parameter defined there, so can you be
more specific?

    Aamir>  Although this is how FCM was originally
    Aamir> formulated, it eventually became apparent that m
    Aamir> should be a user-specified parameter and not a fixed
    Aamir> value of m=2.  My question, then, is
    Aamir> twofold. Firstly, am I correct in saying that Kaufman
    Aamir> & Rousseeuw have assumed m=2 in their formulation of
    Aamir> Fuzzy c-Means and FANNY? Secondly, is it possible to
    Aamir> modify the FANNY algorithm to allow
    Aamir> user-specification of the m (fuzziness) parameter?

Maybe, if you tell me what you are taking about.
See the section 5.2 mentioned above
Is it the exponent  2 in  u_{jv}^2 ?
That one is currently fixed at 2, and yes, that could be made a
parameter though K & R warn against going all the way to "1"
where their algorithm can happend to converge very slowly.


    Aamir> 2. What do I do with training data?  Is there any way
    Aamir> to use FANNY for assigning clustering membership
    Aamir> values to new, test data? In Fuzzy c-Means, new data
    Aamir> is compared to the cluster centers in order to assign
    Aamir> clustering membership values to the test
    Aamir> data. However, in FANNY these centers do not
    Aamir> exist.

correct.  Note again that in general such centers don't even
make sense, since the sample space may contain unordered
categorical variables mixed with continuous ones {and you would
be advised to use  daisy(), not dist() to compute
dissimilarities for such data}.

    Aamir> Is there, then, any way to compute the FANNY
    Aamir> clustering membership values of a test data point
    Aamir> without affecting the clustering membership values of
    Aamir> the training data? Perhaps there are enough
    Aamir> conditions to use the objective function as a way of
    Aamir> computing the membership values of the test data?

    Aamir> Aamir M University of Toronto

That's an interesting proposal, at least the way I choose to
understand you :-)

Yes, why not look at the objective function C {eq.(1), p.182}

One could think of optimizing it with respect to new data only,
by keeping all "old data" memberships.
For that to work, one would need the n dissimilarites 
    d[i', j]   where  i'  : `index for' new data 
	       j = 1,..,n : indices for training data.
Is this feasible in your situation?

Alternatively, when we *did* assume ``all continuous'' data 
*and* the use of simple Euclidean distances,
we could easily compute the cluster centers, determine (by
minimization!) memberships for new observations.

In any case that needs some assumptions (and code!) currently
not part of fanny().