Using pam, agnes or clara as prediction models?

Thu, Jan 15, 2004 12:59 AM

On Thu, 15 Jan 2004, Renald Buter wrote:

On Thu, Jan 15, 2004 at 08:32:45AM +0000, Prof Brian Ripley wrote:

On Thu, 15 Jan 2004, Renald Buter wrote:

On Wed, Jan 14, 2004 at 03:18:10PM -0500, Liaw, Andy wrote:

If pam produces the cluster medoids, you should be able to use the
1-nearest-neighbor classifier for prediction of future data, using the
medoids as the `training' data.  1-NN is available in the `class' package,
part of the `VR' bundle.

Thanks very much for your quick answer! I've tried your suggestion in
the following way:

 # separate the ruspini data into train and test set

 > train<-ruspini[1:50,]
 > test<-ruspini[51:75,]
 > pamx<-pam(train,4)
 > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=3)
 > knnx

 [1] d d b b d c b c c d c a a d c c a a c a a d c d a
 Levels: a b c d

But the result of applying the test set to the knn should only contain 2
clusters, since the upper half of the ruspini data contains only 2
clusters.

Could you tell me what I am missing here?

You asked that the upper half be divided into 4 clusters.  Did you look at 
the object pamx?  It contains 4 clusters covering only the first part of 
the dataset.

Yes, that was what I understood. My objective was to use this division
by applying it to the test set: for each point in the test set, predict
what cluster it would enter.

Given that when you apply pam to the whole dataset there is a cluster that
only occurs for objects 61:75, there is no way you can find that cluster
when no member of it is in your training set.
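One way to check this claim is to cluster the whole dataset and list which row numbers land in each cluster. This is only a sketch: the names `full` and `members` are made up for illustration, and it assumes the `cluster` package (which ships `pam` and the `ruspini` data) is installed.

```r
library(cluster)  # provides pam() and the ruspini data

data(ruspini)

# Cluster all 75 observations into 4 groups
full <- pam(ruspini, 4)

# Row positions grouped by assigned cluster; cluster numbers are arbitrary
members <- split(seq_len(nrow(ruspini)), full$clustering)

# Smallest and largest row position in each cluster
lapply(members, range)
```

If one of the ranges lies entirely above row 50, no object from that cluster can be in `ruspini[1:50,]`.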

But isn't that what knn does: locate the nearest neighbour of a point
and assign its (nn) label to the point to be classified?

I thought that I was doing:
 1. create a clustering of data using PAM
 2. train a knn with the medoids of the PAM clustering
 3. apply the knn to the test set
 4. look at the result
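The four steps above can be sketched as follows, assuming the `cluster` and `class` packages are installed. Note one difference from the call quoted earlier in the thread: this sketch uses `k = 1`, as in the original 1-NN suggestion. With only one medoid per class, `k = 3` asks for a vote among three points that all carry different labels, and `knn` breaks such ties at random. The name `medoid_labels` is made up for illustration.

```r
library(cluster)  # pam() and the ruspini data
library(class)    # knn()

data(ruspini)

# 1. cluster the training rows with PAM
train <- ruspini[1:50, ]
test  <- ruspini[51:75, ]
pamx  <- pam(train, 4)

# 2. use the medoids as the "training" set, one label per medoid,
#    in the order the medoids are stored in pamx$medoids
medoid_labels <- factor(seq_len(nrow(pamx$medoids)))

# 3. assign each test point the label of its single nearest medoid
pred <- knn(train = pamx$medoids, test = test, cl = medoid_labels, k = 1)

# 4. look at the result
table(pred)
```

The labels refer to the clusters found in `train` only, so they need not correspond to the clusters obtained from the full dataset.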

Could you tell me what I'm not getting here?

You created a clustering of the training set, yet interpreted it against
the clustering of the whole set using the now irrelevant statement

`the upper half of the ruspini data contains only 2 clusters'

which applies to the wrong clustering.  I pointed out that the training 
set does not contain a single member of one of _those_ clusters so you are 
bound to get a completely different clustering.

When you divide a dataset into `training' and `testing' sets you are 
assuming at least exchangeability, whereas this dataset is clearly ordered.
So it is not credible that `train' and `test' are samples from the same 
population.
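A random split is one way to make the exchangeability assumption more plausible. The sketch below is illustrative only: the seed is arbitrary, the 50/25 split simply mirrors the sizes used earlier in the thread, and the names `idx` and `full` are made up.

```r
library(cluster)  # pam() and the ruspini data
library(class)    # knn()

data(ruspini)
set.seed(1)  # arbitrary seed, only so the split is reproducible

# Draw 50 row positions at random instead of taking the first 50
idx   <- sample(nrow(ruspini), 50)
train <- ruspini[idx, ]
test  <- ruspini[-idx, ]

# Cluster the training rows and classify test rows by nearest medoid
pamx <- pam(train, 4)
pred <- knn(pamx$medoids, test, factor(1:4), k = 1)

# Compare with the clustering of the full data on the same test rows.
# Cluster numbering is arbitrary, so read the cross-table rather than
# expecting the labels themselves to match.
full <- pam(ruspini, 4)
table(predicted = pred, full = full$clustering[-idx])
```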

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
