Using pam, agnes or clara as prediction models?

7 messages · Liaw, Andy, Brian Ripley, Renald Buter +1 more

#
If pam produces the cluster medoids, you should be able to use the
1-nearest-neighbor classifier for prediction of future data, using the
medoids as the `training' data.  1-NN is available in the `class' package,
part of the `VR' bundle.
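
A minimal sketch of this suggestion, assuming the `cluster` and `class` packages are installed (`class` was once part of the `VR` bundle and is now shipped as a package of its own). The `new_points` values are invented for illustration and do not come from this thread.

```r
library(cluster)  # pam() and the ruspini data
library(class)    # knn1(), the 1-nearest-neighbour classifier

data(ruspini)
fit <- pam(ruspini, 4)

# The medoids are actual observations, one per cluster. Label them 1..4
# in the row order of fit$medoids, which is how pam() numbers its clusters.
medoid_labels <- factor(seq_len(nrow(fit$medoids)))

# Hypothetical future observations (assumed values, illustration only).
new_points <- data.frame(x = c(20, 100), y = c(60, 120))

# Each new point receives the label of its nearest medoid.
predicted <- knn1(train = fit$medoids, test = new_points, cl = medoid_labels)
predicted
```

Because the medoids are the only "training" points, the prediction is simply "which medoid is closest", so the quality of the result depends entirely on how well the medoids represent the data.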

HTH,
Andy
#
On Wed, Jan 14, 2004 at 03:18:10PM -0500, Liaw, Andy wrote:
Thanks very much for your quick answer! I've tried your suggestion in
the following way:

 # separate the ruspini data into train and test set
 > train<-ruspini[1:50,]
 > test<-ruspini[51:75,]
 > pamx<-pam(train,4)
 > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=3)
 > knnx
 [1] d d b b d c b c c d c a a d c c a a c a a d c d a
 Levels: a b c d

But the result of applying the test set to the knn should only contain 2
clusters, since the upper half of the ruspini data contains only 2
clusters.

Could you tell me what I am missing here?

Regards,

Renald
#
On Thu, 15 Jan 2004, Renald Buter wrote:

You asked that the upper half be divided into 4 clusters.  Did you look at 
the object pamx?  It contains 4 clusters covering only the first part of 
the dataset.

Given that when you apply pam to the whole dataset there is a cluster that
only occurs for objects 61:75, there is no way you can find that cluster
when no member of it is in your training set.
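
One way to check this is to cross-tabulate a clustering of the full dataset against the row ranges used for the split. A small sketch, assuming `pam` with 4 clusters on all 75 rows; `full` and `part` are illustrative names, not objects from the thread.

```r
library(cluster)
data(ruspini)

# Cluster the complete dataset, not just the first 50 rows.
full <- pam(ruspini, 4)

# Mark each row by the part of the ordered split it fell into.
part <- ifelse(seq_len(nrow(ruspini)) <= 50, "train (1:50)", "test (51:75)")

# Rows: clusters of the full data; columns: how many members each part holds.
table(cluster = full$clustering, part = part)
```

A cluster whose count in the training column is zero cannot be represented by any medoid computed from the training rows alone.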
#
On Thu, Jan 15, 2004 at 08:32:45AM +0000, Prof Brian Ripley wrote:
Yes, that was what I understood. My objective was to use this division
by applying it to the test set: for each point in the test set, predict
which cluster it would fall into.
But isn't that what knn does: locate the nearest neighbour of a point
and assign its (the neighbour's) label to the point to be classified?

I thought that I was doing:
 1. create a clustering of data using PAM
 2. train a knn with the medoids of the PAM clustering
 3. apply the knn to the test set
 4. look at the result
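
The four steps can be sketched as below, using the same ordered split as in the earlier message. One detail is worth noting: the earlier call used `k=3`, and with only four medoids, each carrying a different label, three neighbours always produce a three-way tie, which `knn` breaks at random. With `k = 1` the call does what is described here. The manual check at the end is an illustrative addition, not something from the thread.

```r
library(cluster)
library(class)
data(ruspini)

# 1. create a clustering of the training rows using PAM
train <- ruspini[1:50, ]
test  <- ruspini[51:75, ]
pamx  <- pam(train, 4)

# 2. + 3. use the medoids as training data for 1-NN and apply it to the test set
labels <- factor(c("a", "b", "c", "d"))
knnx   <- knn(pamx$medoids, test, labels, k = 1)

# 4. look at the result
table(knnx)

# Illustration: compute the nearest medoid by hand (squared Euclidean distance).
# Barring exact distance ties, this should agree with the knn() result.
d2 <- sapply(seq_len(nrow(pamx$medoids)), function(j) {
  rowSums(sweep(as.matrix(test), 2, pamx$medoids[j, ])^2)
})
nearest <- labels[apply(d2, 1, which.min)]
```

This only fixes the classifier call; the labels still refer to the clusters found in rows 1:50, not to the clusters of the whole dataset.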

Could you tell me what I'm not getting here?

Regards,

Renald
#
On Thu, 15 Jan 2004, Renald Buter wrote:

You created a clustering of the training set, yet interpreted it against
the clustering of the whole set using the now irrelevant statement

`the upper half of the ruspini data contains only 2 clusters'

which applies to the wrong clustering.  I pointed out that the training 
set does not contain a single member of one of _those_ clusters so you are 
bound to get a completely different clustering.

When you divide a dataset into `training' and `testing' sets you are 
assuming at least exchangeability, whereas this dataset is clearly ordered.
So it is not credible that `train' and `test' are samples from the same 
population.
#
On Thu, Jan 15, 2004 at 08:59:37AM +0000, Prof Brian Ripley wrote:
[snip]
[snip]
Thank you *very* much for your help. I thought I'd let the list know
what I did to get it right:

 # create a seed vector
 > seed<-rank(runif(75))
 > train<-ruspini[seed[1:60],]
 > test<-ruspini[seed[61:75],]
 > pamx<-pam(train,4)
 > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=1)

And now the result makes sense!

Thanks again,

Renald
#
<....>

    Renald> Thank you *very* much for your help. I thought I'd let the list know
    Renald> what I did to get it right:

    >> # create a seed vector
    >> seed<-rank(runif(75))

S has a function for this :
     seed <- sample(75)

     ## (and "seed" is not a very sensible name here)

    >> train<-ruspini[seed[1:60],]
    >> test<-ruspini[seed[61:75],]
    >> pamx<-pam(train,4)
    >> knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=1)

Note on style:

  Using " " (space) in S statements is very much recommended for
  readability, particularly
  space around "<-", i.e. " <- " 
	 (and this is provided with one key stroke by ESS and R-WinEdt)
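
Putting both suggestions together, the corrected example might look like the sketch below. The name `idx` and the call to `set.seed` are assumptions added here (the latter only to make the random split reproducible); neither appears in the thread.

```r
library(cluster)
library(class)
data(ruspini)

set.seed(1)                   # assumed seed, any value will do
idx <- sample(nrow(ruspini))  # random permutation of the 75 row indices

train <- ruspini[idx[1:60], ]
test  <- ruspini[idx[61:75], ]

pamx <- pam(train, 4)
knnx <- knn(pamx$medoids, test, factor(c("a", "b", "c", "d")), k = 1)
table(knnx)
```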

Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><