Using pam, agnes or clara as prediction models? - R-help

Wed, Jan 14, 2004 12:18 PM

If pam produces the cluster medoids, you should be able to use the
1-nearest-neighbor classifier for prediction of future data, using the
medoids as the `training' data.  1-NN is available in the `class' package,
part of the `VR' bundle.

HTH,
Andy
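
A minimal sketch of the suggestion above, assuming the `cluster` and `class` packages are installed. The two points in `new_pts` are made up for illustration, and the medoids are simply labelled 1 to 4 in the order `pam` returns them; this is one way to set it up, not necessarily what Andy had in mind.

```r
library(cluster)  # pam() and the ruspini data
library(class)    # knn1(), the 1-nearest-neighbour classifier

data(ruspini)
fit <- pam(ruspini, k = 4)

# One medoid per cluster; row i of fit$medoids is taken as cluster i.
labels <- factor(seq_len(nrow(fit$medoids)))

# Hypothetical new observations (not part of ruspini).
new_pts <- data.frame(x = c(20, 100), y = c(60, 20))
pred_new <- knn1(train = fit$medoids, test = new_pts, cl = labels)

# Sanity check: classifying the original points by nearest medoid
# should largely reproduce pam's own assignment.
pred_old <- knn1(train = fit$medoids, test = ruspini, cl = labels)
agreement <- mean(as.integer(pred_old) == fit$clustering)
agreement
```

If `agreement` is close to 1, the 1-NN rule on the medoids is behaving as a prediction rule for the fitted clustering.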

Renald Buter

Thu, Jan 15, 2004 12:07 AM

On Wed, Jan 14, 2004 at 03:18:10PM -0500, Liaw, Andy wrote:

Thanks very much for your quick answer! I've tried your suggestion in
the following way:

 # separate the ruspini data into train and test set
 > train<-ruspini[1:50,]
 > test<-ruspini[51:75,]
 > pamx<-pam(train,4)
 > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=3)
 > knnx
 [1] d d b b d c b c c d c a a d c c a a c a a d c d a
 Levels: a b c d

But the result of applying the test set to the knn should only contain 2
clusters, since the upper half of the ruspini data contains only 2
clusters.

Could you tell me what I am missing here?

Regards,

Renald

Brian Ripley

Thu, Jan 15, 2004 12:32 AM

On Thu, 15 Jan 2004, Renald Buter wrote:

You asked that the upper half be divided into 4 clusters.  Did you look at 
the object pamx?  It contains 4 clusters covering only the first part of 
the dataset.

Given that when you apply pam to the whole dataset there is a cluster that
only occurs for objects 61:75, there is no way you can find that cluster
when no member of it is in your training set.

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
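
A short sketch of how one might check both points made in this reply, assuming the `cluster` package is installed. The names `full`, `block` and `split_table` are introduced here for illustration only.

```r
library(cluster)
data(ruspini)

# Clustering of the first 50 rows only (the original `train`).
pamx <- pam(ruspini[1:50, ], k = 4)
pamx$medoids              # four medoids, all taken from rows 1:50
table(pamx$clustering)    # sizes of the four training clusters

# Clustering of all 75 rows, cross-tabulated against the split.
full <- pam(ruspini, k = 4)
block <- ifelse(seq_len(nrow(ruspini)) <= 50, "rows 1-50", "rows 51-75")
split_table <- table(cluster = full$clustering, block = block)
split_table
```

A zero in the "rows 1-50" column marks a cluster of the full data that has no member in the training rows, which is the situation described above.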

Renald Buter

Thu, Jan 15, 2004 12:46 AM

On Thu, Jan 15, 2004 at 08:32:45AM +0000, Prof Brian Ripley wrote:

Yes, that was what I understood. My objective was to use this division
by applying it to the test set: for each point in the test set, predict
what cluster it would enter.

But isn't that what knn does: locate the nearest neighbour of a point
and assign its (nn) label to the point-to-be-classified?

I thought that I was doing:
 1. create a clustering of data using PAM
 2. train a knn with the medoids of the PAM clustering
 3. apply the knn to the test set
 4. look at the result

Could you tell me what I'm not getting here?

Regards,

Renald

Brian Ripley

Thu, Jan 15, 2004 12:59 AM

On Thu, 15 Jan 2004, Renald Buter wrote:

You created a clustering of the training set, yet interpreted it against
the clustering of the whole set using the now irrelevant statement

`the upper half of the ruspini data contains only 2 clusters'

which applies to the wrong clustering.  I pointed out that the training 
set does not contain a single member of one of _those_ clusters so you are 
bound to get a completely different clustering.

When you divide a dataset into `training' and `testing' sets you are
assuming at least exchangeability, whereas this dataset is clearly ordered.
So it is not credible that `train' and `test' are samples from the same
population.
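
A small sketch of what "clearly ordered" looks like in practice, assuming the `cluster` package is installed. The seed is arbitrary and `runs` and `idx` are names introduced here for illustration.

```r
library(cluster)
data(ruspini)

full <- pam(ruspini, k = 4)$clustering

# Runs of identical labels in row order: a few long runs mean the rows
# are sorted by cluster rather than shuffled.
runs <- rle(unname(full))
runs$lengths

# A random subset of 50 rows, by contrast, draws from every cluster.
set.seed(1)  # arbitrary seed, only for reproducibility
idx <- sample(nrow(ruspini), 50)
table(full[idx])
```

Taking rows 1:50 cuts through those runs, whereas a random subset gives a training set that looks like the rest of the data.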

Renald Buter

Thu, Jan 15, 2004 3:22 AM

On Thu, Jan 15, 2004 at 08:59:37AM +0000, Prof Brian Ripley wrote:

[snip]

Thank you *very* much for your help. I thought I'd let the list know
what I did to get it right:

 # create a seed vector
 > seed<-rank(runif(75))
 > train<-ruspini[seed[1:60],]
 > test<-ruspini[seed[61:75],]
 > pamx<-pam(train,4)
 > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=1)

And now the result makes sense!

Thanks again,

Renald
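
One change in the code above is easy to miss: k went from 3 to 1. With a single medoid per label, the three nearest medoids always carry three different labels, so a k=3 vote is always tied, and `knn` is documented to break ties at random. A small sketch of the effect, reusing the original ordered split (the seeds are arbitrary, and `a3`, `b3`, `a1`, `b1` are names introduced here):

```r
library(cluster)
library(class)
data(ruspini)

train <- ruspini[1:50, ]
test  <- ruspini[51:75, ]
pamx  <- pam(train, 4)
cl    <- factor(c("a", "b", "c", "d"))

# k = 3: every vote is a three-way tie, so the label depends on the RNG.
set.seed(1); a3 <- knn(pamx$medoids, test, cl, k = 3)
set.seed(2); b3 <- knn(pamx$medoids, test, cl, k = 3)
mean(a3 == b3)   # share of test points given the same label in both runs

# k = 1: each point takes the label of its nearest medoid
# (barring exact distance ties, which would again be broken at random).
set.seed(1); a1 <- knn(pamx$medoids, test, cl, k = 1)
set.seed(2); b1 <- knn(pamx$medoids, test, cl, k = 1)
all(a1 == b1)
```

So part of the confusing first result probably came from k=3, independently of how the data were split.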

Martin Maechler

Thu, Jan 15, 2004 7:05 AM

<....>

    Renald> Thank you *very* much for your help. I thought I'd let the list know
    Renald> what I did to get it right:

    >> # create a seed vector
    >> seed<-rank(runif(75))

S has a function for this:
     seed <- sample(75)

     ## (and "seed" is not a very sensible name here)

    >> train<-ruspini[seed[1:60],]
    >> test<-ruspini[seed[61:75],]
    >> pamx<-pam(train,4)
    >> knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=1)

Note on style:

  Using " " (space) in S statements is very much recommended for
  readability, particularly space around "<-", i.e. " <- "
  (and this is provided with one key stroke by ESS and R-WinEdt)

Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><
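
Putting the thread's suggestions together, the workflow might be written as follows. This is a sketch, assuming the `cluster` and `class` packages are installed; the seed is arbitrary, and `perm`, `pred`, `full` and `check` are names chosen here rather than anything used in the thread.

```r
library(cluster)
library(class)
data(ruspini)

set.seed(42)                    # arbitrary, for a reproducible split
perm  <- sample(nrow(ruspini))  # random permutation of 1:75
train <- ruspini[perm[1:60], ]
test  <- ruspini[perm[61:75], ]

pamx <- pam(train, k = 4)
pred <- knn1(pamx$medoids, test, cl = factor(c("a", "b", "c", "d")))

# Compare with a clustering of the full data. Cluster numbering may
# differ between the two fits, so look for one dominant cell per row.
full  <- pam(ruspini, k = 4)$clustering
check <- table(predicted = pred, full = full[perm[61:75]])
check
```

If each predicted label lines up with a single cluster of the full data, the medoid-based 1-NN rule is reproducing the structure found by `pam` on all 75 points.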