Using pam, agnes or clara as prediction models?
On Thu, 15 Jan 2004, Renald Buter wrote:
On Thu, Jan 15, 2004 at 08:32:45AM +0000, Prof Brian Ripley wrote:
On Thu, 15 Jan 2004, Renald Buter wrote:
On Wed, Jan 14, 2004 at 03:18:10PM -0500, Liaw, Andy wrote:
If pam produces the cluster medoids, you should be able to use the 1-nearest-neighbor classifier for prediction of future data, using the medoids as the `training' data. 1-NN is available in the `class' package, part of the `VR' bundle.
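A minimal sketch of this suggestion, assuming the `cluster` and `class` packages are installed. The object names `fit`, `labels` and `pred` are illustrative only. It uses k = 1 as suggested: with a single medoid per class, any larger k produces tied votes, which `knn` breaks at random.

```r
library(cluster)  # pam() and the ruspini data
library(class)    # knn()

data(ruspini)

# Cluster the data; fit$medoids holds one representative row per cluster
fit <- pam(ruspini, 4)

# One label per medoid, in the same order as the rows of fit$medoids
labels <- factor(seq_len(nrow(fit$medoids)))

# 1-NN against the medoids: each point gets the label of its nearest medoid
pred <- knn(train = fit$medoids, test = ruspini, cl = labels, k = 1)

# Cross-tabulate the pam assignment against the 1-NN prediction
table(pam = fit$clustering, knn = pred)
```

Here the same data are clustered and predicted, purely to check that the 1-NN labels line up with pam's own assignment. For genuinely new data, pass the new rows as `test`.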
Thanks very much for your quick answer! I've tried your suggestion in the following way:
# separate the ruspini data into train and test set
> train<-ruspini[1:50,]
> test<-ruspini[51:75,]
> pamx<-pam(train,4)
> knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=3)
> knnx
 [1] d d b b d c b c c d c a a d c c a a c a a d c d a
Levels: a b c d
But the result of applying the test set to the knn should only contain 2 clusters, since the upper half of the ruspini data contains only 2 clusters. Could you tell me what I am missing here?
You asked that the upper half be divided into 4 clusters. Did you look at the object pamx? It contains 4 clusters covering only the first part of the dataset.
Yes, that was what I understood. My objective was to use this division by applying it to the test set: for each point in the test set, predict which cluster it would fall into.
Given that when you apply pam to the whole dataset there is a cluster that only occurs for objects 61:75, there is no way you can find that cluster when no member of it is in your training set.
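One way to check this point is to cluster the whole dataset and tabulate cluster membership against the 1:50 / 51:75 split used above. This is only a sketch: the names `full`, `part` and `tab` are illustrative, and the cluster numbering that pam returns is arbitrary.

```r
library(cluster)  # pam() and the ruspini data

data(ruspini)

# Cluster all 75 points, not just the first 50
full <- pam(ruspini, 4)

# Mark which rows went into the training and test sets in the example above
part <- ifelse(seq_len(nrow(ruspini)) <= 50, "train (1:50)", "test (51:75)")

# Rows: clusters of the full data; columns: which part each object belongs to
tab <- table(cluster = full$clustering, part = part)
tab
```

A cluster whose count in the train column is zero has no representative among the first 50 rows, so no medoid fitted on those rows can stand for it.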
But isn't that what knn does: locate the nearest neighbour of a point and assign its (the nearest neighbour's) label to the point to be classified? I thought that I was doing:
1. create a clustering of the data using PAM
2. train a knn with the medoids of the PAM clustering
3. apply the knn to the test set
4. look at the result
Could you tell me what I'm not getting here?
You created a clustering of the training set, yet interpreted it against the clustering of the whole set, using the now irrelevant statement `the upper half of the ruspini data contains only 2 clusters', which applies to the wrong clustering. I pointed out that the training set does not contain a single member of one of _those_ clusters, so you are bound to get a completely different clustering. When you divide a dataset into `training' and `testing' sets you are assuming at least exchangeability, whereas this dataset is clearly ordered. So it is not credible that `train' and `test' are samples from the same population.
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
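The exchangeability point suggests drawing the training rows at random rather than taking the first 50 in file order. The following is a sketch only, not something posted in the thread: the seed, the 50/25 split and the object names are arbitrary choices, and k = 1 follows the original 1-NN suggestion.

```r
library(cluster)  # pam() and the ruspini data
library(class)    # knn()

data(ruspini)

set.seed(1)                        # arbitrary seed, for reproducibility only
idx   <- sample(nrow(ruspini), 50) # 50 row indices drawn at random
train <- ruspini[idx, ]
test  <- ruspini[-idx, ]

# Cluster the random training sample and label the medoids 1..4
fit <- pam(train, 4)
cl  <- factor(seq_len(nrow(fit$medoids)))

# Assign each held-out point to its nearest training medoid
pred <- knn(train = fit$medoids, test = test, cl = cl, k = 1)
table(pred)
```

With a random split, every cluster of the full data has a chance of being represented in `train`, so the held-out points can plausibly fall into any of the four groups.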