Using pam, agnes or clara as prediction models?
On Thu, Jan 15, 2004 at 08:59:37AM +0000, Prof Brian Ripley wrote:
[snip]
# separate the ruspini data into train and test set
> train<-ruspini[1:50,]
> test<-ruspini[51:75,]
> pamx<-pam(train,4)
> knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=3)
> knnx
[1] d d b b d c b c c d c a a d c c a a c a a d c d a
Levels: a b c d
But the result of applying the test set to the knn should only contain 2 clusters, since the upper half of the ruspini data contains only 2 clusters. Could you tell me what I am missing here?
[snip]
When you divide a dataset into `training' and `testing' sets you are assuming at least exchangeability, whereas this dataset is clearly ordered. So it is not credible that `train' and `test' are samples from the same population.
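One way to see the ordering for yourself is to cluster the full data set and tabulate the cluster labels against the two row blocks used in the original split. This is only a sketch: the names full and block are made up for illustration, and it assumes the cluster package is installed.

```r
library(cluster)  # provides pam() and the ruspini data

data(ruspini)

# Cluster all 75 points, then cross-tabulate the cluster labels against
# the row blocks used for the original train/test split.
full  <- pam(ruspini, 4)
block <- ifelse(seq_len(nrow(ruspini)) <= 50, "rows 1-50", "rows 51-75")
table(block, cluster = full$clustering)
```

If the rows were in random order, each block should contain points from all four clusters in roughly similar proportions; a table with empty cells suggests the rows are sorted by cluster.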
Thank you *very* much for your help. I thought I'd let the list know
what I did to get it right:
# create a random permutation of the row indices
> seed<-rank(runif(75))
> train<-ruspini[seed[1:60],]
> test<-ruspini[seed[61:75],]
> pamx<-pam(train,4)
> knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=1)
And now the result makes sense!
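For anyone who wants to reproduce this, the same steps can be sketched as a self-contained script. It uses sample() in place of rank(runif(75)); both give a random permutation of the row indices. The set.seed() value and the names idx and labs are arbitrary choices for illustration, not part of what I originally ran.

```r
library(cluster)  # pam() and the ruspini data
library(class)    # knn()

data(ruspini)
set.seed(1)                      # arbitrary seed, only so the split is reproducible
idx   <- sample(nrow(ruspini))   # random permutation of 1..75
train <- ruspini[idx[1:60], ]
test  <- ruspini[idx[61:75], ]

pamx <- pam(train, 4)

# The "training set" passed to knn() is just the 4 medoids, one per class,
# so k = 1 assigns each test point the label of its nearest medoid.
labs <- factor(c("a", "b", "c", "d"))
knnx <- knn(pamx$medoids, test, cl = labs, k = 1)
table(knnx)
```

Note that k matters here as much as the split does. With only four medoids and one label each, k=3 always produces a three-way tie in the vote, and knn() breaks ties at random, so the labels it returns are largely arbitrary.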
Thanks again,
Renald