cluster

3 messages · Christian Hennig, Weiwei Shi

#
Dear listers:

Here I have a question on clustering methods available in R. I am
trying to down-sample the majority class in a classification problem
on an imbalanced dataset. Since I don't want to lose information in
the original dataset, I don't want to use naive down-sampling: I think
using clustering on the majority class to select
"representative" samples might help. So, my question is: which
clustering method should be tested to get the best result? I think the
key thing might be the choice of "distance", considering that in the
next step I would like to use decision trees.

Please share your experience in using clustering. (Any available
implementation outside R is also welcome.)

weiwei
#
Dear Weiwei,

your question sounds a bit too general and complicated for the R-list.
Perhaps you should look for personal statistical advice.
The quality of methods (and especially distance choice) for down-sampling
certainly depends on the structure of the data set. I do not see at the moment why
you need any down-sampling at all, and you should find out first if and
why it's a good thing to do (by whatever method).

An obvious candidate for a clustering algorithm would be pam/clara in
package cluster, because this approach chooses points already in the data
set as cluster centers (medoids) and therefore produces a proper subsample,
which does not apply to most other clustering methods.
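
A minimal sketch of this idea, not a recommendation: the toy data, the
column names and the choice k = 50 are assumptions made only for
illustration. pam() reports the row indices of its medoids in id.med;
for larger data, clara() reports them in i.med.

```r
library(cluster)  # provides pam() and clara()

set.seed(1)
# Hypothetical toy data: 500 majority rows, 25 minority rows
majority <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
minority <- data.frame(x1 = rnorm(25, mean = 2), x2 = rnorm(25, mean = 2))

k <- 50  # number of majority "representatives" to keep (arbitrary choice)

# The medoids are actual observations; id.med holds their row indices
fit  <- pam(majority, k = k)
keep <- fit$id.med
# For a large majority class, clara() is the sampling-based variant:
# keep <- clara(majority, k = k)$i.med

# Down-sampled training set: medoids of the majority class plus all minority rows
balanced <- rbind(
  data.frame(majority[keep, ], class = "majority", stringsAsFactors = FALSE),
  data.frame(minority,         class = "minority", stringsAsFactors = FALSE)
)
balanced$class <- factor(balanced$class)
table(balanced$class)
```

By default pam() uses Euclidean distance on the raw columns. A different
dissimilarity (for example one computed with daisy() from the same package)
can be passed in instead, which is where the choice of "distance" comes in.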

However, in
 C. Hennig and L. J. Latecki:  The choice of vantage objects for image
retrieval.  Pattern Recognition 36 (2003), 2187-2196.
the clustering approach has been clearly outperformed by some stepwise
selection approaches for down-sampling - admittedly in a different kind of
problem, but I think that the reasons for this may apply also to your
situation.

You can compare different clusterings (or choices of a subset) by
cross-validation or bootstrap applied to the resulting decision tree
in the classification problem.
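
A rough sketch of such a comparison by cross-validation, under stated
assumptions: the toy data, the helper names (pick_pam, pick_random,
cv_score) and the use of balanced accuracy as the score are all
hypothetical choices, and rpart stands in for "decision tree". Each fold
is held out from the full data, and down-sampling is applied to the
training part only.

```r
library(cluster)  # pam()
library(rpart)    # rpart() decision trees

set.seed(2)
# Hypothetical toy data: 400 majority rows, 40 minority rows
dat <- data.frame(
  x1 = c(rnorm(400), rnorm(40, mean = 1.5)),
  x2 = c(rnorm(400), rnorm(40, mean = 1.5)),
  class = factor(rep(c("maj", "min"), times = c(400, 40)))
)

# Two ways of choosing k majority rows (hypothetical helper names)
pick_pam    <- function(x, k) pam(x, k = k)$id.med
pick_random <- function(x, k) sample(nrow(x), k)

# Stratified fold labels, so every fold contains both classes
folds <- 5
fold <- integer(nrow(dat))
for (cl in levels(dat$class)) {
  idx <- which(dat$class == cl)
  fold[idx] <- sample(rep(seq_len(folds), length.out = length(idx)))
}

# Mean balanced accuracy of trees trained on down-sampled training folds
cv_score <- function(dat, fold, pick) {
  scores <- sapply(sort(unique(fold)), function(f) {
    train <- dat[fold != f, ]
    test  <- dat[fold == f, ]
    maj_rows <- train[train$class == "maj", ]
    min_rows <- train[train$class == "min", ]
    # Keep as many majority rows as there are minority rows
    keep <- pick(maj_rows[, c("x1", "x2")], k = nrow(min_rows))
    tree <- rpart(class ~ x1 + x2, data = rbind(maj_rows[keep, ], min_rows),
                  method = "class")
    pred <- predict(tree, newdata = test, type = "class")
    # Average of per-class recall, so the majority class does not dominate
    mean(tapply(pred == test$class, test$class, mean))
  })
  mean(scores)
}

res <- c(pam    = cv_score(dat, fold, pick_pam),
         random = cv_score(dat, fold, pick_random))
print(res)
```

Both strategies are scored on the same folds, so any difference comes from
how the majority rows were chosen rather than from the split itself.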

Best,
Christian
On Mon, 25 Jul 2005, Weiwei Shi wrote:

            
*** NEW ADDRESS! ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
#
Dear Chris:

You are right, and it IS too general. I think I should have asked
something like "what kinds of clustering algorithms or functions are
available in R", which might be easier. But for that, I can probably
google or use help() in R to find out. I want to know more about the
performance of clustering on this kind of problem, and hope someone can
share previous experience if he/she has had a similar situation or
problem before. And I will share my experience later :)

As to the reason for using down-sampling here, it is one of the
straightforward ways to deal with an imbalanced-data classification
problem. In my understanding of classification problems, among others,
two things are important: feature construction/selection and sample
selection. I had an idea (which might already have been discovered by
others) that finding the best subset of features in clustering (to get
the highest inter-cluster dissimilarity and the largest intra-cluster
similarity) might help the next classification process. I quickly read
through the abstract of your paper and I think your approach there is
applying feature selection (use p instead of n), while here, in my
proposal, I would like to try both.

thanks for further advice!

weiwei
On 7/26/05, Christian Hennig <chrish at stats.ucl.ac.uk> wrote: