Document clustering for R

5 messages · Raymond K Pon, Christian Hennig, David Ruau +1 more

#
I'm working on a project related to document clustering. I know that R
has clustering algorithms such as clara, but clara supports only two
distance metrics, Euclidean and Manhattan, which are not very useful for
clustering documents. I was wondering how easy it would be to extend the
clustering package in R to support other distance metrics, such as
cosine distance, or whether there is an API for custom distance metrics.

Best regards,
Raymond Pon
pon3 at llnl.gov
x43062
#
If you are able to implement the computation of the distance matrix, you
can use methods such as pam, agnes and hclust, which operate on
dissimilarity matrices of any kind. You may also perform a
multidimensional scaling with isoMDS, sammon or cmdscale and use any
clustering algorithm for n*p data on the outcome.
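
A minimal sketch of both routes, assuming a small made-up document-term matrix. The names `dtm` and `cosine_dist` are illustrative and not part of any package; only base R functions (`as.dist`, `hclust`, `cutree`, `cmdscale`, `kmeans`) are used.

```r
# Toy document-term matrix: rows are documents, columns are term counts.
# (Hypothetical data, for illustration only.)
dtm <- matrix(c(2, 1, 0, 0,
                3, 2, 0, 1,
                0, 0, 4, 1,
                0, 1, 3, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("gene", "cell", "stock", "bond")))

# Cosine distance = 1 - cosine similarity between row vectors.
cosine_dist <- function(x) {
  norms <- sqrt(rowSums(x^2))                  # Euclidean length of each row
  sim <- tcrossprod(x) / outer(norms, norms)   # pairwise cosine similarity
  as.dist(1 - sim)                             # convert to a "dist" object
}

d <- cosine_dist(dtm)

# Route 1: hierarchical clustering directly on the dissimilarities.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

# Route 2: embed with classical MDS, then cluster the n*p coordinates.
coords <- cmdscale(d, k = 2)
km <- kmeans(coords, centers = 2, nstart = 10)
```

The same `d` can be passed to `cluster::pam(d, k = 2)` or `cluster::agnes(d)`, since both accept a dissimilarity object in place of raw data.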

Best,
Christian
On Mon, 12 Sep 2005, Raymond K Pon wrote:

            
*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
#
Hi,
We discovered that the package "amap" contains a distance calculation
function called Dist, which can compute distances with a method called
"pearson". This is in fact the "not centered Pearson" distance, which
seems to be the cosine distance.
What do you think of that?
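
One way to check this is to compute the cosine distance by hand and compare. The sketch below assumes a small made-up matrix `x`; the call to `amap::Dist` runs only if amap is installed, and the comparison is printed rather than assumed.

```r
# Hypothetical data: three row vectors to compare.
x <- rbind(a = c(1, 2, 0, 3),
           b = c(2, 4, 1, 0),
           c = c(0, 1, 5, 2))

# Cosine distance by hand: 1 - sum(x*y) / sqrt(sum(x^2) * sum(y^2)).
norms <- sqrt(rowSums(x^2))
d_cosine <- as.dist(1 - tcrossprod(x) / outer(norms, norms))

# Centered Pearson distance for contrast: 1 - correlation between rows.
d_centered <- as.dist(1 - cor(t(x)))

# Compare against amap only if the package is available.
if (requireNamespace("amap", quietly = TRUE)) {
  d_amap <- amap::Dist(x, method = "pearson")
  print(all.equal(as.vector(d_amap), as.vector(d_cosine)))
}
```

The centered version subtracts each row's mean before taking the inner product, so it generally differs from the cosine distance unless the rows already have mean zero.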

Best regards,
David
On Sep 12, 2005, at 21:47, Raymond K Pon wrote:

            
#
On Mon, 2005-09-12 at 12:47 -0700, Raymond K Pon wrote:
You don't have to extend the "clustering package in R to support other
distance metrics", but you should take care to produce your
dissimilarities (or distances) in the standard format so that they can
be used in the "clustering package", in cmdscale, in isoMDS, or in any
other function expecting a "dist" object.  The "clustering package" will
support new dissimilarities if they are written in a standard-conforming
way. There are several packages that offer alternative dissimilarities
(and some even distances) that can be used in clustering functions. Look
for "distances" or "dissimilarities" on the R Site. Some of these could
be the one for you... I would be surprised if the cosine index were
missing (and if needed, I could write it for you in C, but I don't think
that is necessary).

cheers, jari oksanen
#
On Tue, 13 Sep 2005, Jari Oksanen wrote:

            
A distance matrix m can be converted to the standard dist format
simply with as.dist(m).
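
For example, a hand-built dissimilarity matrix can be converted and passed straight to a clustering function. This is only a sketch: the term sets in `docs` and the helper `jaccard` are made up for illustration, and any other symmetric dissimilarity could be substituted.

```r
# Hypothetical term sets for three documents.
docs <- list(d1 = c("gene", "cell", "protein"),
             d2 = c("gene", "protein", "assay"),
             d3 = c("stock", "bond"))

# Illustrative custom dissimilarity: Jaccard distance between term sets.
jaccard <- function(a, b) 1 - length(intersect(a, b)) / length(union(a, b))

# Fill a full symmetric matrix m with pairwise dissimilarities.
n <- length(docs)
m <- matrix(0, n, n, dimnames = list(names(docs), names(docs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    m[i, j] <- jaccard(docs[[i]], docs[[j]])
  }
}

# as.dist keeps the lower triangle and the row labels.
d <- as.dist(m)

# The result is a "dist" object that hclust accepts directly.
hc <- hclust(d, method = "average")
```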

Christian

