Message-ID: <Pine.LNX.4.64.0803061212420.6031@egon.stats.ucl.ac.uk>
Date: 2008-03-06T12:18:34Z
From: Christian Hennig
Subject: Clustering large data matrix
In-Reply-To: <47CFCCFE.6070501@uab.cat>

Hi there,

Whether clara is an appropriate clustering method depends strongly on what
your data are and, in particular, on how you want to interpret or use the
clustering. You may do better with a hierarchical method once you have defined
a suitable distance (though this is a matter for statistical consultation
rather than just R help).
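
To make the hierarchical option concrete, here is a rough sketch only. The
simulated matrix stands in for your 68 x 13112 data, and the Euclidean
distance, average linkage and k = 3 are placeholder assumptions, not
recommendations; the right distance depends on what the spectra mean.

```r
# Simulated stand-in for the real data: 68 patients (rows), 200 variables.
set.seed(1)
x <- matrix(rnorm(68 * 200), nrow = 68)

# Assumption: Euclidean distance between patients. Replace this with a
# dissimilarity that makes sense for the application.
d <- dist(x, method = "euclidean")

# Agglomerative hierarchical clustering; "average" linkage is an arbitrary choice.
hc <- hclust(d, method = "average")

# Cut the tree into k groups; k = 3 is purely illustrative.
groups <- cutree(hc, k = 3)
table(groups)

# Dendrogram, useful for judging how many clusters are plausible.
plot(hc, labels = FALSE)
```

Note that dist() works on the 68 rows, so the large number of columns is not
an obstacle here.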

Assuming that you use some reasonable dimension reduction and clustering
method, you can get a good visualization of your clustering using the
functions plotcluster/discrproj in package fpc.
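
As a rough sketch of that workflow (simulated data again; the shifted row
blocks, the 13 components and k = 3 are assumptions chosen only to mirror
your description), one could reduce with prcomp(), cluster with clara(), and
then plot with plotcluster(). Its method = "dc" projects onto discriminant
coordinates, which are chosen to separate the clusters rather than to
maximize overall variance.

```r
library(cluster)  # provides clara()

# Simulated stand-in for the real data, with some artificial group structure.
set.seed(1)
x <- matrix(rnorm(68 * 500), nrow = 68)
x[1:23, ]  <- x[1:23, ] + 1
x[24:46, ] <- x[24:46, ] - 1

# Dimension reduction: keep the first 13 principal component scores.
pc <- prcomp(x)
scores <- pc$x[, 1:13]

# Cluster the reduced data; k = 3 is illustrative only.
cl <- clara(scores, k = 3)

# Plot the clusters in discriminant coordinates (requires package fpc).
if (requireNamespace("fpc", quietly = TRUE)) {
  fpc::plotcluster(scores, cl$clustering, method = "dc")
}
```

The plot is still two-dimensional, but the two axes are computed from all 13
components supplied, not just the first two.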

Best,
Christian

On Thu, 6 Mar 2008, Dani Valverde wrote:

> Hello,
> I have a large data matrix (68x13112), each row corresponding to one
> observation (patients) and each column corresponding to the variables
> (points within an NMR spectrum). I would like to carry out some kind of
> clustering on these data to see how many clusters there are. I have
> tried the function clara() from the package cluster. If I use the matrix
> as is, I can perform the clara analysis but when I call clusplot() I get
> this error:
>
> Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
> 'princomp' can only be used with more units than variables
>
> Then, I reduced the dimensionality using the function prcomp(). I
> took the first 13 principal components (>80% of the variability) and
> carried out the clara() analysis again. Then I called clusplot()
> again and voilà!, it works. The problem is that clusplot() only
> shows the first two components of my prcomp() analysis, which
> account for only 15% of the variability.
> So, my questions are: 1) Is clara() a proper way to analyze such a large
> data set? and 2) Is there an appropriate method for plotting
> my data that takes into account the whole variability of my data, not
> just two principal components?
> Many thanks.
> Best,
>
> Dani
>
> -- 
> Daniel Valverde Saubí
>
> Grup de Biologia Molecular de Llevats
> Facultat de Veterinària de la Universitat Autònoma de Barcelona
> Edifici V, Campus UAB
> 08193 Cerdanyola del Vallès- SPAIN
>
> Centro de Investigación Biomédica en Red
> en Bioingeniería, Biomateriales y
> Nanomedicina (CIBER-BBN)
>
> Grup d'Aplicacions Biomèdiques de la RMN
> Facultat de Biociències
> Universitat Autònoma de Barcelona
> Edifici Cs, Campus UAB
> 08193 Cerdanyola del Vallès- SPAIN
> +34 93 5814126
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche