Advice on exploration of sub-clusters in hierarchical dendrogram

6 messages · ilai, kosmo7, michael.weylandt at gmail.com (R. Michael Weylandt

#
Dear R user,

I am a biochemist/bioinformatician, at the moment working on protein
clusterings by conformation similarity.

I only started seriously working with R about a couple of months ago.
I have been able so far to read my way through tutorials and set up my
hierarchical clusterings. My problem is that I cannot find a way to obtain
information on the rooting of specific nodes, i.e. of specific clusters of
interest.
In other words, I am trying to obtain/read the sub-clusters of a specific
cluster in the dendrogram, by isolating a specific node and exploring
locally its lower hierarchy.

Please allow me to display some of the code I have been using for your
reference:

library(cluster) #provides clusplot()
df=read.table('mydata.txt', head=T, row.names=1) #read file with distance matrix
d=as.dist(df) #format table as distance matrix
z<-hclust(d,method="complete", members=NULL)
x<-as.dendrogram(z)
plot(x, xlab="mydata complete-LINKAGE", ylim=c(0,4)) #visualization of the dendrogram
clusters<-cutree(z, h=1.6) #obtain clusters at cutoff height=1.6
ord<-cmdscale(d, k=2) #Multidimensional scaling of the data down to 2 dimensions
clusplot(ord,clusters, color=TRUE, shade=TRUE,labels=4, lines=0) #visualization of the clusters in 2D map
var1<-var(clusters==1) #variance of the membership indicator for cluster 1 (not a within-cluster variance)

#extract cluster memberships:
clids = as.data.frame(clusters)
names(clids) = c("id")
clids$cdr = row.names(clids)
row.names(clids) = c(1:dim(clids)[1])
clstructure = lapply(unique(clids$id), function(x){clids[clids$id ==
x,'cdr']})

clstructure[[1]] #get memberships of cluster 1
the members of a specific cluster and then re-apply hierarchical clustering
and start all over again.
But this would take me ages to perform individually for hundreds of clusters.
So, I was hoping someone could point me in the right direction as to how to
take advantage of the initial dendrogram and focus on specific clusters from
which to derive the sub-clusters at a new given cutoff height.

I recently found on this page
http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual

the following code:
clid <- c(1,2)
ysub <- y[names(mycl[mycl%in%clid]),]
# Select sub-cluster number (here: clid=c(1,2)) and generate corresponding dendrogram.
hrsub <- hclust(as.dist(1-cor(t(ysub), method="pearson")), method="complete")

Even with this given example I am afraid I can't work my way around.
So I guess in my case I could grab all the members of a specific cluster
using my existing code and try to reformat the distance matrix in one that
only contains the distances of those members:
cluster1members<-clstructure[[1]]

Then I need to reformat the distance matrix into a new one, say d1, which I
can feed to a new -local- hierarchical clustering:
hrsub<-hclust(d1, method="complete")

Any ideas on how I can obtain a new distance matrix with just the distances
of the members in that clusters, with names contained in vector
"cluster1members" ?

Apologies if this seems trivial, but I really can't find the correct
functions to use for this task.
Thank you very much in advance - as I am really a novice with R, small
chunks of code as example would be of great help.

Take care all - 

--
View this message in context: http://r.789695.n4.nabble.com/Advice-on-exploration-of-sub-clusters-in-hierarchical-dendrogram-tp4414277p4414277.html
Sent from the R help mailing list archive at Nabble.com.
#
See inline
On Thu, Feb 23, 2012 at 8:54 AM, kosmo7 <dnicolgr at hotmail.com> wrote:
To explore or "zoom in" on elements of z you had the first step right:
create x<-as.dendrogram(z) but then you didn't use x anymore (except
for the plot which could have been done on z). Maybe you wanted:
clusters<-cut(x, h=1.6) #obtain clusters at cutoff height=1.6

# clusters is now (after cut x not cutree z) a list of two components:
# upper and lower. Each is in itself a list of dendrograms: the
# structure above 1.6, and the local clusters below:

plot(clusters$upper)  # the structure above 1.6
plot(clusters$lower[[1]])  # cluster 1

# To print the details of cluster 1 (this output may be very long
# depending on how many members):

str(clusters$lower[[1]])

To extract specific details from the list and automate for all or some
of the clusters ?dendrapply is your friend.
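
A minimal sketch of the cut()/dendrapply approach described above. It runs on the built-in USArrests data rather than the original poster's file, and the cutoff h_cut = 100 is an arbitrary value picked for that toy data, so both are assumptions made only for illustration:

```r
# Toy data (assumption): Euclidean distances on the built-in USArrests set
d <- dist(USArrests)
z <- hclust(d, method = "complete")
x <- as.dendrogram(z)

h_cut <- 100                 # hypothetical cutoff height for this toy data
parts <- cut(x, h = h_cut)   # list with components $upper and $lower

# Leaf labels and number of members of every branch below the cutoff
branch_labels <- lapply(parts$lower, labels)
branch_sizes  <- sapply(parts$lower, function(b) attr(b, "members"))

# dendrapply() visits every node of a dendrogram; here it only prefixes
# the leaf labels of the first branch, to show per-node processing
tagged <- dendrapply(parts$lower[[1]], function(node) {
  if (is.leaf(node)) attr(node, "label") <- paste0("c1:", attr(node, "label"))
  node
})
```

Each element of parts$lower is itself a dendrogram, so plot(), str() and a further cut() at a lower height can be applied to it directly.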

I'm assuming your attempts at reclustering locally later in your post
are no longer necessary, unless I'm missing something on what exactly
you are trying to do.

Hope this helps

Elai
#
Dear Elai,
thank you very much for your suggestion. I tried cutting the dendrogram
instead of the hclust tree with:
clusters<-cut(x, h=1.6)

but then when I try to call/plot cluster 1 for example, with:
plot(clusters$lower[[1]])

I get only 2 members that are joined together at distance=0 (cluster 1, for
instance, consists of several hundred members).
So it looks like /plot(clusters$lower[[1]])/ only calls the very first node
of the tree and not the content of the respective cluster [[1]] at the
defined cutoff=1.6. Maybe /cut/ instead of /cutree/ doesn't do the work? Or
maybe I am just doing something wrong?...



In another post I read that with /df[value %in% v, ] / I can extract
specific subsets of a data frame/table. Maybe I could use this to extract
only the distances of members of a specific cluster as defined by cutree
from the initial distance matrix? But still, I am afraid I don't get what I
should use as /value/ and /v/....

Inline:
On Feb 23, 2012, at 6:20 PM, kosmo7 <dnicolgr at hotmail.com> wrote:

            
That was me and there's a slight mistake in that post (corrected by Sarah): should be

df[df$value %in% v, ]

Sorry for any confusion that might have caused 
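
A small sketch of how that subsetting idiom works, and how the same %in% test applies to a named membership vector like the one cutree() returns. The data frame, its column names and the ids in v are made up for illustration:

```r
# Hypothetical data frame: one row per item, 'value' holds its cluster id
df <- data.frame(name  = c("a", "b", "c", "d", "e"),
                 value = c(1, 2, 1, 3, 2),
                 stringsAsFactors = FALSE)
v <- c(1, 3)                   # the ids we want to keep

# Rows whose 'value' is one of the ids in v
sub <- df[df$value %in% v, ]

# Same test on a named vector shaped like cutree() output
clusters <- c(a = 1, b = 2, c = 1, d = 3, e = 2)
members  <- names(clusters)[clusters %in% v]
```

So in the clustering question, /value/ would be the cluster id of each item and /v/ the ids of the clusters to extract.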

Michael
#
Ok, I was able to work it out finally.
As I have been aided myself numerous times from posted questions by other
users who have reached in the end a solution to their problem, I will put
the code that worked for me for future googlers - it is certainly not
optimal but it works:

# Initial clustering
library(cluster) #provides clusplot() and silhouette()
df=read.table('mydata.txt', head=T, row.names=1) #read file with distance matrix
d=as.dist(df) #format table as distance matrix
z<-hclust(d,method="complete", members=NULL)
x<-as.dendrogram(z)
plot(x, xlab="mydata complete-LINKAGE", ylim=c(0,4)) #visualization of the dendrogram
clusters<-cutree(z, h=1.6) #obtain clusters at cutoff height=1.6
ord<-cmdscale(d, k=2) #Multidimensional scaling of the data down to 2 dimensions
clusplot(ord,clusters, color=TRUE, shade=TRUE,labels=4, lines=0) #visualization of the clusters in 2D map

# Local sub-clustering (actually re-clustering on a specific tree node/cluster)

# Transform the distance matrix to a simple matrix. We should ideally work
# with the initial data table, but it sometimes contains an "X" letter
# preceding labels, and there is a risk that labels aren't recognized when
# compared to name vectors. Distance matrices don't contain the preceding "X",
# so I transformed it back to a simple matrix (this step might not be
# required, depending on your initial data table format).
h<-as.matrix(d)

# A vector containing the numbers of the clusters of the initial clustering
# that you want to pick; separate with commas if more than one cluster.
# Here we only want cluster 1.
clid<-c(1)
# Keep only the rows of h whose label belongs to a member of cluster 1
ysub<-h[names(clusters[clusters%in%clid]),]
# We want a square table to be used as distance matrix later on, so we
# transpose the previous table ysub and again keep only the needed rows.
ysub<-t(ysub)[names(clusters[clusters%in%clid]),]
# Perform your preferred hierarchical method on just the initial clusters
# selected with clid
hrsub<-hclust(as.dist(ysub),method="average")
plot(hrsub)
ord2<-cmdscale(ysub, k=2)
# Now we can visually "zoom" on the data configuration of just the selected
# cluster by 2d MDS
plot(ord2)
# We can perform silhouette analysis locally on the selected cluster (by clid)
aa<-silhouette(cutree(hrsub,h=1.2),as.dist(ysub))
plot(aa)
# clusterplot of the subclusters
clusplot(ord2,cutree(hrsub,h=1.2), color=TRUE, shade=TRUE,labels=4, lines=0)


Thanks for reading - take care all.

PS. If anyone can write all these things in a more efficient way, please
feel free to add a comment.
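
One possibly shorter way to write the sub-clustering step, sketched on the built-in USArrests data instead of mydata.txt. The cutoffs (100 and 30) and the names m_full, members, dsub and sub_trees are illustrative assumptions, not part of the code above. The idea is to index the full distance matrix by the member names on rows and columns at once, which avoids the transpose:

```r
# Toy data (assumption): Euclidean distances on the built-in USArrests set
d <- dist(USArrests)
z <- hclust(d, method = "complete")
clusters <- cutree(z, h = 100)        # hypothetical initial cutoff

m_full  <- as.matrix(d)               # full square distance matrix
clid    <- c(1)                       # cluster(s) to zoom in on
members <- names(clusters)[clusters %in% clid]

# Square sub-matrix: rows and columns restricted to the chosen members
dsub  <- as.dist(m_full[members, members, drop = FALSE])
hrsub <- hclust(dsub, method = "average")
subclusters <- cutree(hrsub, h = 30)  # hypothetical local cutoff

# The same step for every initial cluster (clusters of size 1 are skipped,
# since hclust() needs at least two objects)
sub_trees <- lapply(split(names(clusters), clusters), function(m) {
  if (length(m) < 2) return(NULL)
  hclust(as.dist(m_full[m, m]), method = "average")
})
```

sub_trees is a list named by cluster id, so sub_trees[["1"]] is the local tree for cluster 1 and can be passed to plot() or cutree() like hrsub.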


#
Inline:

On Thu, Feb 23, 2012 at 8:23 PM, R. Michael Weylandt
<michael.weylandt at gmail.com> wrote:
The "suggestions" in my original post are just pointers to the fact
there are methods for class dendrogram to achieve what you wanted.
Since you got as far as x<-as.dendrogram(z) I assumed that's all you
needed.

> Maybe /cut/ instead of /cutree/ doesnt do the work? Or

The examples in ?as.dendrogram and ?dendrapply are self-contained,
very clear and straightforward. If you haven't done so already I
suggest you try them. Most likely the problem is in your data
(row.names?) or your interpretation of who is "cluster1" or the 1.6
cutoff.
Seems I missed some back and forth on this post already, so my
apologies if this is no longer an issue. Personally I find that
because there are many more nodes and info in a tree than rows in the
data set (leaf nodes only) much of the "usual" generic R solutions get
distorted when it comes to trees. Better to use appropriate methods
for the class (dendrapply helps as I've said before).
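
On the question of who is "cluster 1": as far as I understand it, cutree() numbers clusters by the order in which observations first appear in the data, while cut() returns the lower branches in left-to-right dendrogram order, so clusters$lower[[1]] need not be cutree cluster 1. A hedged sketch of matching the two, again on the built-in USArrests data with an arbitrary cutoff (branch_id and cluster1 are hypothetical names):

```r
# Toy data (assumption): Euclidean distances on the built-in USArrests set
d <- dist(USArrests)
z <- hclust(d, method = "complete")
x <- as.dendrogram(z)

h_cut    <- 100                       # hypothetical cutoff height
clusters <- cutree(z, h = h_cut)      # named vector: item -> cluster id
parts    <- cut(x, h = h_cut)         # branches in dendrogram order

# For each lower branch, look up the cutree id shared by its leaves
branch_id <- sapply(parts$lower, function(b) unique(clusters[labels(b)]))

# The branch that corresponds to cutree cluster 1
cluster1 <- parts$lower[[which(branch_id == 1)]]
```

plot(cluster1) or str(cluster1) then shows the local hierarchy of the same items that cutree() labels as cluster 1.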

Hope that helps dig you out of the hole.
Elai