
extracting groups from hclust() for a very large matrix

1 message · Milan Bouchet-Valat

On Friday 12 October 2012 at 11:33 -0700, Christopher R. Dolanc wrote:
Yeah, but that's a problem with your data or your dist function, not
with hclust() and cutree().

As always, it's good to try to find the minimal example that reproduces
the problem. Start from examples provided by ?cutree:
hc <- hclust(dist(USArrests))
cutree(hc, k=2)
       Alabama         Alaska        Arizona       Arkansas     California 
             1              1              1              2              1 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             2              2              1              1              2 

      etc.

Here you can see that the cluster numbers are not in sequence, and my
command lists the groups correctly:
 split(rownames(USArrests), cutree(hc, 2))
$`1`
 [1] "Alabama"        "Alaska"         "Arizona"        "California"    
 etc.

$`2`
 [1] "Arkansas"      "Colorado"      "Connecticut"   "Georgia"      
 [5] "Hawaii"        "Idaho"         "Indiana"       "Iowa"         
 etc.  
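
As a small follow-up sketch on the same USArrests example, the list
returned by split() can be checked for group sizes, and the cutree()
result can be queried for a single state. The variable names `membership`
and `groups` are only illustrative:

```r
# Same example as in ?cutree: default dist() and default hclust() linkage
hc <- hclust(dist(USArrests))
membership <- cutree(hc, k = 2)   # named integer vector, one entry per state

# One character vector of state names per cluster
groups <- split(names(membership), membership)

lengths(groups)          # how many states fall in each cluster
membership["Arkansas"]   # cluster id for a single state
```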

So either your data is already ordered, or you have a problem with your
distance function. One guess: you have included the "Plot" column in the
call to vegdist(). I don't know this function, but it seems to work like
dist(), which means passing the plot id is plain wrong: the id is then
treated as one more numeric variable, so plots with close ids look
similar. Indeed, if I use
VTM.Dist <- vegdist(VTM.Matrix[, -1])
VTM.HClust <- hclust(VTM.Dist, method="ward")
VTM.8groups <- cutree(VTM.HClust, 8)
the result is not ordered as before.
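
The same effect can be reproduced with standard data. This is a minimal
sketch, assuming plain dist() instead of vegdist() and a made-up "Plot"
column whose values are multiplied by 1000 so that the id dominates the
distances. The names `with_id`, `g_with_id` and `g_without_id` are
hypothetical:

```r
# Hypothetical id column, scaled so that it swamps the real variables
with_id <- cbind(Plot = seq_len(nrow(USArrests)) * 1000, USArrests)

# Clustering with the id included: groups simply follow the row order
g_with_id <- cutree(hclust(dist(with_id)), k = 4)

# Clustering on the real variables only (drop the first column)
g_without_id <- cutree(hclust(dist(with_id[, -1])), k = 4)

is.unsorted(g_with_id)      # FALSE: cluster numbers only ever increase
is.unsorted(g_without_id)   # TRUE: cluster numbers jump around
```

If cutree() output on real data looks like the first case, an id or
ordering variable in the distance call is a likely suspect.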

Lesson: try with simple, standard data when complex data sets don't
work, and compare results.


My two cents