-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of marco milella
Sent: Thursday, December 06, 2012 12:08 PM
To: r-help at r-project.org
Subject: [R] clustering of binary data
Good morning,

I am analyzing a dataset composed of 364 subjects and 13 binary variables
(0 = absence, 1 = presence).
I am testing for possible associations (co-presence) among my variables. To
do this, I have been trying cluster analysis.
My main interest is to check the significance of the clusters obtained.
First, I tried the pvclust() function, using method.hclust="ward" and
method.dist="binary". Overall it works (clusters and significance values are
obtained). However, I'm not convinced by the distance matrix: the
associations between variables differ from the results obtained in PAST
using Ward's method on a Jaccard matrix (which should be appropriate for
binary data).
Moreover, when I try to obtain a Jaccard matrix in R from my data using
vegdist() from the vegan package:

mydistance<-vegdist(t(data),method="jaccard")

I receive the following error message:

Error in rowSums(x, na.rm = TRUE) : 'x' must be numeric

Below is a subset of my dataset:

       variable1 variable2 variable3 variable4 variable5 variable6 variable7 variable8 variable9 variable10 variable11 variable12 variable13
case1          0         0         0         0         0         1         0         0         1          1          0          0          0
case2          0         0         0         0         0         1         0        NA        NA          1          0          0          0
case3          0         0         0         0         0         1         0         0         1          1          0          0          0
case4          1         0         0         0         0         1         0         1         0          1          0          0          0
case5          0         0         0         0         0         1         0         0         1          1          0          0          0
case6          0         1         0         0         0         1         0         1         0          1          0          0          0
case7          0         1         0         0         0         1         0         0         1          1          0          0          0
case8          0         0         0         0         0         1         0         1         0          1          0          0          0
case9          0         0         0         0         0         1         0         1         0          1          0          0          0
case10         0         0         0         0         0         1         0         0         1          1          0          0          0
case11         1         0         0         1         0         1         1         1         0          1          0          0          0
case12         0         0         0         1         1         0         1         1         0          1          0          0          0
.....
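
A minimal sketch of one possible cause of the error quoted above, under the
assumption (not confirmed from the full dataset) that some columns were
imported as character or factor rather than numeric. The toy data frame, its
values, and the names toy, toy_num and d_jaccard are hypothetical and only
for illustration. The sketch also uses base R's dist(method = "binary"),
which is a Jaccard-type distance (the proportion of mismatches among
positions where at least one of the two vectors is 1), so it may serve as a
cross-check that does not depend on vegan.

```r
# Hypothetical toy data: binary variables stored as character columns,
# which is one plausible (assumed) cause of "'x' must be numeric".
toy <- data.frame(
  variable1 = c("0", "1", "0", "1"),
  variable2 = c("0", "1", "1", "1"),
  variable3 = c("1", "0", NA, "0"),
  stringsAsFactors = FALSE,
  row.names = paste0("case", 1:4)
)

# Inspect the column classes; anything other than numeric/integer
# would become a character matrix after t() and break rowSums().
print(sapply(toy, class))

# Coerce every column to numeric, going through character so that
# factor columns would be converted by label rather than by level code.
toy_num <- as.matrix(sapply(toy, function(col) as.numeric(as.character(col))))
rownames(toy_num) <- rownames(toy)

# Distance between variables (columns), hence the transpose.
# Pairs containing NA are left out of the computation for that pair.
d_jaccard <- dist(t(toy_num), method = "binary")
print(d_jaccard)
```

If the classes printed for the real data are all numeric or integer, this
assumption does not apply and the cause of the error lies elsewhere.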

So, my questions are the following: Is the Jaccard index a good choice for
my kind of data? Is the binary distance used in pvclust theoretically more
correct? Is there any alternative to pvclust for testing the significance
of my clusters?

Thanks in advance
Marco