What is mclust up to? Different clusters found if x and y interchanged - R-help

Tue, Apr 20, 2010 10:36 AM #

Thanks Prof. Fraley.  This is very helpful.  I had looked at the way mclust
gets its start and that looked like a possible source of the differences,
but had not had time to get up to speed and figure it out.  I was thinking
it was more of a set.seed sort of thing.  Now I know more.  Thanks so much.
Bryan

On 4/20/10 11:41 AM, "Chris Fraley" <fraley at u.washington.edu> wrote:


This is a really interesting case, but it is probably not that unusual.

mod23 <- Mclust(tur[,2:3])
mod32 <- Mclust(tur[,3:2])

mod23[c("modelName","G")]
mod32[c("modelName","G")]

The 2:3 case is choosing the VVV model (more parameters per cluster) with 3
clusters.
The 3:2 case is choosing the EEV model (more parsimonious), but with 7
clusters.

If you look at the BIC curves, you can see that mclust is indeed choosing the
model
corresponding to the maximum BIC in each case.

plot( mclustBIC( tur[,2:3], modelNames = c("EEV", "VVV"))
plot( mclustBIC( tur[,3:2], modelNames = c("EEV", "VVV"))

The default initialization for Mclust is via hierarchical clustering, and
that's
why the results differ

hc23 <- hcVVV( tur[,2:3]) # default initialization for mod23
hc32 <- hcVVV( tur[,3:2]) # default initialization for mod32

all.equal(hc32,hc23) # they differ!

The hierarchical clustering merges two groups at each stage.
If there is more than one choice at any particular stage for
which the merge criteria are equal or within roundoff error,
a different pair could be chosen for merging if columns (or rows)
are permuted. This will affect later merges, and could ultimately
affect the mclust results.

If you initialize both col permutations in the same way, you'll get
the same results.

mod23mod <- Mclust(tur[,2:3], initialization = list( hcPairs = hc32)) # uses
mod32 initialization
mod32mod <- Mclust(tur[,3:2], initialization = list( hcPairs = hc23)) # uses
mod23 initialization

mod23mod[c("modelName","G")] # same as mod32
mod32mod[c("modelName","G")] # same as mod23

Hope this helps,

Chris Fraley


On Tue, 20 Apr 2010, Bryan Hanson wrote:

Prof. Fraley, I wonder if you would have a couple of moments to respond to
this question I put to the R help list.  Thanks in advance for your time.

Bryan
*************
Bryan Hanson
Acting Chair
Professor of Chemistry & Biochemistry
DePauw University, Greencastle IN USA


------ Forwarded Message
From: Bryan Hanson <hanson at depauw.edu>
Date: Mon, 19 Apr 2010 19:31:03 -0400
To: <hanson at gapps.depauw.edu>, R Help <r-help at stat.math.ethz.ch>
Subject: [R] What is mclust up to? Different clusters found if x and y
interchanged

Hello All...

I gave a task to my students that involved using mclust to look for clusters
in some bivariate data of isotopes vs various mining locations.  They
discovered something I didn?t expect; the data (called tur) is appended
below.

p <- qplot(x = dD, y = dCu65, data = tur, color = mine)
print(p) # simple bivariate plot of the data; looks fine

mod1 <- Mclust(tur[,2:3])
mod1$G
mod2 <- Mclust(tur[,3:2])
mod2$G

One can use coordProj to see the clusters found, but the basic result is
that mclust found 3 clusters in mod1, but if you interchange the x and y
columns it finds 7 clusters (mod2).  Since this is bivariate data, I find
this result pretty strange.  The only thing I can think of is that it has
something to do with how the algorithm is seeded, but that isn't too
convincing.  Since I couldn't believe my eyes, I made up some bivariate data
with known clusters (not given here) and interchanged the x and y columns
and got the same clusters.  I'm at a loss.  Hopefully someone can set my
understanding on the right path.

Thanks for any insight!  Bryan

sessionInfo()

R version 2.10.1 (2009-12-14)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] datasets  tools     grid      graphics  grDevices utils     stats
methods   base

other attached packages:
[1] faraway_1.0.4      GGally_0.1         xtable_1.5-6       mvbutils_2.5.1
ggplot2_0.8.7
[6] digest_0.4.2       reshape_0.8.3      proto_0.3-8        ChemoSpec_1.43
R.utils_1.4.0
[11] R.oo_1.7.1         R.methodsS3_1.2.0  rgl_0.91           lattice_0.18-3
mvoutlier_1.4
[16] plyr_0.1.9         RColorBrewer_1.0-2 chemometrics_0.8   som_0.3-5
robustbase_0.5-0-1
[21] rpart_3.1-46       pls_2.1-0          pcaPP_1.8          mvtnorm_0.9-9
nnet_7.3-1
[26] mclust_3.4.4       MASS_7.3-5         lars_0.9-7         e1071_1.5-23
class_7.3-2

And the data:

dput(tur)

structure(list(mine = structure(c(13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
9L, 9L, 9L, 9L, 12L, 12L, 12L, 12L, 12L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 4L, 4L, 4L, 4L, 4L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Cas",
"CH", "CR", "FM", "GU", "KG", "KM", "LV", "MC", "MZ", "N8", "OR",
"SB", "TF"), class = "factor"), dCu65 = c(6.1, 6.5, 5.7, 6.4,
6.1, 5.9, 6.5, 5.7, 6, 6.4, 6.8, 6.9, 6.7, 6.7, 5.8, 5.3, 6.7,
5.7, 6.7, 5.6, 6.4, 6.4, 6.5, 7.6, 6.7, 6.4, 6.4, 7, 7, 5.9,
5.3, 6.1, 11.8, 12.1, 12.4, 1, 1.8, 1.4, 1.2, 1.2, 1.2, 1.1,
1.2, 1.8, 1, 0.8, 0.7, 1.2, 1.5, 1, 0.2, 1.4, 0.3, 1.1, 0, -0.5,
-1.4, -0.2, 1.1, 0.9, 0.1, 1.4, -1.2, -0.1, -0.4, -0.8, 3, 3.3,
3.9, 3.9, 3.7, 9.5, 9.6, 9.6, 8.3, 10.3, 8.4, 9.4, 0.8, 0.9,
1.3, 0.2, 2.2, 0.7, 1.8, 4.5, 4.5, 4.6, 4.5, 4.6, 3.9, 4.2, 4.9,
4.9, 4.5, 16.7, 17.4, 17.5, 17.5, 17.4, 1.9, 2.4, 1.7, 1.2, 1.5,
1.1, 1.8, 3.1, 3.1, 3, 4.1, 4.2, 2.7, 4.5, 0.4, 1.2, 3, 3.5,
1.2, 0.8, 0.9, 2, -0.8, 5.8, 0.6, 1, 0.9, 4.4, 2.3, -0.1), dD = c(-66L,
-81L, -81L, -71L, -68L, -79L, -73L, -77L, -74L, -71L, -73L, -80L,
-76L, -80L, -74L, -77L, -74L, -77L, -74L, -82L, -79L, -72L, -74L,
-71L, -76L, -72L, -76L, -73L, -75L, -78L, -81L, -77L, -77L, -73L,
-79L, -104L, -98L, -98L, -109L, -106L, -99L, -107L, -100L, -98L,
-95L, -95L, -95L, -97L, -98L, -121L, -119L, -118L, -121L, -121L,
-106L, -116L, -122L, -112L, -125L, -109L, -114L, -118L, -102L,
-109L, -101L, -101L, -74L, -78L, -63L, -67L, -75L, -65L, -60L,
-62L, -67L, -59L, -62L, -68L, -61L, -58L, -69L, -57L, -59L, -67L,
-73L, -81L, -90L, -87L, -82L, -80L, -80L, -81L, -83L, -90L, -89L,
-107L, -122L, -128L, -113L, -124L, -93L, -89L, -92L, -85L, -79L,
-84L, -83L, -148L, -142L, -127L, -130L, -123L, -131L, -85L, -87L,
-90L, -81L, -65L, -81L, -119L, -121L, -94L, -90L, -117L, -95L,
-69L, -122L, -103L, -85L, -106L)), .Names = c("mine", "dCu65",
"dD"), class = "data.frame", row.names = c(NA, -130L))

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

------ End of Forwarded Message