Thanks Prof. Fraley. This is very helpful. I had looked at the way mclust gets its start and that looked like a possible source of the differences, but had not had time to get up to speed and figure it out. I was thinking it was more of a set.seed sort of thing. Now I know more. Thanks so much. Bryan
On 4/20/10 11:41 AM, "Chris Fraley" <fraley at u.washington.edu> wrote:
This is a really interesting case, but it is probably not that unusual.
mod23 <- Mclust(tur[,2:3])
mod32 <- Mclust(tur[,3:2])
mod23[c("modelName","G")]
mod32[c("modelName","G")]
The 2:3 case is choosing the VVV model (more parameters per cluster) with 3
clusters.
The 3:2 case is choosing the EEV model (more parsimonious), but with 7
clusters.
If you look at the BIC curves, you can see that mclust is indeed choosing the
model
corresponding to the maximum BIC in each case.
plot( mclustBIC( tur[,2:3], modelNames = c("EEV", "VVV"))
plot( mclustBIC( tur[,3:2], modelNames = c("EEV", "VVV"))
The default initialization for Mclust is via hierarchical clustering, and
that's
why the results differ
hc23 <- hcVVV( tur[,2:3]) # default initialization for mod23
hc32 <- hcVVV( tur[,3:2]) # default initialization for mod32
all.equal(hc32,hc23) # they differ!
The hierarchical clustering merges two groups at each stage.
If there is more than one choice at any particular stage for
which the merge criteria are equal or within roundoff error,
a different pair could be chosen for merging if columns (or rows)
are permuted. This will affect later merges, and could ultimately
affect the mclust results.
If you initialize both col permutations in the same way, you'll get
the same results.
mod23mod <- Mclust(tur[,2:3], initialization = list( hcPairs = hc32)) # uses
mod32 initialization
mod32mod <- Mclust(tur[,3:2], initialization = list( hcPairs = hc23)) # uses
mod23 initialization
mod23mod[c("modelName","G")] # same as mod32
mod32mod[c("modelName","G")] # same as mod23
Hope this helps,
Chris Fraley
On Tue, 20 Apr 2010, Bryan Hanson wrote:
Prof. Fraley, I wonder if you would have a couple of moments to respond to this question I put to the R help list. Thanks in advance for your time. Bryan ************* Bryan Hanson Acting Chair Professor of Chemistry & Biochemistry DePauw University, Greencastle IN USA ------ Forwarded Message From: Bryan Hanson <hanson at depauw.edu> Date: Mon, 19 Apr 2010 19:31:03 -0400 To: <hanson at gapps.depauw.edu>, R Help <r-help at stat.math.ethz.ch> Subject: [R] What is mclust up to? Different clusters found if x and y interchanged Hello All... I gave a task to my students that involved using mclust to look for clusters in some bivariate data of isotopes vs various mining locations. They discovered something I didn?t expect; the data (called tur) is appended below. p <- qplot(x = dD, y = dCu65, data = tur, color = mine) print(p) # simple bivariate plot of the data; looks fine mod1 <- Mclust(tur[,2:3]) mod1$G mod2 <- Mclust(tur[,3:2]) mod2$G One can use coordProj to see the clusters found, but the basic result is that mclust found 3 clusters in mod1, but if you interchange the x and y columns it finds 7 clusters (mod2). Since this is bivariate data, I find this result pretty strange. The only thing I can think of is that it has something to do with how the algorithm is seeded, but that isn't too convincing. Since I couldn't believe my eyes, I made up some bivariate data with known clusters (not given here) and interchanged the x and y columns and got the same clusters. I'm at a loss. Hopefully someone can set my understanding on the right path. Thanks for any insight! Bryan
sessionInfo()
R version 2.10.1 (2009-12-14) x86_64-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] datasets tools grid graphics grDevices utils stats methods base other attached packages: [1] faraway_1.0.4 GGally_0.1 xtable_1.5-6 mvbutils_2.5.1 ggplot2_0.8.7 [6] digest_0.4.2 reshape_0.8.3 proto_0.3-8 ChemoSpec_1.43 R.utils_1.4.0 [11] R.oo_1.7.1 R.methodsS3_1.2.0 rgl_0.91 lattice_0.18-3 mvoutlier_1.4 [16] plyr_0.1.9 RColorBrewer_1.0-2 chemometrics_0.8 som_0.3-5 robustbase_0.5-0-1 [21] rpart_3.1-46 pls_2.1-0 pcaPP_1.8 mvtnorm_0.9-9 nnet_7.3-1 [26] mclust_3.4.4 MASS_7.3-5 lars_0.9-7 e1071_1.5-23 class_7.3-2 And the data:
dput(tur)
structure(list(mine = structure(c(13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
9L, 9L, 9L, 9L, 12L, 12L, 12L, 12L, 12L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 4L, 4L, 4L, 4L, 4L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Cas",
"CH", "CR", "FM", "GU", "KG", "KM", "LV", "MC", "MZ", "N8", "OR",
"SB", "TF"), class = "factor"), dCu65 = c(6.1, 6.5, 5.7, 6.4,
6.1, 5.9, 6.5, 5.7, 6, 6.4, 6.8, 6.9, 6.7, 6.7, 5.8, 5.3, 6.7,
5.7, 6.7, 5.6, 6.4, 6.4, 6.5, 7.6, 6.7, 6.4, 6.4, 7, 7, 5.9,
5.3, 6.1, 11.8, 12.1, 12.4, 1, 1.8, 1.4, 1.2, 1.2, 1.2, 1.1,
1.2, 1.8, 1, 0.8, 0.7, 1.2, 1.5, 1, 0.2, 1.4, 0.3, 1.1, 0, -0.5,
-1.4, -0.2, 1.1, 0.9, 0.1, 1.4, -1.2, -0.1, -0.4, -0.8, 3, 3.3,
3.9, 3.9, 3.7, 9.5, 9.6, 9.6, 8.3, 10.3, 8.4, 9.4, 0.8, 0.9,
1.3, 0.2, 2.2, 0.7, 1.8, 4.5, 4.5, 4.6, 4.5, 4.6, 3.9, 4.2, 4.9,
4.9, 4.5, 16.7, 17.4, 17.5, 17.5, 17.4, 1.9, 2.4, 1.7, 1.2, 1.5,
1.1, 1.8, 3.1, 3.1, 3, 4.1, 4.2, 2.7, 4.5, 0.4, 1.2, 3, 3.5,
1.2, 0.8, 0.9, 2, -0.8, 5.8, 0.6, 1, 0.9, 4.4, 2.3, -0.1), dD = c(-66L,
-81L, -81L, -71L, -68L, -79L, -73L, -77L, -74L, -71L, -73L, -80L,
-76L, -80L, -74L, -77L, -74L, -77L, -74L, -82L, -79L, -72L, -74L,
-71L, -76L, -72L, -76L, -73L, -75L, -78L, -81L, -77L, -77L, -73L,
-79L, -104L, -98L, -98L, -109L, -106L, -99L, -107L, -100L, -98L,
-95L, -95L, -95L, -97L, -98L, -121L, -119L, -118L, -121L, -121L,
-106L, -116L, -122L, -112L, -125L, -109L, -114L, -118L, -102L,
-109L, -101L, -101L, -74L, -78L, -63L, -67L, -75L, -65L, -60L,
-62L, -67L, -59L, -62L, -68L, -61L, -58L, -69L, -57L, -59L, -67L,
-73L, -81L, -90L, -87L, -82L, -80L, -80L, -81L, -83L, -90L, -89L,
-107L, -122L, -128L, -113L, -124L, -93L, -89L, -92L, -85L, -79L,
-84L, -83L, -148L, -142L, -127L, -130L, -123L, -131L, -85L, -87L,
-90L, -81L, -65L, -81L, -119L, -121L, -94L, -90L, -117L, -95L,
-69L, -122L, -103L, -85L, -106L)), .Names = c("mine", "dCu65",
"dD"), class = "data.frame", row.names = c(NA, -130L))
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. ------ End of Forwarded Message