I have been asked to forward this. Please reply directly or include the
people who have been CC-ed in this e-mail. Thank you.
----- Forwarded message from "Timothy Waters"
<timothy.waters at plant-sciences.oxford.ac.uk> -----
Consider the following problem. You have a dataset with approximately 190
datapoints. Each datapoint has between 7 and 16 dimensions known: most have
7, many have 14, and a few have 16. The ones that have seven are divided into
two categories: the vast bulk have dimensions 1 to 7 known, and the others
have dimensions 8 to 14 known.
So, a fairly difficult dataset, but anyway.
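To make the difficulty concrete, here is a minimal sketch of a distance that uses only the dimensions known in both datapoints. It is an illustration under stated assumptions, not a recommended method: the function name partial_distance, the use of None for an unknown dimension, the rescaling by the number of shared dimensions, and the example points (including the guess that the 14-dimension points have dimensions 1 to 14 known) are all hypothetical.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the dimensions known in both points.

    Unknown values are None. The squared sum is rescaled by
    (total dims / shared dims) so that pairs sharing few dimensions stay
    comparable with pairs sharing many. Returns None when the two points
    have no known dimension in common.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))

# Hypothetical points with 16 possible dimensions (values are made up).
first_seven = [1.0] * 7 + [None] * 9                 # dimensions 1-7 known
second_seven = [None] * 7 + [2.0] * 7 + [None] * 2   # dimensions 8-14 known
fourteen = [0.0] * 14 + [None] * 2                   # assumed: dimensions 1-14 known

print(partial_distance(first_seven, fourteen))       # uses the 7 shared dimensions
print(partial_distance(first_seven, second_seven))   # None: nothing in common
```

The second call shows the awkward part: a point with only dimensions 1 to 7 and a point with only dimensions 8 to 14 have nothing in common, so any distance between them has to come from imputation or from some other assumption.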
Now, on to the analysis. You wish to look for the presence of any form of
multivariate structuring in the data, specifically, discrete clusters
identified by combinations of one or more variables. Clearly you have a
couple of options. You can just get a program to produce a dendrogram
(strictly, a phenogram in biological terms), or you can ask it to cluster
the data into some number n of sets, where 1 <= n <= 10 (for present
purposes). You then look at each possible solution (i.e. each value of n)
and examine what discriminant function analysis tells you about the ease of
separation of the clusters you have identified.
BUT, can you give a program a dataset (i.e., this dataset) and say "Find
n, where n is the number of clusters that the data is structured into, such
that the statistical differences between clusters are maximally
significant"?
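One way such a search could look, as a rough sketch only: cluster the data for each candidate n and keep the n with the best score on some cluster-validity index. The sketch below swaps in k-means and the mean silhouette width, which is a different criterion from "maximally significant" statistical differences and is undefined for n = 1, so it scans n = 2 to 10. It also assumes complete data with no unknown dimensions, which this dataset does not have. All function names and the toy data are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point_init(points, k):
    """Deterministic start: each new centre is the point farthest from the
    centres chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return [list(c) for c in centers]

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(p, points[j]) for j in other) / len(other)
                for l, other in clusters.items() if l != labels[i])
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)

def best_n(points, n_max=10):
    """Score every n from 2 to n_max and return the best n with all scores."""
    scores = {n: silhouette(points, kmeans(points, n)) for n in range(2, n_max + 1)}
    return max(scores, key=scores.get), scores

# Toy data: three tight 3x3 grids of points, far apart from one another.
offsets = [(dx, dy) for dx in (-0.5, 0.0, 0.5) for dy in (-0.5, 0.0, 0.5)]
data = [[cx + dx, cy + dy] for cx, cy in [(0, 0), (10, 0), (5, 9)] for dx, dy in offsets]
n, scores = best_n(data)
print(n)
```

On real data the scores for neighbouring values of n are often close, so it is worth looking at the whole scores table rather than only the winner.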