Clustering methods for data that has bimodal distribution

2 messages · Adrian Johnson, Ranjan Maitra

#
Dear group,
pardon me for a naive question. I have a data matrix (11K rows, 4K columns).
The values range between -1 and 1. They are not integers but real
numbers, with precision down to at least the millionths place.

The data distribution is peculiar: if I do plot(density(myMatrix)), I
get a nice bimodal curve (one bell-shaped peak between -1 and 0
and another between 0 and 1).
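
A minimal sketch of the check described above, on simulated data rather than the real matrix. The name `myMatrix` comes from the post; the dimensions, the two mode locations, and the spread are assumptions chosen only to produce a similar-looking curve.

```r
set.seed(1)

# Hypothetical stand-in for the real 11K x 4K matrix: a small matrix whose
# entries are drawn from two modes, one on each side of 0.
n_row <- 200
n_col <- 50
vals <- c(rnorm(n_row * n_col / 2, mean = -0.5, sd = 0.15),
          rnorm(n_row * n_col / 2, mean =  0.5, sd = 0.15))
vals <- pmin(pmax(vals, -1), 1)            # clip to the stated range [-1, 1]
myMatrix <- matrix(sample(vals), nrow = n_row, ncol = n_col)

# Pool every entry into a single vector and estimate its density.
d <- density(as.vector(myMatrix))
plot(d, main = "Pooled density of all matrix entries")
```

Note that this pools all entries into one vector, so the curve describes the marginal distribution of values and not how the rows are arranged relative to each other.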

I am interested in clustering the data using consensus clustering
(which uses K-means).

My questions are:

1. If my data range between -1 and 1, is K-means an appropriate
method, considering that the data might have ties?

2. Although K-means is non-parametric, would bimodally distributed
data be okay as input to K-means?

I appreciate any suggestion.
Thanks
Adrian.
#
Hello Adrian,

It all depends on what the structure of the dataset is. For instance, you said that all your values are between -1 and 1. Do the squared values in each data row sum to 1? How about the means: are they zero? I guess all this depends on the application, on how the data were processed, and on what question is to be answered. Even if Euclidean space is most apt, you then need to figure out what sort of structure you would like in your derived groups/clusters. For example, k-means has an underlying philosophy: homogeneous spherical clusters of roughly equal sizes. Is this what you want?
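
The checks asked about above could be sketched in R roughly as follows. This is only an illustration on simulated data: the stand-in matrix, its dimensions, and the choice of k = 3 are assumptions, not anything known about the real dataset.

```r
set.seed(1)

# Hypothetical stand-in data; replace with the real matrix.
myMatrix <- matrix(runif(200 * 50, min = -1, max = 1), nrow = 200)

# Do the squared values in each row sum to 1? Are the row means zero?
row_ss    <- rowSums(myMatrix^2)
row_means <- rowMeans(myMatrix)
print(summary(row_ss))
print(summary(row_means))

# k-means with an assumed k = 3 (arbitrary, for illustration only).
fit <- kmeans(myMatrix, centers = 3, nstart = 10)

# Rough look at the "roughly equal sizes" and "homogeneous" assumptions:
print(fit$size)                 # number of rows assigned to each cluster
print(fit$withinss / fit$size)  # average within-cluster sum of squares
```

If the row sums of squares all come out close to 1, the rows lie on a sphere, which may suggest a different notion of distance than the Euclidean one k-means uses. Very unequal cluster sizes or spreads would be a hint that the k-means assumptions fit poorly.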

HTH,
Ranjan
On Sun, 4 Dec 2016 22:52:33 -0500 Adrian Johnson <oriolebaltimore at gmail.com> wrote: