
clustering or homogeneity approaches?

1 message · Weiwei Shi

Hi there,
I have a question about the following dataset:
          [,1]      [,2]       [,3]       [,4]       [,5]
[1,] 34.216166 96.928587 330.125990 330.183222 330.201215
[2,]  2.819183  8.134491   8.275841   8.525256   8.828448
[3,]  2.819183  7.541680   7.550333   8.374636   8.690998
[4,]  4.672551  5.036353   5.072710   5.152218   5.223204
[5,]  5.470131  5.500513   5.674139   5.689151   5.770423
[6,]  4.480287  4.628300   4.797686   4.814106   4.823345

I want to separate the first 3 rows from the rest. The criterion I am
looking for is a "gap" within a row.
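
To make the "gap" idea concrete, here is a minimal sketch in R. It assumes that a "gap" means a large relative jump between neighbouring values in a row, that the values are positive, and that each row is already sorted in increasing order (as in the rows shown). The names `max_gap` and `gap_threshold`, and the cutoff of 1.5, are hypothetical choices for illustration only.

```r
# The six rows shown above, entered row by row.
x <- matrix(c(
  34.216166, 96.928587, 330.125990, 330.183222, 330.201215,
   2.819183,  8.134491,   8.275841,   8.525256,   8.828448,
   2.819183,  7.541680,   7.550333,   8.374636,   8.690998,
   4.672551,  5.036353,   5.072710,   5.152218,   5.223204,
   5.470131,  5.500513,   5.674139,   5.689151,   5.770423,
   4.480287,  4.628300,   4.797686,   4.814106,   4.823345
), nrow = 6, byrow = TRUE)

n <- ncol(x)

# Ratio of each value to its left neighbour (assumes sorted, positive rows).
ratios <- x[, -1] / x[, -n]

# Largest relative jump in each row.
max_gap <- apply(ratios, 1, max)

# Hypothetical cutoff: flag rows where some value is more than
# 1.5 times its left neighbour.
gap_threshold <- 1.5
has_gap <- max_gap > gap_threshold

which(has_gap)
```

The ratio matrix is built with one vectorised division, so only the row-wise `max` goes through `apply`.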

My current approach is to compute sd(row)/median(row) for each row and
apply a threshold, which is very naive but fast and good enough. I would
like something better and more "academic", though. Please advise. I think
clustering might help, but it needs to be quick since t2 has 30000 rows.
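
For reference, the sd/median screen described above could look roughly like this in R. This is only a sketch on the six rows shown: the names `score` and `threshold` and the cutoff value of 0.1 are hypothetical, and the real threshold would have to be tuned on the full data.

```r
# The six rows shown above, entered row by row.
x <- matrix(c(
  34.216166, 96.928587, 330.125990, 330.183222, 330.201215,
   2.819183,  8.134491,   8.275841,   8.525256,   8.828448,
   2.819183,  7.541680,   7.550333,   8.374636,   8.690998,
   4.672551,  5.036353,   5.072710,   5.152218,   5.223204,
   5.470131,  5.500513,   5.674139,   5.689151,   5.770423,
   4.480287,  4.628300,   4.797686,   4.814106,   4.823345
), nrow = 6, byrow = TRUE)

# Row-wise sample standard deviation, computed without an explicit loop:
# subtracting rowMeans(x) recycles down the columns, so each row is centred.
row_sd <- sqrt(rowSums((x - rowMeans(x))^2) / (ncol(x) - 1))

# Row-wise median.
row_med <- apply(x, 1, median)

# Dispersion relative to the median, one value per row.
score <- row_sd / row_med

# Hypothetical cutoff chosen by eye for these six rows.
threshold <- 0.1
flagged <- which(score > threshold)

flagged
```

Both `rowSums` and `rowMeans` are vectorised, so the same lines should stay fast on a matrix with 30000 rows.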

Thanks,

Weiwei