Reg : null values in kmeans

5 messages · Raji, Jannis, raji sankaran +1 more


Raji

Mon, Dec 13, 2010 6:57 AM #

Hi,

  I am using the k-means algorithm for clustering. My data contains a few
null/NA values, and kmeans does not cluster with those values. Is there an
option like na.omit that can skip these null values and cluster the
remaining values?

Thanks,
Raji


1 day later

Jannis

Wed, Dec 15, 2010 5:20 AM #

I do not really understand your question. You can use kmeans, but only
without the observations that include NA values (e.g. by deleting
whole rows from your observation matrix). If you want to keep the
information in the valid observations of those rows, I fear you need to
look for a clustering algorithm that can handle missing values. I doubt
that there is a kmeans version that can. You could think about inserting
the means of all other observations into the gaps, though this introduces
bias as well.
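
A minimal sketch of the row-deletion approach described above, assuming a small made-up numeric matrix (the data and the names m, complete and fit are illustrative only, not from the original post). It uses complete.cases() from base R to drop every row that contains an NA before calling kmeans():

```r
# Hypothetical data: 10 observations, 2 variables, two NA entries.
set.seed(1)
m <- cbind(a = c(rnorm(5), rnorm(5, mean = 5)),
           b = c(rnorm(5), rnorm(5, mean = 5)))
m[2, "a"] <- NA
m[7, "b"] <- NA

# Keep only the rows that have no missing values.
complete <- m[complete.cases(m), , drop = FALSE]

# kmeans() now runs; rows 2 and 7 are simply not clustered.
fit <- kmeans(complete, centers = 2)
fit$cluster
```

The valid entries in the dropped rows (here m[2, "b"] and m[7, "a"]) are lost, which is the cost of this approach.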


Jannis

Raji wrote:

Wed, Dec 15, 2010 7:43 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20101216/b181ceba/attachment.pl>

Wed, Dec 15, 2010 8:28 PM #

Have you tried something like the following?

         x1       x2        x3 cluster
1        NA 1.000000 1.0000000      NA
2 0.6931472 1.414214 0.5000000       3
3 1.0986123 1.732051        NA      NA
4 1.3862944 2.000000 0.2500000       3
5 1.6094379 2.236068 0.2000000       3
6 1.7917595 2.449490 0.1666667       3

         x1       x2         x3 cluster
45 3.806662 6.708204 0.02222222       1
46 3.828641 6.782330 0.02173913       1
47 3.850148 6.855655 0.02127660       1
48 3.871201 6.928203 0.02083333       1
49 3.891820 7.000000 0.02040816       1
50 3.912023 7.071068 0.02000000       1
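
The code that produced this output did not survive in the archive. The following is a sketch, not the original code, of one way to get output of the same shape. It assumes the columns are log(i), sqrt(i) and 1/i for i = 1..50, which is inferred from the printed values, and that three clusters were requested. The names ok and fit and the seed are assumptions, and the cluster numbering depends on the random start, so the labels need not match those shown.

```r
# Assumed data, inferred from the printed values (not the original code).
x <- data.frame(x1 = log(1:50), x2 = sqrt(1:50), x3 = 1 / (1:50))
x[1, "x1"] <- NA
x[3, "x3"] <- NA

# Cluster only the rows without missing values.
ok <- complete.cases(x)
set.seed(42)  # cluster numbering depends on the random start
fit <- kmeans(x[ok, ], centers = 3)

# Write the labels back; rows containing an NA keep cluster NA.
x$cluster <- NA_integer_
x$cluster[ok] <- fit$cluster

head(x)
tail(x)
```

Keeping the labels in a column of the full data frame leaves the row numbering intact, so the unclustered rows stay visible instead of silently disappearing.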

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-bounces at r-project.org 
[mailto:r-help-bounces at r-project.org] On Behalf Of raji sankaran
Sent: Wednesday, December 15, 2010 7:43 PM
To: Jannis
Cc: r-help at r-project.org
Subject: Re: [R] Reg : null values in kmeans

Hi Jannis,

  Thank you for answering my question. I saw the option called na.omit when
I used nnet() to classify the iris data. I wanted to know whether a similar
option is available in kmeans that can omit, or in some way handle, the
null/NA values and cluster the remaining observations. Currently, kmeans
throws an error for a dataset with NULL/NA values.

From your answer, I understand that an option for handling NULL/NA is not
available in kmeans. Please correct me if I am wrong.

Thanks again :)



Jannis

Thu, Dec 16, 2010 4:28 AM #

Hi Raji,

I am quite sure that kmeans in general is not able to handle missing
values, so most probably there won't be an option for this in R. Either
you omit the observations with NAs, as William proposed, or you search for
an algorithm that can handle missing values (I am not sure whether there
is any). Another alternative would be to put mean values in the NA places.
This, however, biases the results.
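
A minimal sketch of the mean-imputation alternative, again on made-up data (the matrix and the names m, col_means, imputed and fit are illustrative assumptions). Each NA is replaced by the mean of the observed values in its column before kmeans() is called:

```r
# Hypothetical data: 10 observations, 2 variables, two NA entries.
set.seed(1)
m <- cbind(a = c(rnorm(5), rnorm(5, mean = 5)),
           b = c(rnorm(5), rnorm(5, mean = 5)))
m[2, "a"] <- NA
m[7, "b"] <- NA

# Column means computed from the observed (non-NA) values only.
col_means <- colMeans(m, na.rm = TRUE)

# Replace each NA with the mean of its own column.
imputed <- m
na_pos <- which(is.na(imputed), arr.ind = TRUE)
imputed[na_pos] <- col_means[na_pos[, "col"]]

# All 10 rows are now clustered, including the two imputed ones.
fit <- kmeans(imputed, centers = 2)
fit$cluster
```

The imputed entries sit at the overall column mean, which pulls those rows toward the middle of the data. That is the bias mentioned above.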


HTH
Jannis

raji sankaran wrote: