algorithm for clustering categorical data

10 messages · David L Carlson, Li, Yan, David Winsemius +1 more

#
Read up on Gower's Distance measures (available in the ecodist
package) which can combine numeric and categorical data. You
didn't give us any information about how you numerically
transformed the categorical variables, but the usual approach
is to create indicator variables that code presence/absence
for each category within a categorical variable. Different
variances between variables can be reduced by standardizing
the variables.

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Li, Yan
Sent: Thursday, August 1, 2013 11:00 AM
To: r-help at r-project.org
Subject: [R] algorithm for clustering categorical data

Hi All,

Does anyone know of an algorithm for clustering categorical
variables? Any R packages? Which is the best?

If a data set has both numeric and categorical variables, what
is the best clustering algorithm to use, and which R package?

I tried a numeric transformation of all categorical fields and
then clustering. But the transformed fields have values from
1...10, while my other fields are on a much bigger scale:
10000-... This makes the categorical fields have less effect
on the distance calculation...

Thank you!
Yan


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible
code.
#
Great! Thanks!

Yeah, I just used the usual way, as.numeric(..), for the numeric transformation... it seems standardization is needed. Thank you.
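
A small sketch of what as.numeric() does to a factor may help here. The colour values are made up for illustration. The point is only that the integer codes follow the sorted order of the factor levels, so Euclidean distance treats some categories as "closer" than others for no substantive reason, and standardizing the codes afterwards does not remove that ordering.

```r
# Hypothetical nominal variable (values chosen only for illustration)
colour <- factor(c("red", "green", "blue", "orange", "yellow"))

# as.numeric() on a factor returns the internal level codes,
# which follow the sorted (alphabetical) order of the levels
codes <- as.numeric(colour)
names(codes) <- as.character(colour)

# Pairwise distances on the codes: unequal gaps between unordered categories
d_codes <- as.matrix(dist(codes))

# Standardizing rescales the gaps but keeps them unequal
d_scaled <- as.matrix(dist(scale(codes)))

print(codes)
print(d_codes)
```

Here "blue" and "yellow" end up four code units apart while "blue" and "green" are one unit apart, although nothing in the data says so.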

#
On Aug 1, 2013, at 9:00 AM, Li, Yan wrote:

            
Many.

http://cran.r-project.org/web/views/Cluster.html
For what purpose?
This seems impossibly vague and confused. You are asked in the Posting Guide to provide a working example if you want help with code.
#
Thanks for the reply....
4 days later
#
Hi David and other R helpers,

Suppose I rescale the numeric fields to [0,1] and represent the categorical fields as 1:k, which is the same starting point as Gower's measure, but then use Euclidean distance instead of Gower's distance for k-means clustering. How big is the difference? What is the drawback?

Thank you,
Yan

#
What do you mean by representing the categorical fields by 1:k?

a <- c("red", "green", "blue", "orange", "yellow")

becomes

a <- c(1, 2, 3, 4, 5)

That guarantees your results are worthless unless your categories
have an inherent order (e.g. tiny, small, medium, big, giant).
Otherwise it should be four (k-1) indicator/dummy variables (e.g.):

a.red <- c(1, 0, 0, 0, 0)
a.green <- c(0, 1, 0, 0, 0)
a.blue <- c(0, 0, 1, 0, 0)
a.orange <- c(0, 0, 0, 1, 0)

Then you can use Euclidean distance.
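
The indicator coding above can be sketched with model.matrix() instead of typing the columns by hand. This is only an illustration under assumptions: the data frame, the "size" column, the rescale01() helper and the choice of two clusters are all hypothetical and not from the thread.

```r
# Hypothetical toy data: one nominal variable and one large-scale numeric variable
df <- data.frame(
  colour = factor(c("red", "green", "blue", "orange", "yellow")),
  size   = c(10000, 12000, 15000, 11000, 30000)
)

# k-1 indicator columns: the first factor level is the reference and gets no column
dummies <- model.matrix(~ colour, data = df)[, -1, drop = FALSE]

# Rescale the numeric column to [0, 1] so it does not dominate the distance
# (rescale01 is a helper defined here, not a base R function)
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
x <- cbind(dummies, size = rescale01(df$size))

# Euclidean distances on the combined matrix
d <- dist(x, method = "euclidean")

# k-means works on the same matrix (it uses Euclidean distance internally)
set.seed(1)
km <- kmeans(x, centers = 2, nstart = 10)
print(km$cluster)
```

One design point to be aware of: with k-1 columns the reference level sits at distance 1 from every other level, while any two non-reference levels are sqrt(2) apart. Keeping all k columns, e.g. model.matrix(~ colour - 1, data = df), makes every pair of categories equidistant, which may be preferable for distance-based clustering.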

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352


#
> What do you mean by representing the categorical fields by 1:k?
>
> a <- c("red", "green", "blue", "orange", "yellow")
>
> becomes
>
> a <- c(1, 2, 3, 4, 5)
>
> That guarantees your results are worthless

Worthless indeed!

> unless your categories
> have an inherent order (e.g. tiny, small, medium, big, giant).
> Otherwise it should be four (k-1) indicator/dummy variables (e.g.):
>
> a.red <- c(1, 0, 0, 0, 0)
> a.green <- c(0, 1, 0, 0, 0)
> a.blue <- c(0, 0, 1, 0, 0)
> a.orange <- c(0, 0, 0, 1, 0)
>
> Then you can use Euclidean distance.

Yes, ... or use Gower's or other similarly sophisticated
distances, as you (David) mentioned earlier in this thread.

Do also note that a generalized Gower's distance (+ weighting of
variables) is available from the ('recommended' hence always
installed) package 'cluster' :

  require("cluster")
  ?daisy
  ## notably  daisy(*,  metric="gower")

Note that daisy() is more sophisticated than most users know:
its 'type = *' specification allows, notably for binary
variables (such as your a.<col> dummies above), asymmetric
behavior, which may be quite important in "rare event" and
similar cases.
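
As a rough sketch of the daisy() call mentioned above (the data frame and its column names are invented for illustration, and using pam() afterwards is just one possible way to cluster a dissimilarity object, not something prescribed in this thread):

```r
library(cluster)

# Hypothetical mixed data: a nominal factor, a 0/1 "rare event" flag, a numeric column
df <- data.frame(
  colour = factor(c("red", "green", "blue", "orange", "yellow", "red")),
  rare   = c(0, 0, 1, 0, 0, 1),
  size   = c(10000, 12000, 15000, 11000, 30000, 10500)
)

# Gower dissimilarity; 'rare' is declared asymmetric binary, so two rows
# that are both 0 on it are not counted as similar on that variable
d_gower <- daisy(df, metric = "gower", type = list(asymm = "rare"))

# pam() accepts a dissimilarity object directly
fit <- pam(d_gower, k = 2, diss = TRUE)
print(fit$clustering)
```

No manual rescaling is needed here, since Gower's coefficient scales each numeric variable by its range.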

Martin


#
Thanks David, This is very useful!

#
Thanks for the reply...

For some reason, I need to keep Euclidean distance in the process...
