complete linkage Agglomerative hierarchical clustering, nnclust, spatclus or something else? - R-SIG-Geo

Tue, Apr 20, 2010 8:16 AM #

I have just read about clustering on wikipedia, and learnt that what I
want is:

Agglomerative hierarchical clustering, with complete linkage

I searched for suitable r-packages for this and found nnclust, and
spatclus. Are those the packages that you could recommend for
clustering events data (the events here is urban fires, created by
arsonists)? Or do you want to suggest other packages?

In this first analysis I want to do the clustering should only by
location, and ignore the point in time.

Later on I will also include time data, so if the clustering package
could handle time too, that would be great, but that is not a
requirement at this time.

The aim at this stage is only to group events with the same, or almost
the same, location. In my data-set the coordiante-data is a bit too
precise in some cases.

I guess a crude way of clustering would be to round the
coordinate-data to a lesser number of significant digits, however, a
sound clustering algorithm would be better :-)

I also need to grasp the scale here:

With this proj4string, and datapoints as below:

  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
  .. .. ..@ projargs: chr " +proj=utm +zone=33 +ellps=GRS80 +units=m +no_defs"

head(fires at coords, 2)
         East   North
[1,] 315359.9 6393110
[2,] 325862.4 6405239

dist(head(fires at coords, 2))
         1
2 16044.10

Is this in meters?

TIA

Hans Ekbrand (http://sociologi.cjb.net) <hans at sociologi.cjb.net>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20100420/edb90403/attachment.bin>

Roger Bivand

Tue, Apr 20, 2010 8:25 AM #

On Tue, 20 Apr 2010, Hans Ekbrand wrote:

library(cluster)
?hclust

is for clustering with moderate numbers of points.

Below: in UTM, the units are metres, so distances are in metres too.

Hope this helps,

Roger

Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no

Hans Ekbrand

Tue, Apr 20, 2010 2:13 PM #

Roger Bivand wrote:

Thanks. In a first run I have 900 points, but the whole set contains
70.000 points.

OK, good to know.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 260 bytes
Desc: OpenPGP digital signature
URL: <https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20100420/1f8e19ab/attachment.bin>

Hans Ekbrand

Wed, Apr 21, 2010 12:59 AM #

On Tue, Apr 20, 2010 at 11:13:22PM +0200, Hans Ekbrand wrote:

print(load(url("http://sociologi.cjb.net/temp/clust.geo.test.RData")))
clust.geo.test.tree <- hclust(dist(clust.geo.test at coords))
clust.geo.test.tree$height

head(clust.geo.test.tree$height, 70)
 [1]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[11]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[21]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[31]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[41]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[51]   0.000000   0.000000   0.000000   0.000000   3.160631  18.963676  30.398644  32.232351  37.927539  44.987446
[61]  50.065192  81.542472  82.691738  93.553729  95.971207 105.325405 115.218371 119.540239 125.235381 130.181302

As I understand this, the 54 zeroes represent identical coordinates.
The positive numbers represent the distance in meters between points
that have been grouped together at a certain level of the tree. Now, I
am not interested in grouping together points with distances larger
than 100 meters, so I would like to stop the clustering process at
that point - or, after the hclust has completed, extract the clusters
that were in effect at that level. In the above example that would be
at level 65.

I didn't understand from the documentation of hclust how to accomplish
that, can someone on the list help me?

The goal is to count, for each cluster, the number of fires and then
to analyse how the fires within each cluster is distributed over time,
and to count how many of them that are too close in time to be
considered independent.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20100421/6fd09231/attachment.bin>

Hans Ekbrand

Wed, Apr 21, 2010 4:38 AM #

On Wed, Apr 21, 2010 at 09:59:51AM +0200, Hans Ekbrand wrote:

[...]

I found cutree(), and understood the "h" parameter of cutree, and then
it all worked out. Here's an example for the archives.

# Clustering
max.distance.in.same.cluster <- 100
print(load(url("http://sociologi.cjb.net/temp/clust.geo.test.RData")))
clust.geo.test.tree <- hclust(dist(clust.geo.test at coords))
my.cluster <- cutree(clust.geo.test.tree, h = max.distance.in.same.cluster)
# Which clusters have more than one member?
sort(unique(my.cluster[which(duplicated(my.cluster))]))

# How many members do these cluster have?
sapply(sort(unique(my.cluster[which(duplicated(my.cluster))])), function(x) {length(which(my.cluster == x))})

# Print a sorted list of the longest distances within each of these clusters.
sort(sapply(sort(unique(my.cluster[which(duplicated(my.cluster))])), function(x) {max(dist(clust.geo.test at coords[which(my.cluster == x),]))}))


Thanks again, Roger, for the pointer to hclust() 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20100421/91e87d8a/attachment.bin>

Roger Bivand

Wed, Apr 21, 2010 4:42 AM #

On Wed, 21 Apr 2010, Hans Ekbrand wrote:

On Tue, Apr 20, 2010 at 11:13:22PM +0200, Hans Ekbrand wrote:

Roger Bivand wrote:

On Tue, 20 Apr 2010, Hans Ekbrand wrote:

I have just read about clustering on wikipedia, and learnt that what I
want is:

Agglomerative hierarchical clustering, with complete linkage

library(cluster)
?hclust

print(load(url("http://sociologi.cjb.net/temp/clust.geo.test.RData")))
clust.geo.test.tree <- hclust(dist(clust.geo.test at coords))
clust.geo.test.tree$height

head(clust.geo.test.tree$height, 70)
[1]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[11]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[21]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[31]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[41]   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
[51]   0.000000   0.000000   0.000000   0.000000   3.160631  18.963676  30.398644  32.232351  37.927539  44.987446
[61]  50.065192  81.542472  82.691738  93.553729  95.971207 105.325405 115.218371 119.540239 125.235381 130.181302

As I understand this, the 54 zeroes represent identical coordinates.
The positive numbers represent the distance in meters between points
that have been grouped together at a certain level of the tree. Now, I
am not interested in grouping together points with distances larger
than 100 meters, so I would like to stop the clustering process at
that point - or, after the hclust has completed, extract the clusters
that were in effect at that level. In the above example that would be
at level 65.

I didn't understand from the documentation of hclust how to accomplish
that, can someone on the list help me?

So you do not want hclust at all, really. Look at dnearneigh() in spdep, 
setting a 100m bound. Then use n.comp.nb() to see which points belong to 
which graph component, using perhaps plot.nb with colours to distinguish 
the subgraphs.

Roger

Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no

Hans Ekbrand

Wed, Apr 21, 2010 5:04 AM #

On Wed, Apr 21, 2010 at 01:42:05PM +0200, Roger Bivand wrote:

Well, hclust was useful, once I understood how cutree works. What
would be the benefit of dnearneigh(), is it faster?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20100421/161c7af8/attachment.bin>

Roger Bivand

Wed, Apr 21, 2010 6:14 AM #

On Wed, 21 Apr 2010, Hans Ekbrand wrote:

For larger data sets, hclust needs a triangular distance matrix, 
dnearneigh does not. Finding graph components in the output "nb" object 
also seems conceptually more direct. But if your solution works for you, 
that is fine.

Roger

Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no

Hans Ekbrand

Wed, Apr 21, 2010 6:24 AM #

On Wed, Apr 21, 2010 at 03:14:46PM +0200, Roger Bivand wrote:

[...]

OK, good to know if I run into trouble when using the code on larger
data-sets later on.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20100421/2f074037/attachment.bin>

Thomas Lumley

Wed, Apr 21, 2010 8:53 AM #

You asked earlier about nnclust:  it does single-linkage rather than complete-linkage clustering, that is, it defines clusters so that each point in the cluster has a nearest neighbour in the cluster closer than the threshold distance. This produces much less circular clusters than complete-linkage clustering.

The main distinctive feature of nnclust is that it is feasible even for quite large data sets, taking linear space and roughly nlogn time.

         -thomas

On Wed, 21 Apr 2010, Hans Ekbrand wrote:

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle