Natural Breaks - Jenks

3 messages · Ben Brehmer, Roger Bivand, Darla Munroe

#
I have implemented Jenks' algorithm (for finding natural breaks) in 
PHP, thanks to some sample code I found at: 
http://www.mail-archive.com/r-sig-geo at stat.math.ethz.ch/msg00290.html . 
This algorithm is also implemented in ArcView to determine natural 
breaks in legends.
Currently I am running the algorithm on a data set of 65,000 
elements, which takes over 3 hours (due to a nested for loop). 
ArcView's implementation, on the other hand, returns within seconds. Would 
anyone know why ArcView's implementation is so much more efficient?

Any help would be greatly appreciated.

Ben Brehmer
#
On Tue, 18 Apr 2006, Ben Brehmer wrote:

            
library(classInt)
?classIntervals
y <- runif(65000)
yClass <- classIntervals(y, n=5, style="fisher")

runs on a 1.5GHz machine in 225 seconds. This uses the Fortran code 
you refer to directly. My guess is that Arc looks at the number of unique 
values and, if there are many, uses a heuristic. If it sampled the data and 
set the same seed each time, the result would be the same on every run, and 
the code runs acceptably fast for, say, 2000 values. Maybe Arc also 
precomputes values?
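
The sampling idea above can be sketched in R. This is only an illustration of the guess, not Arc's actual method: the helper name sampled_breaks, the 2000-value cut-off and the seed value are all assumptions made for the example.

```r
library(classInt)

# Hypothetical helper (not part of classInt): when there are many unique
# values, classify a fixed-seed random sample instead of the full vector.
sampled_breaks <- function(x, n = 5, max_n = 2000, seed = 1) {
  if (length(unique(x)) > max_n) {
    set.seed(seed)              # same seed, so same sample and same breaks
    x_used <- sample(x, max_n)
    # Keep the extremes so the sample spans the full data range
    x_used <- c(range(x), x_used)
  } else {
    x_used <- x
  }
  classIntervals(x_used, n = n, style = "fisher")$brks
}

set.seed(42)
y <- runif(65000)
b1 <- sampled_breaks(y)
b2 <- sampled_breaks(y)
identical(b1, b2)   # repeated calls give the same breaks
```

Note that the helper resets the global random seed on each call, which is what makes the breaks reproducible. The breaks are an approximation based on the sample, not the exact Fisher-Jenks solution for all 65000 values.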

Roger

  
    
#
Yes, by default I think ArcGIS takes a 10% sample to create its breaks, and
you have to go in manually to change that option. And yes, Roger, I think
you're right: it samples the data as you load it into the map (you can see
this if you have a very large discrete grid, which won't display all its cells
if there are more than 5,000 or some such cut-off number).

-----Original Message-----
From: r-sig-geo-bounces at stat.math.ethz.ch
[mailto:r-sig-geo-bounces at stat.math.ethz.ch] On Behalf Of Roger Bivand
Sent: Tuesday, April 18, 2006 2:05 PM
To: Ben Brehmer
Cc: r-sig-geo at stat.math.ethz.ch
Subject: Re: [R-sig-Geo] Natural Breaks - Jenks
On Tue, 18 Apr 2006, Ben Brehmer wrote:

            