Natural Breaks Classification
Most of my problem is that this is my first time using R, so I just don't know the functions (or even really the semantic differences) for dealing with vectors,arrays, and matrices. Thank you very much for the code sample. I have been looking around, is there a good basic reference online to find out basic function information so I can reference what things like order, match, unlist, tapply, etc. are doing? I have yet to find just a good functional reference doc for R. Given that I am not plotting my data in R, I think that you are going further than what I actually need. Given the small example data set, all that I really need is just the two numbers 4 and 11 (before I had said 3, but I realized the third is unnecessary). From these I plan to construct my classes as <4,>=4 and <11, >=11 when I create my classification in MapServer where I am doing the plotting. David
On 2/24/06, Roger Bivand <Roger.Bivand at nhh.no> wrote:
On Fri, 24 Feb 2006, David Bitner wrote:
I was actually already using that paper as a resource, but I am still not quite understanding how to get just the breaks that I can use for classification in an external mapping application. Right now I am trying out some samples just to understand what's going on. Using this array: a<-c(11,1,3,3,4,4,4,4,4,4,5) This gives me the vector (3,1,1,1,2,2,2,2,2,2,2) ideally I want the output for this to be just (3,5,11) Taking this with the min/max of my data I can create the classes 1-3,4-5,6-11
Next line on p. 15:
a<-c(11,1,3,3,4,4,4,4,4,4,5) av <- kmeans(a, 3) str(av)
List of 4 $ cluster : int [1:11] 3 1 1 1 2 2 2 2 2 2 ... $ centers : num [1:3, 1] 2.33 4.14 11.00 ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:3] "1" "2" "3" .. ..$ : NULL $ withinss: num [1:3] 2.667 0.857 0.000 $ size : int [1:3] 3 7 1 - attr(*, "class")= chr "kmeans"
cols <- match(av$cluster, order(av$centers)) cols
[1] 3 1 1 1 2 2 2 2 2 2 2
stbrks <- matrix(unlist(tapply(a, factor(cols), range)), ncol = 2,
+ byrow = TRUE)
stbrks
[,1] [,2] [1,] 1 3 [2,] 4 5 [3,] 11 11 which then needs unpacking:
apply(stbrks, 1, function(x) paste(x, collapse="-"))
[1] "1-3" "4-5" "11-11" but which isn't pretty, because there is a gap. As plot(ecdf(jitter(a)), verticals=TRUE) shows, the gap is there in the data, so it is only a matter of convention that class intervals are shown for values that are not in the data, and my preference is to let the data do the talking (but I'm in a minority!), and cartographic convention will probably mean shifting the internal gaps to "look" smaller, and rounding the values to make them easier to handle. For larger numbers of data items, it may be necessary to get the breaks and re-assign the cols because of rounding. The stbrks matrix is just the range within the class for each class for the cols resulting from the classification. With kmeans, note that it can get stuck, so I'd put it inside try() and copy jittered values onto the end of the data vector if it fails - simply because kmeans seeds its classes at random first, and can be unlucky. Please let the list know how you get on, I'm sure class intervals are an interesting topic. Roger
On 2/24/06, Roger Bivand <Roger.Bivand at nhh.no> wrote:
On Fri, 24 Feb 2006, David Bitner wrote:
I am trying to create some type of a Natural Breaks Classification in PL/R to classify data that I have in a PostgreSQL/PostGIS database. All that I really need so that I can pass information on to Mapserver to display this data is the class break values (ie an array [3,5,7,9] would mean show values 1-3 in blue, 4-5 in red, etc.). I am completely new to R, but have a fair bit of experience with the other PL's in PostgreSQL, so the PL part shouldn't be too hard to figure out. I am thinking that kmeans should give me something close to what I want. My problem is that I am not quite sure to massage the output to get my class breaks or what format to input the data. For arguments, kmeans takes a matrix, number of classes, and a max number of iterations -- last two are easy, but how do I convert (do I need to convert) an array into a matrix (is matrix just R speak for an array?) I will be starting with a one dimension array ie ([1,4,1,1,1,6,4,9,9]). The next issue is spitting the data out. The docs tell me that I get the centers of the clusters where what I really want are the boundaries of the clusters, how could I get at the "break points" that I am after?
I think you may find some of the code in: http://spatial.nhh.no/papers/aag04.pdf useful (though dated). kmeans() and - my preference - bclust() in the e1071 package may fail when given too few values, but there are ways round that. You'll see code (for example at the top of p. 15) on how to get the class centres out - but they - as done there - still leave gaps between classes as can be seen from the ECDF plot below. As you can see, kmeans() manages with a vector OK. It would be nice to follow this up as a single function - input the data vector and some preferences wrt. number of classes, and output as a number of (list of) class intervals with some fitness criterion, something like that? I guess your colour palette is on the PL side? Roger
Any help is appreciated, David
_______________________________________________ R-sig-Geo mailing list R-sig-Geo at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/r-sig-geo
-- Roger Bivand Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43 e-mail: Roger.Bivand at nhh.no
-- Roger Bivand Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43 e-mail: Roger.Bivand at nhh.no