<posted & mailed>

Dear all,

I'm trying to solve the problem of how to find clusters of values in a vector that are closer than a given value. Illustrated, this might look as follows:

vector <- c(0,0.45,1,2,3,3.25,3.33,3.75,4.1,5,6,6.45,7,7.1,8)

When using '0.5' as the proximity requirement, the following groups would result:
0,0.45
3,3.25,3.33,3.75,4.1
6,6.45
7,7.1

Jim Holtman proposed a very elegant solution in http://tolstoy.newcastle.edu.au/R/e2/help/07/07/21286.html, which I have modified and used since he wrote it to me. The beauty of this approach is that it works not only for constant proximity requirements as above, but also for overlap windows defined in terms of ppm around each value. Now I have an additional need and have found no way (short of iteratively stepping through all the groups returned) to do it with Jim's approach: how can I figure out that 6,6.45 and 7,7.1 are separate clusters?

Thanks for any hints, Joh
Finding overlaps in vector
13 messages · jim holtman, Charles C. Berry, Johannes Graumann +2 more
Here is a modification of the algorithm to use a specified value for the overlap:
vector <- c(0,0.45,1,2,3,3.25,3.33,3.75,4.1,5,6,6.45,7,7.1,8)
# following add 0.5 as the overlap detection -- can be changed
x <- rbind(cbind(value=vector, oper=1, id=seq_along(vector)),
           cbind(value=vector+0.5, oper=-1, id=seq_along(vector)))
x <- x[order(x[,'value'], -x[, 'oper']),]
# determine which ones overlap
x <- cbind(x, over=cumsum(x[, 'oper']))
# now partition into groups
# determine where the breaks are (0 values in cumsum(over))
x <- cbind(x, breaks=cumsum(x[, 'over'] == 0))
# delete entries with 'over' == 0
x <- x[x[, 'over'] != 0,]
# split into groups
x.groups <- split(x[, 'id'], x[, 'breaks'])
# a group of n values leaves 2n-1 rows here, so keeping groups with
# at least 3 rows keeps the groups with at least 2 values
x.subsets <- x.groups[sapply(x.groups, length) >= 3]
# print out the subsets
invisible(lapply(x.subsets, function(a) print(vector[unique(a)])))

[1] 0.00 0.45
[1] 3.00 3.25 3.33 3.75 4.10
[1] 6.00 6.45
[1] 7.0 7.1
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve?
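To see why the sweep above keeps 6,6.45 and 7,7.1 apart, it can help to print the running counter for just those four values. The following is a minimal sketch, not Jim's code: the data frame layout and the names `events`, `width` and `cluster` are assumptions made for illustration. Each value opens a window (+1) and closes it at value + width (-1); a cluster ends whenever the count of open windows drops back to 0.

```r
# Hypothetical four-value example; 'width' plays the role of the 0.5 above.
x <- c(6, 6.45, 7, 7.1)
width <- 0.5

# One "open" event (+1) at each value, one "close" event (-1) at value + width.
events <- data.frame(
  value = c(x, x + width),
  oper  = rep(c(1, -1), each = length(x)),
  id    = rep(seq_along(x), times = 2)
)

# Sort by position; at ties, opens are placed before closes.
events <- events[order(events$value, -events$oper), ]

# Number of windows that are open after each event.
events$over <- cumsum(events$oper)

# A new cluster starts whenever the counter was 0 before the event.
events$cluster <- cumsum(c(0, head(events$over, -1)) == 0)

print(events, row.names = FALSE)

# Original indexes per cluster, taken from the "open" events only.
opens <- events[events$oper == 1, ]
clusters <- split(opens$id, opens$cluster)
print(clusters)
```

The counter returns to 0 at 6.95 (the close of 6.45's window), before 7 opens its own window, which is what separates the two clusters.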
This may not be as direct as Jim's in terms of specifying granularity, but it uses conventional hierarchical clustering to create the clusters and also draws a nice dendrogram for you. I have split the dendrogram at a height of 0.5 to define the clusters, but you can change that to whatever granularity you like:

v <- c(0, 0.45, 1, 2, 3, 3.25, 3.33, 3.75, 4.1, 5, 6, 6.45, 7, 7.1, 8)

# cluster and plot
hc <- hclust(dist(v), method = "single")
plot(hc, labels = v)
cl <- rect.hclust(hc, h = .5, border = "red")

# each component of list cl is one cluster. Print them out.
for(idx in cl) print(unname(v[idx]))

[1] 8
[1] 7.0 7.1
[1] 6.00 6.45
[1] 5
[1] 3.00 3.25 3.33 3.75 4.10
[1] 2
[1] 1
[1] 0.00 0.45

# a different representation of the clusters
vv <- v
names(vv) <- ct <- cutree(hc, h = .5)
vv

   1    1    2    3    4    4    4    4    4    5    6    6    7    7    8
0.00 0.45 1.00 2.00 3.00 3.25 3.33 3.75 4.10 5.00 6.00 6.45 7.00 7.10 8.00
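For one-dimensional data, cutting a single-linkage tree at height h amounts to splitting the sorted values at every gap larger than h. The sketch below checks that on the example vector; the helper `same_partition` and the variable names are made up for this illustration and are not part of the thread.

```r
v <- c(0, 0.45, 1, 2, 3, 3.25, 3.33, 3.75, 4.1, 5, 6, 6.45, 7, 7.1, 8)
h <- 0.5

# Cluster labels from the dendrogram cut.
by_tree <- cutree(hclust(dist(v), method = "single"), h = h)

# Cluster labels from the gaps: a new cluster starts after every gap > h.
# This shortcut assumes 'v' is sorted in increasing order.
by_gaps <- cumsum(c(TRUE, diff(v) > h))

# Two labelings describe the same partition if their labels pair up one-to-one.
same_partition <- function(a, b) {
  pairs <- unique(cbind(a, b))
  nrow(pairs) == length(unique(a)) && nrow(pairs) == length(unique(b))
}

print(same_partition(by_tree, by_gaps))
```

The gap version avoids building the full distance matrix, which may matter for long vectors, but it gives up the dendrogram.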
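Try this:

# v is the example vector (called 'vector' in the original post)
tmp <- rle( diff(v)<.5 )
ends <- 1+cumsum(tmp$lengths)[tmp$values]
# SIMPLIFY = FALSE keeps the result a list even if all clusters have equal size
mapply(function(x,y) v[ seq(to=x,length.out=y) ], ends, 1+tmp$lengths[tmp$values], SIMPLIFY = FALSE)

[[1]]
[1] 0.00 0.45

[[2]]
[1] 3.00 3.25 3.33 3.75 4.10

[[3]]
[1] 6.00 6.45

[[4]]
[1] 7.0 7.1

HTH, Chuck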
Charles C. Berry, (858) 534-2098
Dept of Family/Preventive Medicine, UC San Diego
E mailto:cberry at tajo.ucsd.edu
http://famprevmed.ucsd.edu/faculty/cberry/
La Jolla, San Diego 92093-0901
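The rle() one-liner is compact, so here is the same idea unpacked into named steps. This is only a sketch for readers following along; the variable names (`close_to_next`, `runs`, `starts`, `sizes`) are invented for the illustration.

```r
v <- c(0, 0.45, 1, 2, 3, 3.25, 3.33, 3.75, 4.1, 5, 6, 6.45, 7, 7.1, 8)

# TRUE where v[i + 1] is less than 0.5 away from v[i] (v must be sorted).
close_to_next <- diff(v) < 0.5

# Consecutive TRUEs form one run; a run of k TRUEs covers k + 1 values.
runs <- rle(close_to_next)

run_end <- cumsum(runs$lengths)            # last position of each run in diff(v)
ends    <- run_end[runs$values] + 1        # index in v of each cluster's last value
sizes   <- runs$lengths[runs$values] + 1   # number of values in each cluster
starts  <- ends - sizes + 1                # index in v of each cluster's first value

print(data.frame(start = starts, end = ends, size = sizes))
```

Keeping `starts` and `ends` as indexes rather than values makes it straightforward to map the clusters back to other columns that belong to the same observations.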
Jim,

Although I can't find the post this code stems from, I had come across it while prowling the newsgroup. It's not the one you had shared with me to eliminate overlaps (and which I referenced in my original post: http://tolstoy.newcastle.edu.au/R/e2/help/07/07/21286.html). That particular solution marked entries as overlapping or not, and I am looking for an extension to that code which would also return the actual "clusters" of consecutively overlapping values.

While Gabor's code in this thread does what I require for the example, I still hope somebody more clueful than myself can extend your code, since it carries the (for me) significant advantage of being able to build the windows of overlap with different values for 'up' and 'down': let's say, check which values overlap when the overlap-defining distance is 5ppm 'up' and 7.5ppm 'down' from each value. This is a generalization I would highly cherish.

Thanks for your help and continuous patience on r-help.

Joh
Thank you very much for this elegant solution to the problem. The reason I still hope for an extension of Jim's code (not the one he responded with in this thread, but the one I actually reference) is that windows of overlap can be asymmetric with it: one can check, e.g., whether values overlap given the constraints that the closest allowed proximity 'down' is 0.5 and 'up' is 0.75. I would highly cherish a solution that would allow for cluster isolation with that requirement.

Thanks for your time and insight, Joh
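One possible way to get clusters from asymmetric windows is sketched below. It is not Jim's referenced code, and it rests on an assumption the thread does not settle: each value x gets the window [x - down, x + up], and values belong to the same cluster when their windows overlap or touch, directly or through a chain of neighbours. The function name `window_clusters` and its arguments are hypothetical.

```r
# Sketch only: the cluster rule (merge values whose windows
# [x - down, x + up] overlap or touch) is an assumption, not thread code.
window_clusters <- function(x, down, up, min_size = 2) {
  stopifnot(is.numeric(x), length(x) >= 1, !anyNA(x),
            all(down >= 0), all(up >= 0))
  lo <- x - down            # lower window edge ('down' may be a vector)
  hi <- x + up              # upper window edge ('up' may be a vector)
  ord <- order(lo)          # sweep the windows from left to right
  lo <- lo[ord]
  hi <- hi[ord]
  n <- length(x)
  # Furthest upper edge reached by the windows seen before each position.
  reach <- cummax(hi)[-n]
  # A window starts a new cluster when it begins beyond that reach.
  starts_new <- c(TRUE, lo[-1] > reach)
  groups <- split(ord, cumsum(starts_new))
  # Return the original indexes of clusters that are large enough.
  unname(groups[lengths(groups) >= min_size])
}

v <- c(0, 0.45, 1, 2, 3, 3.25, 3.33, 3.75, 4.1, 5, 6, 6.45, 7, 7.1, 8)

# down = 0, up = 0.5 mimics the one-sided window used earlier in the thread.
idx <- window_clusters(v, down = 0, up = 0.5)
print(lapply(idx, function(i) v[i]))
```

Because `down` and `up` are recycled like any R vector, relative windows can be passed as, for example, `down = v * 7.5e-6, up = v * 5e-6` for ppm-style limits. Under the overlap rule assumed here, the two widths add up: with `down = 0.5` and `up = 0.75`, neighbours up to 1.25 apart are joined, so every value of the example vector ends up in a single cluster. If the intended rule is different (say, a value must fall inside its neighbour's window), the comparison in `starts_new` would need to change.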
Hm, hm, rect.hclust doesn't accept "plot=FALSE" and cutree doesn't retain the indexes of membership ... any way, short of ripping out the guts of rect.hclust, to achieve the same result without an active graphics device?

Joh
If we don't need any plotting we don't really need rect.hclust at all. Split the output of cutree, instead. Continuing from the prior code:
for(el in split(unname(vv), names(vv))) print(el)
[1] 0.00 0.45
[1] 1
[1] 2
[1] 3.00 3.25 3.33 3.75 4.10
[1] 5
[1] 6.00 6.45
[1] 7.0 7.1
[1] 8
But cutree does away with the indexes from the original input, which rect.hclust retains. I will have no other choice but to match that input with the 'values' contained in the clusters ...

Joh
Here's what I finally came up with. Thanks for your help!
Joh
MQUSpotOverlapClusters <- function(
  Series,        # Vector of data to be evaluated
  distance=0.5,  # Maximum distance of clustered data points
  minSize=2      # Minimum size of clusters returned
){
  ##########################################################################
  # Check prerequisites
  #####################
  # Check prerequisites: Series (hclust needs at least two values)
  if(!(is.numeric(Series) && length(Series) > 1)){
    stop("'Series' must be a vector of numerical data.")
  }
  # Check prerequisites: distance (a single positive number)
  if(!(is.numeric(distance) && length(distance) == 1 && distance > 0)){
    stop("'distance' must be a positive number.")
  }
  ##########################################################################
  # Perform clustering
  ####################
  hc <- hclust(dist(Series), method = "single")
  hcut <- cutree(hc,h=distance)
  # Collect the original indexes of every sufficiently large cluster
  cluster.idx <- list()
  for(i in unique(hcut)){
    members <- which(hcut == i)
    if(length(members) >= minSize){
      cluster.idx <- append(cluster.idx,list(members))
    }
  }
  return(cluster.idx)
}
If you want to retain the original rownames, then try:
vector
 [1] 0.00 0.45 1.00 2.00 3.00 3.25 3.33 3.75 4.10 5.00 6.00 6.45 7.00 7.10 8.00

#-----start cut-and-pastable-----
#this will "label" individual group membership
#diff(.) returns a vector that is smaller by one than its input
#so it needs to be augmented with c(1,fn(diff((.))
grp.v<-cbind(vector,(c(1,1+cumsum(as.numeric(diff(vector)>0.5)))))
#You can then tally up the counts in groups
tb<-table(grp.v[,2])
tb
#1 2 3 4 5 6 7 8
#2 1 1 5 1 2 2 1
# And apply the counts to the rows by doing a
# "row count" lookup into tb[.]
grp.v<-cbind(grp.v,tb[grp.v[,2]])
grp.v
#-----end cut and pastable------

  vector
1   0.00 1 2
1   0.45 1 2
2   1.00 2 1
3   2.00 3 1
4   3.00 4 5
4   3.25 4 5
4   3.33 4 5
4   3.75 4 5
4   4.10 4 5
5   5.00 5 1
6   6.00 6 2
6   6.45 6 2
7   7.00 7 2
7   7.10 7 2
8   8.00 8 1

Further processing of the membership "label" might better be accomplished by converting the matrix to a dataframe, and then working with the membership "label" as a factor. If you only want to deal with the rownames and values of vector that have more than <x> values, that should be straightforward.
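The data frame route suggested above could look roughly like the following. It is a sketch under two assumptions: the input is sorted in increasing order, and the column names (`idx`, `value`, `grp`, `size`) and the `min_size` threshold are made up for the example.

```r
vector <- c(0, 0.45, 1, 2, 3, 3.25, 3.33, 3.75, 4.1, 5, 6, 6.45, 7, 7.1, 8)

# Group label per value: a new group starts after every gap larger than 0.5.
# This assumes 'vector' is sorted in increasing order.
grp <- cumsum(c(TRUE, diff(vector) > 0.5))

# Keep the original index next to each value and store the label as a factor.
df <- data.frame(idx = seq_along(vector), value = vector, grp = factor(grp))

# Size of the group each row belongs to.
df$size <- ave(df$value, df$grp, FUN = length)

# Keep only the rows whose group has at least 'min_size' members.
min_size <- 2
print(df[df$size >= min_size, ])
```

Since `idx` travels with each row, the surviving rows can be matched back to the original input without comparing values.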
David Winsemius
If you want indexes, i.e. 1, 2, 3, ... instead of the values in v you can still use split -- just split on seq_along(v) instead of v (or if v had names you might want to split along names(v)):

split(seq_along(v), ct)

and if you only want to retain groups with 2+ elements then you can just use Filter to keep them:

twoplus <- function(x) length(x) >= 2
Filter(twoplus, split(seq_along(v), ct))
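Put together, the split/Filter idea fits in a small helper. The sketch below uses a hypothetical name (`cluster_indexes`) and runs it on the example vector in reversed order, to illustrate that the returned indexes refer to positions in the input as given, not to a sorted copy.

```r
# Hypothetical helper: original indexes of all clusters with >= min_size members.
# hclust() needs at least two values, so 'x' must have length >= 2.
cluster_indexes <- function(x, distance = 0.5, min_size = 2) {
  ct <- cutree(hclust(dist(x), method = "single"), h = distance)
  keep <- function(idx) length(idx) >= min_size
  unname(Filter(keep, split(seq_along(x), ct)))
}

v <- c(0, 0.45, 1, 2, 3, 3.25, 3.33, 3.75, 4.1, 5, 6, 6.45, 7, 7.1, 8)
shuffled <- rev(v)

idx <- cluster_indexes(shuffled)

# Look the values up through the indexes.
print(lapply(idx, function(i) shuffled[i]))
```

Unlike the diff()-based approaches in this thread, this one does not require the input to be sorted, at the cost of computing the full distance matrix.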
Enlightening. Thanks. Joh