Question
I have a vector of integers which I wish to divide into clusters so that the distance between any two clusters is greater than a lower bound, and within any cluster, the distance between two elements is less than an upper bound.
For example, suppose we have the following vector:
1, 4, 5, 6, 9, 29, 32, 36
If we set the lower bound and upper bound to 19 and 9 respectively, the two vectors below would be one possible result:
1, 4, 5, 6, 9
29, 32, 36
Thanks to @flodel's comments, I realized this kind of clustering may be impossible. So I would like to modify the question a bit:
What are the possible clustering methods if I impose only the between-cluster distance lower bound?
What are the possible clustering methods if I impose only the within-cluster distance upper bound?
Answer 1:
What are the possible clustering methods if I impose only the between cluster distance lower bound?
Hierarchical clustering with single linkage:
x <- c(1, 4, 5, 6, 9, 29, 32, 46, 55)
tree <- hclust(dist(x), method = "single")
split(x, cutree(tree, h = 19))
# $`1`
# [1] 1 4 5 6 9
#
# $`2`
# [1] 29 32 46 55
What are the possible clustering methods if I impose only the within cluster distance upper bound?
Hierarchical clustering with complete linkage:
x <- c(1, 4, 5, 6, 9, 20, 26, 29, 32)
tree <- hclust(dist(x), method = "complete")
split(x, cutree(tree, h = 9))
# $`1`
# [1] 1 4 5 6 9
#
# $`2`
# [1] 20
#
# $`3`
# [1] 26 29 32
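As a sanity check, the two bounds can be measured directly on the resulting groups. The sketch below is only an illustration: `max_within` and `min_between` are hypothetical helper names, not part of base R, and `min_between` assumes there are at least two clusters.

```r
# Hypothetical helper: largest distance between two elements of the same cluster
max_within <- function(groups) {
  max(vapply(groups, function(g) max(g) - min(g), numeric(1)))
}

# Hypothetical helper: smallest distance between elements of different clusters
# (assumes at least two clusters)
min_between <- function(groups) {
  pairs <- combn(length(groups), 2)
  min(apply(pairs, 2, function(p) {
    min(abs(outer(groups[[p[1]]], groups[[p[2]]], "-")))
  }))
}

# Single linkage: check the between-cluster lower bound
x1 <- c(1, 4, 5, 6, 9, 29, 32, 46, 55)
single <- split(x1, cutree(hclust(dist(x1), method = "single"), h = 19))
min_between(single) > 19

# Complete linkage: check the within-cluster upper bound
x2 <- c(1, 4, 5, 6, 9, 20, 26, 29, 32)
complete <- split(x2, cutree(hclust(dist(x2), method = "complete"), h = 9))
max_within(complete) <= 9
```

Single linkage merges on the smallest gap between groups, so cutting at h leaves groups separated by more than h; complete linkage merges on the largest pairwise distance, so cutting at h keeps each group's diameter at or below h.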
Answer 2:
Here's a simple algorithm that will work, explained conceptually (implementation details omitted):
- Ensure your list is sorted.
- Place a "marker" between every pair of consecutive elements that are more than lower_bound apart. These mark all the possible cluster boundaries.
- Include a marker before the beginning of the list and after the end.
- Go through pairs of adjacent markers in order, and for each pair left_marker and right_marker, check whether the distance between the element immediately to the right of left_marker and the element immediately to the left of right_marker is less than upper_bound.
- If the previous step ever returns false, the clustering is impossible.
- Otherwise, the markers form the boundaries of the desired clustering.
Applying this to your example, we get:
- Sorted: 1, 4, 5, 6, 9, 29, 32, 36
- Markers: 1, 4, 5, 6, 9 | 29, 32, 36
- Additional start/end markers: | 1, 4, 5, 6, 9 | 29, 32, 36 |
- Check "upper bound" constraint: (9 - 1) = 8 < 9: TRUE; (36 - 29) = 7 < 9: TRUE
- None of the comparisons returned false
- Desired clustering: (1, 4, 5, 6, 9), (29, 32, 36)
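The marker procedure above can be sketched in R roughly as follows. This is a minimal illustration rather than a definitive implementation: `cluster_with_bounds` is a hypothetical name, the input is assumed to be a non-empty numeric vector, and returning NULL to signal "impossible" is an arbitrary choice.

```r
# Hypothetical helper: returns a list of clusters, or NULL when the
# lower and upper bounds cannot both be satisfied.
cluster_with_bounds <- function(x, lower_bound, upper_bound) {
  x <- sort(x)
  n <- length(x)
  # Marker i sits just after element i; 0 and n are the start/end markers.
  markers <- c(0, which(diff(x) > lower_bound), n)
  first <- head(markers, -1) + 1   # index right of each left marker
  last  <- tail(markers, -1)       # index left of each right marker
  # Upper-bound check: each cluster's span must stay below upper_bound.
  if (any(x[last] - x[first] >= upper_bound)) {
    return(NULL)
  }
  Map(function(l, r) x[l:r], first, last)
}

cluster_with_bounds(c(1, 4, 5, 6, 9, 29, 32, 36),
                    lower_bound = 19, upper_bound = 9)
```

Because elements closer than lower_bound must share a cluster, the markers give the finest clustering the lower bound allows; if one of those clusters is already too wide, no coarser grouping can fix it.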
EDIT: Original poster relaxed the conditions of the problem.
If you only want to satisfy the lower bound condition:
- Ensure your list is sorted.
- Place a marker between every pair of consecutive elements that are more than lower_bound apart.
- Include a marker before the beginning and after the end.
- These markers form the boundaries of the desired clustering.
The following gets you step 2 assuming your vector is already sorted:
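# Given (the vector must already be sorted)
vec <- c(1, 4, 5, 6, 9, 29, 32, 36)
lower_bound <- 19
# TRUE when the gap between element x and element x + 1 exceeds lower_bound
f <- function(x) {
  vec[x + 1] - vec[x] > lower_bound
}
indices <- seq_len(length(vec) - 1)
# Filter() keeps every matching index; Position() would return only the first one
marker_positions <- Filter(f, indices)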
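To go from marker positions to the clusters themselves, one option is a cumulative sum over the "gap exceeds lower_bound" indicator. The sketch below assumes a sorted, non-empty vector; the names `starts_cluster`, `cluster_id` and `clusters` are illustrative only.

```r
vec <- c(1, 4, 5, 6, 9, 29, 32, 36)   # assumed to be sorted already
lower_bound <- 19

# TRUE for the first element and for every element that follows a marker
starts_cluster <- c(TRUE, diff(vec) > lower_bound)

# A running count of cluster starts gives each element its cluster label
cluster_id <- cumsum(starts_cluster)

# One vector per cluster, named by cluster label
clusters <- split(vec, cluster_id)
clusters
```

This avoids handling the start and end markers explicitly, since the first element always opens cluster 1 and the last cluster simply runs to the end of the vector.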
Source: https://stackoverflow.com/questions/17228737/clustering-by-distance-in-r