Clustering by distance in R

Question

I have a vector of integers which I wish to divide into clusters so that the distance between any two clusters is greater than a lower bound, and within any cluster, the distance between two elements is less than an upper bound.

For example, suppose we have the following vector:

1, 4, 5, 6, 9, 29, 32, 36

If the lower bound and upper bound are set to 19 and 9 respectively, the two vectors below would be one possible result:

1, 4, 5, 6, 9

29, 32, 36

Thanks to @flodel's comments, I realized this kind of clustering may be impossible. So I would like to modify the question a bit:

What are the possible clustering methods if I impose only the between cluster distance lower bound? What are the possible clustering methods if I impose only the within cluster distance upper bound?

Answer 1:

What are the possible clustering methods if I impose only the between cluster distance lower bound?

Hierarchical clustering with single linkage:

x <- c(1, 4, 5, 6, 9, 29, 32, 46, 55)
tree <- hclust(dist(x), method = "single")
split(x, cutree(tree, h = 19))

# $`1`
# [1] 1 4 5 6 9
# 
# $`2`
# [1] 29 32 46 55
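
For one-dimensional data, the same grouping can likely be reproduced without hclust by splitting the sorted vector wherever the gap between neighbours exceeds the threshold. A minimal sketch, assuming x is already sorted; min_gap, group_id and clusters are illustrative names, not part of the answer above:

```r
x <- c(1, 4, 5, 6, 9, 29, 32, 46, 55)
min_gap <- 19  # hypothetical name for the between-cluster lower bound

# diff(x) gives the gaps between neighbours; a gap larger than min_gap
# starts a new cluster, and cumsum() turns those breaks into group ids.
group_id <- cumsum(c(TRUE, diff(x) > min_gap))
clusters <- split(x, group_id)
```

Here the only gap wider than 19 is the one between 9 and 29, so the vector is cut once at that point.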

What are the possible clustering methods if I impose only the within cluster distance upper bound?

Hierarchical clustering with complete linkage:

x <- c(1, 4, 5, 6, 9, 20, 26, 29, 32)
tree <- hclust(dist(x), method = "complete")
split(x, cutree(tree, h = 9))

# $`1`
# [1] 1 4 5 6 9
# 
# $`2`
# [1] 20
# 
# $`3`
# [1] 26 29 32
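
One way to confirm that the cut respects the bound is to measure the spread of each cluster. A small sketch; clusters and spread are illustrative names. Note that with complete linkage, cutree(h = 9) should allow within-cluster distances up to and including 9, so a strictly-less-than bound may need a slightly smaller h:

```r
x <- c(1, 4, 5, 6, 9, 20, 26, 29, 32)
tree <- hclust(dist(x), method = "complete")
clusters <- split(x, cutree(tree, h = 9))

# For 1-D data the largest within-cluster distance is max minus min
spread <- sapply(clusters, function(v) diff(range(v)))
all(spread <= 9)
```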

Answer 2:

Here's a simple algorithm that will work, explained conceptually (implementation details omitted):

Ensure your list is sorted.
Place a "marker" between every pair of consecutive elements that are more than lower_bound apart. These mark all the possible cluster boundaries.
Include a marker before the beginning of the list and after the end.
Go through pairs of consecutive markers in order, and for each pair left_marker and right_marker, check whether the element immediately to the right of left_marker and the element immediately to the left of right_marker are less than upper_bound apart.
If the previous step ever returns false, the clustering is impossible.
Otherwise, the markers form the boundaries of the desired clusters.
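
The steps above could be sketched in R roughly as follows. This is only one possible reading of the algorithm: cluster_by_bounds is a hypothetical helper name, returning NULL for the "impossible" case is an assumption, and the input is assumed to be non-empty.

```r
# Hypothetical helper: returns a list of clusters, or NULL when the
# marker-based split violates the upper bound. Assumes vec is non-empty.
cluster_by_bounds <- function(vec, lower_bound, upper_bound) {
  vec <- sort(vec)
  # Markers: a new cluster starts after every gap wider than lower_bound
  group_id <- cumsum(c(TRUE, diff(vec) > lower_bound))
  clusters <- unname(split(vec, group_id))
  # Span of a cluster = last element minus first element (vec is sorted)
  spans <- vapply(clusters, function(v) as.numeric(v[length(v)] - v[1]), numeric(1))
  # Every span must be strictly less than upper_bound
  if (any(spans >= upper_bound)) return(NULL)
  clusters
}

cluster_by_bounds(c(1, 4, 5, 6, 9, 29, 32, 36), lower_bound = 19, upper_bound = 9)
```

With the question's vector and bounds, this returns the two clusters (1, 4, 5, 6, 9) and (29, 32, 36).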

Applying this to your example, we get:

Sorted: 1, 4, 5, 6, 9, 29, 32, 36
Markers: 1, 4, 5, 6, 9 | 29, 32, 36
Additional start/end markers: | 1, 4, 5, 6, 9 | 29, 32, 36 |
Check "upper bound" constraint: (9 - 1) = 8 < 9: TRUE; (36 - 29) = 7 < 9: TRUE
None of the comparisons returned false
Desired clustering: (1, 4, 5, 6, 9), (29, 32, 36)

EDIT: Original poster relaxed the conditions of the problem.

If you only want to satisfy the lower bound condition:

Ensure your list is sorted.
Place a marker between every pair of consecutive elements that are more than lower_bound apart.
Include a marker before the beginning and after the end.
These markers form the boundaries of the desired clustering.

The following gets you step 2 assuming your vector is already sorted:

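# Given (vec must already be sorted)
vec <- c(1, 4, 5, 6, 9, 29, 32, 36)
lower_bound <- 19

# TRUE when the gap between element x and element x + 1 exceeds lower_bound
f <- function(x) {
  vec[x + 1] - vec[x] > lower_bound
}
indices <- seq_len(length(vec) - 1)
# Filter() keeps every index where f is TRUE; Position() would return only the first
marker_positions <- Filter(f, indices)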
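
The remaining steps (adding the outer markers and cutting the vector between them) could be finished along these lines. This is only a sketch: which(diff(...)) is swapped in as a vectorised way to find the gaps, and starts, ends and clusters are illustrative names.

```r
vec <- c(1, 4, 5, 6, 9, 29, 32, 36)  # assumed to be sorted already
lower_bound <- 19

# Indices i where the gap between vec[i] and vec[i + 1] exceeds lower_bound
marker_positions <- which(diff(vec) > lower_bound)

# Add the start and end markers, then cut the vector between them
starts <- c(1, marker_positions + 1)
ends <- c(marker_positions, length(vec))
clusters <- Map(function(s, e) vec[s:e], starts, ends)
```

For this vector the only marker falls after the fifth element, so clusters holds (1, 4, 5, 6, 9) and (29, 32, 36).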

Source: https://stackoverflow.com/questions/17228737/clustering-by-distance-in-r

Tags

cluster-analysis