Number clustering/partitioning algorithm

怎甘沉沦 提交于 2019-12-07 18:43:55

问题


I have an ordered 1-D array of numbers. Both the array length and the values of the numbers in the array are arbitrary. I want to partition the array into k partitions, according to the number values, e.g. let's say I want 4 partitions, distributed as 30% / 30% / 20% / 20%, i.e. the top 30% values first, the next 30% afterwards, etc. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it should not be contained in two different partitions. This means that the distribution percentages above are not strict, but rather the "goals" or "starting points" if you wish.

For example, let's say my array is ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8].

I choose k = 4 and the numbers should be distributed into partitions A, B, C and D with percentages pA = pB = pC = pD = 25%.

Given the constraints I gave above, the resulting partitions should be:

A = [1] B = [5, 5] C = [6, 7] D = [8, 8, 8, 8, 8]

with resulting (achieved/corrected) percentages pcA = 10%, pcB = 20%, pcC = 20%, pcD = 50%

It seems to me that I need a modified k-means algorithm, because the standard algorithm is not guaranteed to respect my percentages and/or the requirement that the same value cannot be in more than one cluster/partition.

So, is there an algorithm for this kind of clustering?


回答1:


Clustering algorithms are used on multi-dimensional data. For one-dimensional data, you should simply use a sorting algorithm.

Sort the data. Then partition the data-set linearly working from the bottom of the array to the top, as per your example.




回答2:


Here's a dynamic programming solution that finds a partition that minimizes the sum of squares of the errors in the sizes of the parts. So in your example of [1, 5, 5, 6, 7, 8, 8, 8, 8, 8], you want parts of size (2.5, 2.5, 2.5, 2.5) and the result given by this code is (9.0, (1, 2, 2, 5)). That means the partitions chosen were of size 1, 2, 2 and 5, and the total error is 9 = (2.5-1)^2 + (2.5-2)^2 + (2.5-2)^2 + (2.5-5)^2.

def partitions(a, i, sizes, cache):
    """Find a least-cost partition of a[i:].

    The ideal sizes of the partitions are stored in the tuple 'sizes'
    and cache is used to memoize previously calculated results.
    """
    key = (i, sizes)
    if key in cache: return cache[key]
    if len(sizes) == 1:
        segment = len(a) - i
        result = (segment - sizes[0]) ** 2, (segment,)
        cache[key] = result
        return result
    best_cost, best_partition = None, None
    for j in xrange(len(a) - i + 1):
        if 0 < j < len(a) - i and a[i + j - 1] == a[i + j]:
            # Avoid breaking a run of one number.
            continue
        bc, bp = partitions(a, i + j, sizes[1:], cache)
        c = (j - sizes[0]) ** 2 + bc
        if best_cost is None or c < best_cost:
            best_cost = c
            best_partition = (j,) + bp
    cache[key] = (best_cost, best_partition)
    return cache[key]


ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
sizes = (len(ar) * 0.25,) * 4
print partitions(ar, 0, (2.5, 2.5, 2.5, 2.5), {})



回答3:


The naive approach would go like this:

Say p1...pk are the percentages for your partitions (p1+...+pk = 1)

Say you have N elements in the array

The initial boundaries (there's k+1 of them, including the array ends, since you have k partitions)are: 0, p1*N, (p1+p2)*N, ..., N (there'll be some rounding to do).

For moving the boundaries, you look at the two array elements on each side of a boundary (for the k-1 boundaries that you can move). If the two elements are equal, you need to move to boundary, either left of right, at least until the constraint is satisfied. A naive approach would be to start on the left and do minimal adjustments (just adjust the constraint to the side that causes the least movement, and don't move the boundary any further).

This algorithm doesn't cover the whole space of partitions though. It just gives you one solution. To find the best solution, you'd need to do a brute-force search on the entire partition space, with some kind of pruning (e.g. dynamic programming, where you remember the best partitioning for a subarray of the initial array).



来源:https://stackoverflow.com/questions/8140036/number-clustering-partitioning-algorithm

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!