grouping objects to achieve a similar mean property for all groups


You might try using k-means clustering:

import scipy.cluster.vq as vq
import collections
import numpy as np

def auto_cluster(data,threshold=0.1,k=1):
    # There are more sophisticated ways of determining k
    # See http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
    data=np.asarray(data)
    distortion=1e20
    while distortion>threshold:
        codebook,distortion=vq.kmeans(data,k)
        k+=1   
    code,dist=vq.vq(data,codebook)    
    groups=collections.defaultdict(list)
    for index,datum in zip(code,data):
        groups[index].append(datum)
    return groups

np.random.seed(784789)
N=20
weights=100*np.random.random(N)
groups=auto_cluster(weights,threshold=1.5,k=N//5)
for index,data in enumerate(sorted(groups.values(),key=lambda d: np.mean(d))):
    print('{i}: {d}'.format(i=index,d=data))

The code above generates a random sequence of N weights. It uses scipy.cluster.vq.kmeans to partition the sequence into k clusters of numbers that lie close together. If the distortion is above the threshold, k-means is recomputed with k increased by one; this repeats until the distortion falls below the given threshold.

It yields clusters such as this:

0: [4.9062151907551366]
1: [13.545565038022112, 12.283828883935065]
2: [17.395300245930066]
3: [28.982058040201832, 30.032607500871023, 31.484125759701588]
4: [35.449637591061979]
5: [43.239840915978043, 48.079844689518424, 40.216494950261506]
6: [52.123246083619755, 53.895726546070463]
7: [80.556052179748079, 80.925071671718413, 75.211470587171803]
8: [86.443868931310249, 82.474064251040375, 84.088655128258964]
9: [93.525705849369416]

Note that the k-means clustering algorithm uses random guesses to initially pick centers of the k groups. This means that repeated runs of the same code can produce different results, particularly if the weights do not separate themselves into clearly distinct groups.
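If you need repeatable groupings, one option is to pin the random state right before clustering. A minimal sketch, assuming the default behaviour where kmeans draws its initial center guesses from NumPy's global RNG (newer SciPy releases also accept an explicit seed keyword); the seed value and k=4 here are arbitrary:

import numpy as np
import scipy.cluster.vq as vq

weights = 100*np.random.random(20)

np.random.seed(0)   # pin the RNG used for the initial center guesses; k=4 is arbitrary
codebook, distortion = vq.kmeans(weights, 4)
print(codebook, distortion)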

You'll also have to twiddle the threshold parameter to produce the desired number of groups.

The program that follows is a low-cost heuristic. It distributes the values among "buckets", placing large values together with small ones by taking values from one end of a sorted list on one round and from the other end on the next. Doing the distribution round-robin guarantees that the rules about the number of elements per bucket are met. It is a heuristic rather than an exact algorithm: it tends to produce good solutions, but there is no guarantee that better ones don't exist.

In theory, if there are enough values and they are uniformly or normally distributed, then just randomly placing the values in the buckets will already tend to give the buckets similar means. Assuming that the dataset is small, this heuristic improves the chances of a good solution. Knowing more about the size and statistical distribution of the datasets would help devise a better heuristic, or an algorithm.
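
As a quick, self-contained check of that claim (the bucket count and value range below are arbitrary and not part of the heuristic):

import random
import statistics

random.seed(0)
values = [random.uniform(0, 1000) for _ in range(1000)]

# Deal the values into 10 buckets round-robin; with this many uniformly
# distributed values the bucket means already land close to each other.
buckets = [values[i::10] for i in range(10)]
print([round(statistics.mean(b)) for b in buckets])

The heuristic itself follows.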

from random import randint, seed
from itertools import cycle, chain

def chunks(q, n):
    # Split the sequence q into consecutive pieces of length n.
    q = list(q)
    for i in range(0, len(q), n):
        yield q[i:i+n]

def shuffle(q, n):
    # Interleave chunks taken from the low half of the sorted order with
    # chunks taken (reversed) from the high half, so each round of the
    # round-robin mixes small and large values.
    q = list(q)
    m = len(q)//2
    left = list(chunks(q[:m], n))
    right = list(chunks(reversed(q[m:]), n)) + [[]]
    return chain(*(a+b for a, b in zip(left, right)))

def listarray(n):
    return [list() for _ in range(n)]

def mean(q):
    return sum(q)/len(q)

def report(q):
    for x in q:
        print(mean(x), len(x), x)

SIZE = 5
COUNT = 37

#seed(SIZE)
data = [randint(1, 1000) for _ in range(COUNT)]
data = sorted(data)
NBUCKETS = (COUNT+SIZE-1) // SIZE

order = shuffle(range(COUNT), NBUCKETS)
posts = cycle(range(NBUCKETS))
buckets = listarray(NBUCKETS)
for o in order:
    i = next(posts)              # round-robin over the buckets
    buckets[i].append(data[o])
report(buckets)
print(mean(data))

Complexity is O(n log n) because of the sorting step; the distribution itself is linear. These are sample results:

439 5 [15, 988, 238, 624, 332]
447 5 [58, 961, 269, 616, 335]
467 5 [60, 894, 276, 613, 495]
442 5 [83, 857, 278, 570, 425]
422 5 [95, 821, 287, 560, 347]
442 4 [133, 802, 294, 542]
440 4 [170, 766, 301, 524]
418 4 [184, 652, 326, 512]
440

Note that the requirement on the size of the buckets dominates, which means that the means won't be close if the variance in the original data is large. You can try with this dataset:

data = sorted(data) + [100000]

The bucket containing 100000 will still receive at least three other values, so its mean will be far from the rest.

I came up with this heuristic thinking of what a group of kids would do if handed a pack of bills of different denominations and told to share them according to this game's rules. It is statistically reasonable, and O(n log n).

You could also try a centroid-based linkage algorithm, which achieves the same result.

See this for the code and this for understanding.

UPGMA (aka centroid-based) is what you probably want to do.
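
For example, with SciPy's hierarchical clustering tools something along these lines would group the weights (a sketch, assuming 1-D data; the choice of 4 flat clusters is arbitrary, and method='average' gives classic UPGMA instead of centroid linkage):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

weights = 100*np.random.random(20)

# linkage expects one observation per row, so reshape the 1-D weights
Z = linkage(weights.reshape(-1, 1), method='centroid')

# cut the tree into (at most) 4 flat clusters
labels = fcluster(Z, t=4, criterion='maxclust')
for group in range(1, labels.max()+1):
    print(group, weights[labels == group])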
