grouping objects to achieve a similar mean property for all groups


You might try using k-means clustering:

import scipy.cluster.vq as vq
import collections
import numpy as np

def auto_cluster(data,threshold=0.1,k=1):
    # There are more sophisticated ways of determining k
    # See http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
    data=np.asarray(data)
    distortion=1e20
    while distortion>threshold:
        codebook,distortion=vq.kmeans(data,k)
        k+=1   
    code,dist=vq.vq(data,codebook)    
    groups=collections.defaultdict(list)
    for index,datum in zip(code,data):
        groups[index].append(datum)
    return groups

np.random.seed(784789)
N=20
weights=100*np.random.random(N)
groups=auto_cluster(weights,threshold=1.5,k=N//5)
for index,data in enumerate(sorted(groups.values(),key=lambda d: np.mean(d))):
    print('{i}: {d}'.format(i=index,d=data))

The code above generates a random sequence of N weights. It uses scipy.cluster.vq.kmeans to partition the sequence into k clusters of numbers that lie close together. If the distortion is above the threshold, k-means is recomputed with k increased by one; this repeats until the distortion falls below the given threshold.

It yields clusters such as this:

0: [4.9062151907551366]
1: [13.545565038022112, 12.283828883935065]
2: [17.395300245930066]
3: [28.982058040201832, 30.032607500871023, 31.484125759701588]
4: [35.449637591061979]
5: [43.239840915978043, 48.079844689518424, 40.216494950261506]
6: [52.123246083619755, 53.895726546070463]
7: [80.556052179748079, 80.925071671718413, 75.211470587171803]
8: [86.443868931310249, 82.474064251040375, 84.088655128258964]
9: [93.525705849369416]

Note that the k-means clustering algorithm uses random guesses to initially pick centers of the k groups. This means that repeated runs of the same code can produce different results, particularly if the weights do not separate themselves into clearly distinct groups.
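If you need repeatable groupings, one option is to pin the random state right before clustering. A minimal sketch, assuming the default behaviour where kmeans draws its initial center guesses from NumPy's global RNG (newer SciPy releases also accept an explicit seed keyword); the seed value and k=4 here are arbitrary:

import numpy as np
import scipy.cluster.vq as vq

weights = 100*np.random.random(20)

np.random.seed(0)   # pin the RNG used for the initial center guesses; k=4 is arbitrary
codebook, distortion = vq.kmeans(weights, 4)
print(codebook, distortion)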

You'll also have to twiddle the threshold parameter to produce the desired number of groups.

The program that follows is a low-cost heuristic. It distributes the values among "buckets", placing large values together with small ones by taking values from one end of a sorted list on one round and from the other end on the next. Doing the distribution round-robin guarantees that the rules about the number of elements per bucket are met. It is a heuristic rather than an exact algorithm: it tends to produce good solutions, but there is no guarantee that better ones don't exist.

In theory, if there are enough values and they are uniformly or normally distributed, then just randomly placing the values in the buckets will already tend to give the buckets similar means. Assuming that the dataset is small, this heuristic improves the chances of a good solution. Knowing more about the size and statistical distribution of the datasets would help devise a better heuristic, or an algorithm.
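
As a quick, self-contained check of that claim (the bucket count and value range below are arbitrary and not part of the heuristic):

import random
import statistics

random.seed(0)
values = [random.uniform(0, 1000) for _ in range(1000)]

# Deal the values into 10 buckets round-robin; with this many uniformly
# distributed values the bucket means already land close to each other.
buckets = [values[i::10] for i in range(10)]
print([round(statistics.mean(b)) for b in buckets])

The heuristic itself follows.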

from random import randint, seed
from itertools import cycle, chain

def chunks(q, n):
    # Split the sequence q into consecutive pieces of length n.
    q = list(q)
    for i in range(0, len(q), n):
        yield q[i:i+n]

def shuffle(q, n):
    # Interleave chunks taken from the low half of the sorted order with
    # chunks taken (reversed) from the high half, so each round of the
    # round-robin mixes small and large values.
    q = list(q)
    m = len(q)//2
    left = list(chunks(q[:m], n))
    right = list(chunks(reversed(q[m:]), n)) + [[]]
    return chain(*(a+b for a, b in zip(left, right)))

def listarray(n):
    return [list() for _ in range(n)]

def mean(q):
    return sum(q)/len(q)

def report(q):
    for x in q:
        print(mean(x), len(x), x)

SIZE = 5
COUNT = 37

#seed(SIZE)
data = [randint(1, 1000) for _ in range(COUNT)]
data = sorted(data)
NBUCKETS = (COUNT+SIZE-1) // SIZE

order = shuffle(range(COUNT), NBUCKETS)
posts = cycle(range(NBUCKETS))
buckets = listarray(NBUCKETS)
for o in order:
    i = next(posts)              # round-robin over the buckets
    buckets[i].append(data[o])
report(buckets)
print(mean(data))

Complexity is O(n log n) because of the sorting step; the distribution itself is linear. These are sample results:

439 5 [15, 988, 238, 624, 332]
447 5 [58, 961, 269, 616, 335]
467 5 [60, 894, 276, 613, 495]
442 5 [83, 857, 278, 570, 425]
422 5 [95, 821, 287, 560, 347]
442 4 [133, 802, 294, 542]
440 4 [170, 766, 301, 524]
418 4 [184, 652, 326, 512]
440

Note that the requirement on the size of the buckets dominates, which means that the means won't be close if the variance in the original data is large. You can try with this dataset:

data = sorted(data) + [100000]

The bucket containing 100000 will still receive at least three other values, so its mean will be far from the rest.

I came up with this heuristic thinking of what a group of kids would do if handed a pack of bills of different denominations and told to share them according to this game's rules. It is statistically reasonable, and O(n log n).

You could also try a centroid-based linkage algorithm, which achieves the same result.

See this for the code and this for understanding.

UPGMA (aka centroid-based) is what you probably want to do.
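
For example, with SciPy's hierarchical clustering tools something along these lines would group the weights (a sketch, assuming 1-D data; the choice of 4 flat clusters is arbitrary, and method='average' gives classic UPGMA instead of centroid linkage):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

weights = 100*np.random.random(20)

# linkage expects one observation per row, so reshape the 1-D weights
Z = linkage(weights.reshape(-1, 1), method='centroid')

# cut the tree into (at most) 4 flat clusters
labels = fcluster(Z, t=4, criterion='maxclust')
for group in range(1, labels.max()+1):
    print(group, weights[labels == group])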
