Selecting non-overlapping best quality clusters

Question

Say I have clustered my dataset and obtained 10 non-overlapping clusters. Now assume I change some feature in all my data points and cluster again, which gives 10 more clusters. If I repeat this 3 more times, I end up with 50 clusters. Each cluster has a score that is calculated from its constituent data points.

These 50 clusters now have overlapping data points. I want to select, out of these 50 clusters, a set of non-overlapping clusters with the highest total score.

One way is a greedy method: sort the clusters by score from highest to lowest, select the highest-scoring cluster, and then keep selecting clusters whose data points do not overlap with any already selected cluster. It is fast, but it does not seem to give an optimal solution.

Example: say I have 5 clusters with the following scores:

C1 = (A,B,C,D,E,F) Score = 10

C2 = (A,B,C) Score = 6

C3 = (D,E,F) Score = 6

C4 = (G,H,I,J) Score = 5

C5 = (K,L) Score = 7

The greedy approach will return {C1, C4, C5} with a total score of 10+5+7=22, whereas a better option is {C2, C3, C4, C5} with a total score of 6+6+5+7=24.

I am looking for another method that gives an optimal solution, or at least a better solution than the greedy approach above.
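For reference, the greedy procedure described above could be sketched in Python roughly as follows. The function name `greedy_select` and the dictionary layout are assumptions made for illustration, not part of the original question.

```python
def greedy_select(clusters, scores):
    """Pick clusters in descending score order, skipping any that overlap."""
    order = sorted(clusters, key=lambda name: scores[name], reverse=True)
    chosen, used = [], set()
    for name in order:
        members = clusters[name]
        # Accept the cluster only if it shares no data point with earlier picks.
        if used.isdisjoint(members):
            chosen.append(name)
            used |= members
    return chosen, sum(scores[name] for name in chosen)


# The five clusters from the example (hypothetical data layout).
clusters = {
    "C1": set("ABCDEF"),
    "C2": set("ABC"),
    "C3": set("DEF"),
    "C4": set("GHIJ"),
    "C5": set("KL"),
}
scores = {"C1": 10, "C2": 6, "C3": 6, "C4": 5, "C5": 7}

chosen, total = greedy_select(clusters, scores)
print(sorted(chosen), total)  # ['C1', 'C4', 'C5'] 22
```

Because C1 is taken first, C2 and C3 are both rejected, which is how the greedy total ends up at 22 instead of 24.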

Answer 1:

You can solve this using operations research techniques.

Model this as a set-partitioning problem with

Objective function: maximize score
Constraints: each data point is covered exactly once

and then solve it using a MIP solver or any other technique (such as hill climbing or a genetic algorithm). The scale of your problem is very small, so it is solvable by any optimization algorithm. I am also working on a similar problem, but in the airline crew scheduling domain. The scale of my problem is so huge that the possible crew schedules (equivalent to your clusters) number >zillion combinations for a flight schedule of ~4500 flights (equivalent to your data points) ;)
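Written out, the set-partitioning model above might look as follows. The symbols are notation chosen here for illustration: there are m candidate clusters C_j with scores s_j, and x_j = 1 means cluster j is selected.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{j=1}^{m} s_j x_j \\
\text{s.t.}\quad & \sum_{j \,:\, i \in C_j} x_j = 1 \quad \text{for every data point } i, \\
& x_j \in \{0, 1\} \quad \text{for } j = 1, \dots, m.
\end{aligned}
```

If some data points are allowed to stay uncovered, replacing the equality with "less than or equal to 1" gives the set-packing variant, which only forbids overlap. On the example from the question both variants give the same answer, since the best selection happens to cover every data point.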

I have coded your example in Python using the MIP solver from Gurobi, which is available free of charge for academic use. You can use other MIP solvers too.

Here is the Python code:

from gurobipy import Model, GRB, quicksum
import string

# Twelve data points, A through L.
data_points = string.ascii_uppercase[:12]

# Candidate clusters C1..C5 from the question; each string lists a cluster's data points.
clusters = [
    string.ascii_uppercase[:6],     # C1 = ABCDEF
    string.ascii_uppercase[:3],     # C2 = ABC
    string.ascii_uppercase[3:6],    # C3 = DEF
    string.ascii_uppercase[6:10],   # C4 = GHIJ
    string.ascii_uppercase[10:12],  # C5 = KL
]

# Incidence matrix: matrix[dp][i] is 1 if data point dp belongs to cluster i.
matrix = {dp: [0] * len(clusters) for dp in data_points}
for i, cluster in enumerate(clusters):
    for dp in cluster:
        matrix[dp][i] = 1

# Score of each cluster, in the same order as `clusters`.
cost = [10, 6, 6, 5, 7]

# Gurobi MIP model
m = Model("Jitin's cluster optimization problem")
m.params.outputflag = 1
# x[i] = 1 if cluster i is selected, 0 otherwise.
x = m.addVars(len(clusters), vtype=GRB.BINARY, name='x')
indices = range(len(clusters))

# Objective: maximize the total score of the selected clusters.
m.setObjective(quicksum(cost[i] * x[i] for i in indices), GRB.MAXIMIZE)

# Constraints: each data point is covered by exactly one selected cluster.
for j, dp in enumerate(data_points):
    m.addConstr(quicksum(matrix[dp][i] * x[i] for i in indices) == 1, "C" + str(j))

m.optimize()
print('Final Obj:', m.objVal)
m.write('results.sol')

The output of the code (contents of results.sol):

# Solution for model Jitin's cluster optimization problem
# Objective value = 24
x[0] 0
x[1] 1
x[2] 1
x[3] 1
x[4] 1
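For an instance this small, the result can be cross-checked without a solver. The following is a minimal sketch of an exhaustive backtracking search; the function name `best_disjoint` is made up for illustration. It assumes non-negative scores and enforces only non-overlap (the set-packing reading of the problem) instead of exact coverage. Its worst-case running time is exponential, so it is a sanity check and not a replacement for a MIP solver.

```python
def best_disjoint(clusters, scores):
    """Search all selections of pairwise-disjoint clusters for the best total score.

    Assumes every score is non-negative (needed for the pruning bound).
    """
    best = (0, [])

    def search(i, used, chosen, total):
        nonlocal best
        if total > best[0]:
            best = (total, list(chosen))
        if i == len(clusters):
            return
        # Prune: even taking every remaining cluster cannot beat the best so far.
        if total + sum(scores[i:]) <= best[0]:
            return
        # Branch 1: take cluster i if it does not overlap the current selection.
        if used.isdisjoint(clusters[i]):
            chosen.append(i)
            search(i + 1, used | clusters[i], chosen, total + scores[i])
            chosen.pop()
        # Branch 2: skip cluster i.
        search(i + 1, used, chosen, total)

    search(0, frozenset(), [], 0)
    return best


# Same clusters and scores as in the Gurobi example, indexed from 0.
clusters = [set("ABCDEF"), set("ABC"), set("DEF"), set("GHIJ"), set("KL")]
scores = [10, 6, 6, 5, 7]

total, chosen = best_disjoint(clusters, scores)
print(total, chosen)  # 24 [1, 2, 3, 4]
```

The selected indices 1 to 4 correspond to x[1] through x[4] in the solver output, that is, C2, C3, C4 and C5.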

Source: https://stackoverflow.com/questions/50263975/selecting-non-overlapping-best-quality-clusters

Tags

cluster-analysis

greedy