Question
Say I have clustered my dataset and obtained 10 clusters. These clusters are non-overlapping. Now suppose I change some feature in all my data points and cluster again, which gives me 10 more clusters. If I repeat this, say, 3 more times, I end up with 50 clusters. Each cluster has a score that is calculated from its constituent data points.
These 50 clusters now share data points. From these 50 clusters, I want to select a set of non-overlapping clusters with the highest total score.
One way is a greedy method: sort the clusters by score from highest to lowest, select the highest-scoring cluster, and then keep selecting clusters whose data points do not overlap with those of the clusters already selected. This is fast, but it does not seem to give the optimal solution.
Example: say I have 5 clusters with the following scores:
C1 = (A,B,C,D,E,F) Score = 10
C2 = (A,B,C) Score = 6
C3 = (D,E,F) Score = 6
C4 = (G,H,I,J) Score = 5
C5 = (K,L) Score = 7
The greedy approach returns {C1, C4, C5} with a total score of 10+5+7=22, whereas a better option is {C2, C3, C4, C5} with a total score of 6+6+5+7=24.
I am looking for another method that gives an optimal solution, or at least a better solution than the greedy approach described above.
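For reference, the greedy procedure described above can be sketched in a few lines of Python. This is only an illustration on the five example clusters; the names `clusters` and `greedy_select` are made up for this sketch.

```python
# Clusters and scores copied from the example above.
clusters = {
    "C1": (set("ABCDEF"), 10),
    "C2": (set("ABC"), 6),
    "C3": (set("DEF"), 6),
    "C4": (set("GHIJ"), 5),
    "C5": (set("KL"), 7),
}


def greedy_select(clusters):
    """Pick clusters by descending score, skipping any that overlap earlier picks."""
    chosen, covered, total = [], set(), 0
    by_score = sorted(clusters.items(), key=lambda item: item[1][1], reverse=True)
    for name, (points, score) in by_score:
        if covered.isdisjoint(points):  # no data point shared with earlier picks
            chosen.append(name)
            covered |= points
            total += score
    return chosen, total


chosen, total = greedy_select(clusters)
print(sorted(chosen), total)  # ['C1', 'C4', 'C5'] 22
```

Because C1 is taken first, both C2 and C3 are blocked, which is exactly why the greedy total stops at 22.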
Answer 1:
You can solve this using operations research techniques.
Model this problem as a set-partitioning problem with
Objective function: maximize score
Constraints: each data point is covered exactly once
and then solve it with a MIP solver or any other technique (such as hill climbing, a genetic algorithm, etc.). Your problem is very small, so any of these approaches can handle it easily. I am working on a similar problem in the airline crew scheduling domain. My problem is so large that the number of possible crew schedules (equivalent to your clusters) is astronomical for a flight schedule of ~4500 flights (equivalent to your data points) ;)
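Written out, the model is roughly the following. The notation is introduced here for illustration only: $n$ is the number of candidate clusters, $s_i$ is the score of cluster $i$, $a_{pi}$ is 1 if data point $p$ belongs to cluster $i$ and 0 otherwise, and $x_i$ is 1 if cluster $i$ is selected.

```latex
\begin{aligned}
\max_{x} \quad & \sum_{i=1}^{n} s_i x_i \\
\text{s.t.} \quad & \sum_{i=1}^{n} a_{pi} x_i = 1 \quad \text{for every data point } p, \\
& x_i \in \{0, 1\} \quad \text{for } i = 1, \dots, n.
\end{aligned}
```

If some data points may be left uncovered, the equality can be relaxed to $\le 1$, which turns the model into a set-packing problem. That variant only requires the selected clusters to be non-overlapping.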
I have coded your example in Python using the MIP solver from Gurobi, which is available free of charge for academic use. You can use other MIP solvers too.
Here is the Python code:
import string

import gurobipy as gp
from gurobipy import GRB

# The 12 data points A..L
data_points = string.ascii_uppercase[:12]

# Candidate clusters C1..C5, each given as a string of its data points
clusters = [
    string.ascii_uppercase[:6],     # C1 = A..F
    string.ascii_uppercase[:3],     # C2 = A..C
    string.ascii_uppercase[3:6],    # C3 = D..F
    string.ascii_uppercase[6:10],   # C4 = G..J
    string.ascii_uppercase[10:12],  # C5 = K..L
]
cost = [10, 6, 6, 5, 7]  # score of each cluster

# Incidence matrix: matrix[dp][i] == 1 if data point dp belongs to cluster i
matrix = {dp: [0] * len(clusters) for dp in data_points}
for i, cluster in enumerate(clusters):
    for dp in cluster:
        matrix[dp][i] = 1

# Gurobi MIP model
m = gp.Model("Jitin's cluster optimization problem")
m.Params.OutputFlag = 1

# x[i] == 1 if cluster i is selected, 0 otherwise
x = m.addVars(len(clusters), vtype=GRB.BINARY, name='x')
indices = range(len(clusters))

# Objective: maximize the total score of the selected clusters
m.setObjective(gp.quicksum(cost[i] * x[i] for i in indices), GRB.MAXIMIZE)

# Constraints: each data point is covered exactly once
for j, dp in enumerate(data_points):
    m.addConstr(gp.quicksum(matrix[dp][i] * x[i] for i in indices) == 1, name="C" + str(j))

m.optimize()
print('Final Obj:', m.ObjVal)
m.write('results.sol')
The output of the code (the contents of results.sol):
# Solution for model Jitin's cluster optimization problem
# Objective value = 24
x[0] 0
x[1] 1
x[2] 1
x[3] 1
x[4] 1
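As a cross-check that needs no solver, the same example can be brute-forced by trying every subset of clusters. This is a sketch rather than part of the Gurobi model above: the helper name `best_disjoint_subset` is made up, and it only requires the selected clusters to be non-overlapping (data points may stay uncovered). That matches the exactly-once model on this example, because the best selection happens to cover every point.

```python
from itertools import combinations

# Clusters and scores from the question's example.
clusters = {
    "C1": (set("ABCDEF"), 10),
    "C2": (set("ABC"), 6),
    "C3": (set("DEF"), 6),
    "C4": (set("GHIJ"), 5),
    "C5": (set("KL"), 7),
}


def best_disjoint_subset(clusters):
    """Try every subset of clusters and keep the best pairwise-disjoint one."""
    names = list(clusters)
    best_names, best_score = (), 0
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            total_size = sum(len(clusters[n][0]) for n in combo)
            union = set().union(*(clusters[n][0] for n in combo))
            if len(union) != total_size:  # some data point appears twice
                continue
            score = sum(clusters[n][1] for n in combo)
            if score > best_score:
                best_names, best_score = combo, score
    return best_names, best_score


best_names, best_score = best_disjoint_subset(clusters)
print(best_names, best_score)  # ('C2', 'C3', 'C4', 'C5') 24
```

The search visits all 2^n subsets, so it is only practical for a handful of clusters. With 50 clusters a MIP solver is the more realistic choice.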
Source: https://stackoverflow.com/questions/50263975/selecting-non-overlapping-best-quality-clusters