How to determine if two partitions (clusterings) of data points are identical?

误落风尘 2020-12-11 19:27

I have n data points in some arbitrary space and I cluster them.
The result of my clustering algorithm is a partition represented by an int vector l

3 answers
  •  轻奢々 (original poster)
    2020-12-11 20:05

    If you relabel your partitions, as previously suggested, you may need to search through up to n labels for each of the n items, which makes those solutions O(n^2).

    Here is my idea: scan through both lists simultaneously, maintaining a counter for each partition label in each list. You need a map from partition labels to counter indices, assigned in order of first appearance. If, at any position, the two lists disagree on the counter index or on the running count, then the partitions do not match. With a hash map for the lookups, this is O(n) on average.

    Here is a proof of concept in Python:

    l_1 = [ 1, 1, 1, 0, 0, 2, 6 ]
    
    l_2 = [ 2, 2, 2, 9, 9, 3, 1 ]
    
    l_3 = [ 2, 2, 2, 9, 9, 3, 3 ]
    
    d1 = dict()
    d2 = dict()
    c1 = []
    c2 = []
    
    # assume lists same length
    
    match = True
    for i in range(len(l_1)):
        if l_1[i] not in d1:
            x1 = len(c1)
            d1[l_1[i]] = x1
            c1.append(1)
        else:
            x1 = d1[l_1[i]]
            c1[x1] += 1
    
        if l_2[i] not in d2:
            x2 = len(c2)
            d2[l_2[i]] = x2
            c2.append(1)
        else:
            x2 = d2[l_2[i]]
            c2[x2] += 1
    
        # Both labels must map to the same first-seen index, with equal running counts.
        if x1 != x2 or c1[x1] != c2[x2]:
            match = False
            break
    
    print("match = {}".format(match))
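The same idea can be packaged as a reusable function. The following is a minimal sketch, not the answerer's code: the names `canonical_labels` and `same_partition` are hypothetical, and it assumes the labels are hashable. It relabels each list by order of first appearance and compares the results, which takes O(n) time on average with dictionary lookups.

```python
def canonical_labels(labels):
    """Relabel by order of first appearance, e.g. [7, 7, 4] -> [0, 0, 1]."""
    first_seen = {}  # original label -> canonical index
    # setdefault stores len(first_seen) only when the label is new,
    # and returns the stored index either way.
    return [first_seen.setdefault(label, len(first_seen)) for label in labels]


def same_partition(a, b):
    """Return True if label lists a and b describe the same partition."""
    if len(a) != len(b):
        return False
    return canonical_labels(a) == canonical_labels(b)


l_1 = [1, 1, 1, 0, 0, 2, 6]
l_2 = [2, 2, 2, 9, 9, 3, 1]
l_3 = [2, 2, 2, 9, 9, 3, 3]

print(same_partition(l_1, l_2))  # True: same grouping, different label names
print(same_partition(l_1, l_3))  # False: l_3 merges the last two points
```

Comparing canonical labels makes the per-label counters unnecessary: if every position maps to the same first-seen index in both lists, the group sizes agree automatically.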
    
