How to determine if two partitions (clusterings) of data points are identical?

误落风尘 2020-12-11 19:27
I have n data points in some arbitrary space and I cluster them.
The result of my clustering algorithm is a partition represented by an int vector l

      
      
        
3 answers

半阙折子戏 (OP) 2020-12-11 20:06
              

            
            
                        
When are two partitions identical?

Probably if they have the exact same members.

So if you just want to test for identity, you can do the following:

Substitute each partition ID with the smallest object ID in the partition.

Then two partitionings are identical if and only if this representation is identical.

In your example above, let's assume the vector indices 1 .. 7 are your object IDs. Then I would get the canonical form

[ 1 1 1 4 4 6 7 ]
  ^ first occurrence at pos 1 of 1 in l_1 / 2 in l_2
        ^ first occurrence at pos 4


for  l_1 and l_2, whereas l_3 canonicalizes to

[ 1 1 1 4 4 6 6 ]


To make it clearer, here is another example:

l_4 = [ A B 0 D 0 B A ]


canonicalizes to

      [ 1 2 3 4 3 2 1 ]


since the first occurrence of cluster "A" is at position 1, "B" at position 2, and so on.

If you want to measure how similar two clusterings are, a good approach is to look at precision/recall/f1 of the object pairs, where the pair (a,b) exists if and only if a and b belong to the same cluster.
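As a rough illustration of that pair-based comparison, here is a minimal Python sketch. The function names `co_clustered_pairs` and `pair_scores` are made up for this example, and treating an empty pair set as precision or recall 1.0 is an assumption, not something the approach dictates. Note that enumerating all pairs is quadratic in the number of objects, so this is only meant for small inputs.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pair_scores(reference, candidate):
    """Pair-counting precision, recall and F1 of `candidate` against `reference`."""
    if len(reference) != len(candidate):
        raise ValueError("both label vectors must describe the same objects")
    ref = co_clustered_pairs(reference)
    cand = co_clustered_pairs(candidate)
    common = len(ref & cand)
    # Assumed convention: with no pairs to judge, the score defaults to 1.0.
    precision = common / len(cand) if cand else 1.0
    recall = common / len(ref) if ref else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The second clustering merges the last two objects, which adds one extra pair:
# precision 0.8, recall 1.0, F1 about 0.89.
print(pair_scores([1, 1, 1, 0, 0, 2, 6], [2, 2, 2, 9, 9, 3, 3]))
```

Identical partitions score 1.0 on all three measures, regardless of how the cluster IDs are named.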

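Update: Since it was claimed that this is quadratic, I will clarify further.

To produce the canonical form, use the following approach (actual Python code). It makes a single pass over the vector with one dictionary lookup per element, so it runs in expected linear time: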

def canonical_form(li):
  """ Note, this implementation overwrites li """
  first = dict()  # cluster label -> index of its first occurrence
  for i in range(len(li)):
    v = first.get(li[i])
    if v is None:
      # First time this label is seen: its position becomes the cluster's ID.
      first[li[i]] = i
      v = i
    li[i] = v
  return li

# Positions are 0-based here, so the results are shifted by one
# relative to the 1-based examples above.
print(canonical_form([ 1, 1, 1, 0, 0, 2, 6 ]))
# [0, 0, 0, 3, 3, 5, 6]
print(canonical_form([ 2, 2, 2, 9, 9, 3, 1 ]))
# [0, 0, 0, 3, 3, 5, 6]
print(canonical_form([ 2, 2, 2, 9, 9, 3, 3 ]))
# [0, 0, 0, 3, 3, 5, 5]
print(canonical_form(['A','B',0,'D',0,'B','A']))
# [0, 1, 2, 3, 2, 1, 0]
print(canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,1]))
# True
print(canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,3]))
# False
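Since `canonical_form` overwrites its argument, a non-mutating variant may be more convenient when the label vectors are needed afterwards. The sketch below is one possible way to write it; `canonical_labels` and `same_partition` are names invented here, not part of the answer above, and the length check is an added assumption that both vectors describe the same objects.

```python
def canonical_labels(labels):
    """Return the canonical form as a new list; `labels` is left untouched."""
    first = {}  # cluster label -> 0-based index of its first occurrence
    # setdefault stores i only the first time a label is seen and
    # otherwise returns the index recorded earlier.
    return [first.setdefault(label, i) for i, label in enumerate(labels)]


def same_partition(a, b):
    """True if the two label vectors describe the same partition."""
    return len(a) == len(b) and canonical_labels(a) == canonical_labels(b)


l_1 = [1, 1, 1, 0, 0, 2, 6]
print(same_partition(l_1, [2, 2, 2, 9, 9, 3, 1]))  # True
print(same_partition(l_1, [2, 2, 2, 9, 9, 3, 3]))  # False
print(l_1)  # still [1, 1, 1, 0, 0, 2, 6]
```

Like the original, this is a single pass with dictionary lookups, so the comparison stays linear in the number of objects on average.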

    
             
                                                        
            
            
              
                