Sorting sub-lists into new sub-lists based on common first items


面向向阳花 2020-12-10 19:49

I have a large number of two-membered sub-lists that are members of a list called mylist:

mylist = [['AB001', 22100],
          ['AB001', 32


      
      
        
4 answers

暖寄归人 (OP) 2020-12-10 20:36

There are a number of alternatives to solve this problem:
# Plain dict with an explicit membership test.
def regroup_by_di(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        if key_value not in result:
            result[key_value] = []
        result[key_value].append(item)
    return result

import collections


# collections.defaultdict creates the empty list on first access to a key.
def regroup_by_dd(items, key=None):
    result = collections.defaultdict(list)
    callable_key = callable(key)
    for item in items:
        result[key(item) if callable_key else item].append(item)
    return dict(result)  # to be in line with other solutions

# dict.setdefault() inserts the empty list only if the key is missing.
def regroup_by_sd(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        result.setdefault(key_value, []).append(item)
    return result

import itertools


# Sort first so that equal keys are adjacent, then group with itertools.groupby().
def regroup_by_it(items, key=None):
    seq = sorted(items, key=key)
    result = {
        key_value: list(group)
        for key_value, group in itertools.groupby(seq, key)}
    return result

# Yields (key, slice of seq) for each run of equal keys;
# seq must support slicing (e.g. a list).
def group_by(seq, key=None):
    items = iter(seq)
    try:
        item = next(items)
    except StopIteration:
        return
    else:
        callable_key = callable(key)
        last = key(item) if callable_key else item
        i = j = 0
        for i, item in enumerate(items, 1):
            current = key(item) if callable_key else item
            if last != current:
                yield last, seq[j:i]
                last = current
                j = i
        if i >= j:
            yield last, seq[j:i + 1]


# Same sort-then-group idea, using the group_by() generator above.
def regroup_by_gb(items, key=None):
    return dict(group_by(sorted(items, key=key), key))

These can be divided into two categories:

looping through the input while building a dict-like structure (regroup_by_di(), regroup_by_dd(), regroup_by_sd())
sorting the input and then applying a uniq-like function such as itertools.groupby() (regroup_by_it(), regroup_by_gb())
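
The sorting step in the second category is not optional: itertools.groupby() only merges runs of adjacent items with equal keys, so on unsorted input the same key can come back more than once. A minimal sketch of that behaviour, using a made-up `pairs` list (hypothetical sample data, not OP's):

```python
import itertools
from operator import itemgetter

# Hypothetical sample data: the key 'a' appears in two separate runs.
pairs = [['a', 1], ['b', 2], ['a', 3]]
first = itemgetter(0)

# Without sorting, groupby() starts a new group whenever the key changes.
unsorted_keys = [k for k, _ in itertools.groupby(pairs, first)]

# After sorting by the same key, equal keys are adjacent and each key appears once.
sorted_groups = {
    k: list(g)
    for k, g in itertools.groupby(sorted(pairs, key=first), first)}

print(unsorted_keys)  # ['a', 'b', 'a']
print(sorted_groups)  # {'a': [['a', 1], ['a', 3]], 'b': [['b', 2]]}
```

Building a dict directly from the unsorted groupby() output would silently keep only the last run for each key, which is why both groupby-based functions sort first.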

The first class of approaches has O(n) computational complexity, while the second class has O(n log n) because of the sorting step.
All of the proposed approaches need a key function to group by anything other than the items themselves.
For OP's problem, operator.itemgetter(0) or lambda x: x[0] would work. Additionally, to get OP's desired result, one should keep only the values of the returned dict, i.e. list(result.values()), e.g.:
from operator import itemgetter


mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]


print(list(regroup_by_di(mylist, key=itemgetter(0)).values()))
# [[['AB001', 22100], ['AB001', 32935], ['AB001', 4439]], [['XC013', 99834], ['XC013', 86701]], [['VD126', 18884]]]
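
One practical difference between the two categories is the order of the groups. The sketch below assumes Python 3.7+ (where dicts preserve insertion order) and inlines the same ideas as regroup_by_sd() and regroup_by_it() so that it runs on its own; the names `by_dict` and `by_sort` are only illustrative.

```python
import itertools
from operator import itemgetter

mylist = [['AB001', 22100], ['AB001', 32935], ['XC013', 99834],
          ['VD126', 18884], ['AB001', 4439], ['XC013', 86701]]
first = itemgetter(0)

# Dict-based grouping: keys keep the order in which they were first seen.
by_dict = {}
for item in mylist:
    by_dict.setdefault(first(item), []).append(item)

# Sort-then-groupby: keys come out in sorted order.
by_sort = {
    k: list(g)
    for k, g in itertools.groupby(sorted(mylist, key=first), first)}

print(list(by_dict))       # ['AB001', 'XC013', 'VD126']
print(list(by_sort))       # ['AB001', 'VD126', 'XC013']
print(by_dict == by_sort)  # True
```

The groups themselves are identical (sorted() is stable, so items keep their relative order within each group); only the order of the keys differs, which matters if list(result.values()) is expected to follow the input order.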


In the timings, all dict-based (1st class) solutions are faster than all groupby-based (2nd class) solutions.
Among the dict-based solutions, performance depends slightly on the "collision rate", i.e. the fraction of items that start a new group instead of joining an existing one.
For higher collision rates, regroup_by_di() may be the fastest, while for lower collision rates regroup_by_dd() may be the fastest.
The benchmarks come out as follows:

[Figure: 0.1% collision rate (approx. 1000 elements per group)]
[Figure: 10% collision rate (approx. 10 elements per group)]
[Figure: 50% collision rate (approx. 2 elements per group)]
[Figure: 100% collision rate (approx. 1 element per group)]

(More details available here.)
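
For anyone who wants to try a comparison like this locally, here is a rough sketch using timeit. It is not the original benchmark code: `make_items()` is a hypothetical helper that spreads n pairs over about n * collision_rate distinct keys, the two grouping functions are standalone copies of the regroup_by_di() and regroup_by_dd() ideas, and the input size and repeat count are arbitrary choices.

```python
import collections
import random
import timeit
from operator import itemgetter


def make_items(n, collision_rate, seed=0):
    """Hypothetical helper: n pairs over about n * collision_rate distinct keys."""
    rng = random.Random(seed)
    num_keys = max(1, round(n * collision_rate))
    keys = [i % num_keys for i in range(n)]
    rng.shuffle(keys)  # avoid feeding the keys in a regular pattern
    return [[k, i] for i, k in enumerate(keys)]


def group_plain_dict(items, key):
    # Same idea as regroup_by_di(): explicit membership test.
    result = {}
    for item in items:
        k = key(item)
        if k not in result:
            result[k] = []
        result[k].append(item)
    return result


def group_defaultdict(items, key):
    # Same idea as regroup_by_dd(): defaultdict creates missing lists.
    result = collections.defaultdict(list)
    for item in items:
        result[key(item)].append(item)
    return dict(result)


first = itemgetter(0)
for rate in (0.001, 0.1, 0.5, 1.0):
    items = make_items(10_000, rate)
    for func in (group_plain_dict, group_defaultdict):
        seconds = timeit.timeit(lambda: func(items, first), number=5)
        print(f'{rate:>6.1%}  {func.__name__:<18} {seconds:.4f} s')
```

Absolute numbers will vary with the machine and Python version, so only the relative ordering at each collision rate is worth reading off.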
    
             
                                                        
            
            
              
                