Sorting sub-lists into new sub-lists based on common first items

面向向阳花 2020-12-10 19:49

I have a large number of two-membered sub-lists that are members of a list called mylist:

mylist = [['AB001', 22100],
          ['AB001', 32


        
4 Answers
  •  暖寄归人
    2020-12-10 20:36

    There are a number of alternatives to solve this problem:

    def regroup_by_di(items, key=None):
        result = {}
        callable_key = callable(key)
        for item in items:
            key_value = key(item) if callable_key else item
            if key_value not in result:
                result[key_value] = []
            result[key_value].append(item)
        return result
    
    import collections
    
    
    def regroup_by_dd(items, key=None):
        result = collections.defaultdict(list)
        callable_key = callable(key)
        for item in items:
            result[key(item) if callable_key else item].append(item)
        return dict(result)  # to be in line with other solutions
    
    def regroup_by_sd(items, key=None):
        result = {}
        callable_key = callable(key)
        for item in items:
            key_value = key(item) if callable_key else item
            result.setdefault(key_value, []).append(item)
        return result
    
    import itertools
    
    
    def regroup_by_it(items, key=None):
        seq = sorted(items, key=key)
        result = {
            key_value: list(group)
            for key_value, group in itertools.groupby(seq, key)}
        return result
    
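    def group_by(
            seq,
            key=None):
        # Yields (key, group) pairs for runs of consecutive equal keys.
        # seq must support slicing (e.g. a list): groups are taken as seq[j:i].
        items = iter(seq)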
        try:
            item = next(items)
        except StopIteration:
            return
        else:
            callable_key = callable(key)
            last = key(item) if callable_key else item
            i = j = 0
            for i, item in enumerate(items, 1):
                current = key(item) if callable_key else item
                if last != current:
                    yield last, seq[j:i]
                    last = current
                    j = i
            if i >= j:
                yield last, seq[j:i + 1]
    
    
    def regroup_by_gb(items, key=None):
        return dict(group_by(sorted(items, key=key), key))
    

    These can be divided into two categories:

    1. loop through the input, building a dict-like structure (regroup_by_di(), regroup_by_dd(), regroup_by_sd())
    2. sort the input and then apply a uniq-like function such as itertools.groupby() (regroup_by_it(), regroup_by_gb())

    The first class of approaches has O(n) computational complexity (assuming constant-time dict operations on average), while the second class has O(n log n) because of the sorting step.
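
    To make the two categories concrete, here is a minimal sketch on a small made-up input (the `data` list is illustrative only, not taken from the question). It also shows why the second category needs the sort: itertools.groupby() only merges adjacent items with equal keys.

```python
import itertools
from operator import itemgetter

# Illustrative data only; keys are deliberately interleaved.
data = [['b', 1], ['a', 2], ['b', 3], ['a', 4]]
key = itemgetter(0)

# Category 1: one pass over the input, appending to a dict of lists.
by_dict = {}
for item in data:
    by_dict.setdefault(key(item), []).append(item)

# Category 2: sort by the key, then collapse runs of equal keys.
by_sort = {k: list(g)
           for k, g in itertools.groupby(sorted(data, key=key), key)}

# Without the sort, groupby only merges *adjacent* equal keys, so the
# same key appears in several separate runs.
unsorted_runs = [(k, list(g)) for k, g in itertools.groupby(data, key)]

print(by_dict == by_sort)               # True
print([k for k, _ in unsorted_runs])    # ['b', 'a', 'b', 'a']
```

    Both categories produce the same groups; the single pass simply avoids the cost of sorting.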

    All of the proposed approaches require specifying a key. For OP's problem, operator.itemgetter(0) or lambda x: x[0] would work. Additionally, to get OP's desired result one should keep only the values of the resulting dict, i.e. list(result.values()), e.g.:

    from operator import itemgetter
    
    
    mylist = [['AB001', 22100],
              ['AB001', 32935],
              ['XC013', 99834],
              ['VD126', 18884],
              ['AB001', 4439],
              ['XC013', 86701]]
    
    
    print(list(regroup_by_di(mylist, key=itemgetter(0)).values()))
    # [[['AB001', 22100], ['AB001', 32935], ['AB001', 4439]], [['XC013', 99834], ['XC013', 86701]], [['VD126', 18884]]]
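
    One practical difference worth noting is the order of the groups. The sketch below re-implements the two categories inline (so it runs on its own) and assumes Python 3.7+, where dicts preserve insertion order: the dict-based pass keeps keys in first-seen order, while the sort-based variant returns them in sorted order.

```python
import itertools
from operator import itemgetter

mylist = [['AB001', 22100], ['AB001', 32935], ['XC013', 99834],
          ['VD126', 18884], ['AB001', 4439], ['XC013', 86701]]
first = itemgetter(0)  # behaves like lambda x: x[0]

# Dict-based pass: keys appear in the order they are first seen.
by_dict = {}
for item in mylist:
    by_dict.setdefault(first(item), []).append(item)

# Sort-based variant: keys appear in sorted order.
by_sort = {k: list(g)
           for k, g in itertools.groupby(sorted(mylist, key=first), first)}

print(list(by_dict))     # ['AB001', 'XC013', 'VD126']
print(list(by_sort))     # ['AB001', 'VD126', 'XC013']
print(by_sort['AB001'])  # [['AB001', 22100], ['AB001', 32935], ['AB001', 4439]]
```

    Because sorted() is stable, the items inside each group keep their original relative order in both cases.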
    

    In the timings, all dict-based (first category) solutions come out faster than all groupby-based (second category) solutions. Among the dict-based solutions, performance depends slightly on the "collision rate", which is proportional to the number of times an item creates a new group (i.e. a new list object). For higher collision rates regroup_by_di() may be the fastest, while for lower collision rates regroup_by_dd() may be the fastest.

    The benchmarks come out as follows:

    • 0.1% collision rate (approx. 1000 elements per group)

    • 10% collision rate (approx. 10 elements per group)

    • 50% collision rate (approx. 2 elements per group)

    • 100% collision rate (approx. 1 element per group)
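
    For readers who want to try a comparison of this kind themselves, here is a rough sketch. It is not the original benchmark code: `make_items` is a hypothetical helper, and it reads "collision rate" as the fraction of items that start a new group, matching the group sizes listed above. Absolute timings will vary by machine and Python version.

```python
import collections
import random
import timeit


def make_items(n, collision_rate, seed=0):
    """Hypothetical helper: n [key, value] pairs with about
    n * collision_rate distinct keys, in shuffled order."""
    n_groups = max(1, round(n * collision_rate))
    keys = [i % n_groups for i in range(n)]
    random.Random(seed).shuffle(keys)
    return [[k, i] for i, k in enumerate(keys)]


def group_plain_dict(items):
    # Same idea as regroup_by_di(), with the key fixed to item[0].
    result = {}
    for item in items:
        k = item[0]
        if k not in result:
            result[k] = []
        result[k].append(item)
    return result


def group_defaultdict(items):
    # Same idea as regroup_by_dd(), with the key fixed to item[0].
    result = collections.defaultdict(list)
    for item in items:
        result[item[0]].append(item)
    return dict(result)


for rate in (0.001, 0.1, 0.5, 1.0):
    items = make_items(10_000, rate)
    # Both variants must agree before their timings are compared.
    assert group_plain_dict(items) == group_defaultdict(items)
    t_plain = timeit.timeit(lambda: group_plain_dict(items), number=20)
    t_dd = timeit.timeit(lambda: group_defaultdict(items), number=20)
    print(f"rate={rate:<6} plain dict: {t_plain:.4f}s  "
          f"defaultdict: {t_dd:.4f}s")
```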

    (More details available here.)
