Sorting sub-lists into new sub-lists based on common first items

前端 未结 4 1816
面向向阳花
面向向阳花 2020-12-10 19:49

I have a large number of two-membered sub-lists that are members of a list called mylist:

mylist = [[\'AB001\', 22100],
          [\'AB001\', 32         


        
相关标签:
4条回答
  • 2020-12-10 20:36

    There are a number of alternatives to solve this problem:

    def regroup_by_di(items, key=None):
        result = {}
        callable_key = callable(key)
        for item in items:
            key_value = key(item) if callable_key else item
            if key_value not in result:
                result[key_value] = []
            result[key_value].append(item)
        return result
    
    import collections
    
    
    def regroup_by_dd(items, key=None):
        result = collections.defaultdict(list)
        callable_key = callable(key)
        for item in items:
            result[key(item) if callable_key else item].append(item)
        return dict(result)  # to be in line with other solutions
    
    def regroup_by_sd(items, key=None):
        result = {}
        callable_key = callable(key)
        for item in items:
            key_value = key(item) if callable_key else item
            result.setdefault(key_value, []).append(item)
        return result
    
    import itertools
    
    
    def regroup_by_it(items, key=None):
        seq = sorted(items, key=key)
        result = {
            key_value: list(group)
            for key_value, group in itertools.groupby(seq, key)}
        return result
    
    def group_by(
            seq,
            key=None):
        items = iter(seq)
        try:
            item = next(items)
        except StopIteration:
            return
        else:
            callable_key = callable(key)
            last = key(item) if callable_key else item
            i = j = 0
            for i, item in enumerate(items, 1):
                current = key(item) if callable_key else item
                if last != current:
                    yield last, seq[j:i]
                    last = current
                    j = i
            if i >= j:
                yield last, seq[j:i + 1]
    
    
    def regroup_by_gb(items, key=None):
        return dict(group_by(sorted(items, key=key), key))
    

    These can be divided into two categories:

    1. loop through the input creating a dict-like structure (regroup_by_di(), regroup_by_dd(), regroup_by_sd())
    2. sorting the input and then use a uniq-like function (e.g. itertools.groupby()) (regroup_by_it(), regroup_by_gb())

    The first class of approaches has O(n) computational complexity, while the second class of approaches has O(n log n).

    All of the proposed approach require specifying a key. For OP's problem, operators.itemgetter(0) or lambda x: x[0] would work. Additionally, to get OP's desired results one should get only the list(dict.values()), e.g.:

    from operator import itemgetter
    
    
    mylist = [['AB001', 22100],
              ['AB001', 32935],
              ['XC013', 99834],
              ['VD126', 18884],
              ['AB001', 4439],
              ['XC013', 86701]]
    
    
    print(list(regroup_by_di(mylist, key=itemgetter(0)).values()))
    # [[['AB001', 22100], ['AB001', 32935], ['AB001', 4439]], [['XC013', 99834], ['XC013', 86701]], [['VD126', 18884]]]
    

    The timings come out as faster for all dict-based (1st class) solutions and slower for all groupby-based (2nd class) solutions. Within the dict-based solutions, their performances will depend slightly on the "collision rate", which is proportional to the number of times a new item will create a new object. For higher collision rates, the regroup_by_di() may be the fastest, while for lower collision rates the regroup_by_dd() may be the fastest.

    The benchmarks come out as follow:

    • 0.1% collision rate (approx. 1000 elements per group)

    • 10% collision rate (approx. 10 elements per group)

    • 50% collision rate (approx. 2 elements per group)

    • 100% collision rate (approx. 1 element per group)

    (More details available here.)

    0 讨论(0)
  • 2020-12-10 20:36

    Without importing any packages:

    • Build a dictionary and then get the values to a list
      • Use .get to determine if a key exists, and return some specified value, None in this case, if the key is nonexistent.
      • dict.get defaults to None, so this method never raises a KeyError.
        • If None is a value in the dictionary, then change the default value returned by .get.
          • test.get(t[0], 'something here')
    • Because: sub-lists based on common first items.
      • Add index 0 as the key, and then add the list, t, as the dict value.
    test = dict()
    for t in mylist:
        if test.get(t[0]) == None:
            test[t[0]] = [t]
        else:
            test[t[0]].append(t)
            
    final = list(test.values())
    
    # print final results in
    
    [[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
     [['XC013', 99834], ['XC013', 86701]],
     [['VD126', 18884]]]
    
    0 讨论(0)
  • 2020-12-10 20:53

    You were right about this being a job for groupby:

    from itertools import groupby
    from operator import itemgetter
    
    mylist = [['AB001', 22100],
              ['AB001', 32935],
              ['XC013', 99834],
              ['VD126', 18884],
              ['AB001', 4439],
              ['XC013', 86701]]
    
    print([list(value) for key, value in groupby(sorted(mylist), key=itemgetter(0))])
    

    This will give you a list-of-lists, grouped by the first item in the sublist.

    [[['AB001', 4439], ['AB001', 22100], ['AB001', 32935]], 
     [['VD126', 18884]], 
     [['XC013', 86701], ['XC013', 99834]]]
    
    0 讨论(0)
  • 2020-12-10 20:53

    collections.defaultdict

    An itertools.groupby solution will incur O(n log n) cost since the input must be sorted first. You can a use defaultdict of lists for a guaranteed O(n) solution:

    from collections import defaultdict
    
    dd = defaultdict(list)
    for item in mylist:
        dd[item[0]].append(item)
    
    res = list(dd.values())
    
    print(res)
    
    [[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
     [['XC013', 99834], ['XC013', 86701]],
     [['VD126', 18884]]]
    
    0 讨论(0)
提交回复
热议问题