I have a large number of two-membered sub-lists that are members of a list called mylist:
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]
There are a number of alternatives to solve this problem:
def regroup_by_di(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        if key_value not in result:
            result[key_value] = []
        result[key_value].append(item)
    return result
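The membership test followed by an append is the classic "group into a plain dict" pattern; a minimal sketch on hypothetical data (not from the question):

```python
# Grouping with a plain dict: each key gets a fresh list on first sight.
result = {}
for item in ['a', 'b', 'a']:
    if item not in result:  # first occurrence of this key
        result[item] = []
    result[item].append(item)
print(result)  # {'a': ['a', 'a'], 'b': ['b']}
```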
import collections

def regroup_by_dd(items, key=None):
    result = collections.defaultdict(list)
    callable_key = callable(key)
    for item in items:
        result[key(item) if callable_key else item].append(item)
    return dict(result)  # to be in line with the other solutions
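This works because collections.defaultdict(list) constructs the empty list on first access to a missing key, which is what lets the loop body be a single line. A quick illustration (hypothetical data):

```python
import collections

# defaultdict(list) builds the default value on first access to a missing key.
d = collections.defaultdict(list)
d['k'].append(1)  # 'k' missing: list() is called, [] inserted, then appended
d['k'].append(2)  # 'k' present: the existing list is reused
print(dict(d))  # {'k': [1, 2]}
```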
def regroup_by_sd(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        result.setdefault(key_value, []).append(item)
    return result
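For reference, dict.setdefault() returns the existing value when the key is present and otherwise inserts the default and returns it, which is what lets regroup_by_sd() combine the lookup and the append (hypothetical data):

```python
# setdefault() returns the stored value if the key exists,
# otherwise inserts the default and returns that.
d = {}
d.setdefault('k', []).append(1)  # inserts [] under 'k', then appends
d.setdefault('k', []).append(2)  # returns the same list, appends again
print(d)  # {'k': [1, 2]}
```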
import itertools

def regroup_by_it(items, key=None):
    seq = sorted(items, key=key)
    result = {
        key_value: list(group)
        for key_value, group in itertools.groupby(seq, key)}
    return result
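A caveat worth noting: itertools.groupby() only merges runs of *adjacent* equal keys, which is why regroup_by_it() sorts first. A minimal sketch on hypothetical data:

```python
import itertools

data = ['a', 'b', 'a']
# Without sorting, the two 'a' items land in separate groups.
unsorted = [(k, list(g)) for k, g in itertools.groupby(data)]
# After sorting, equal keys are adjacent and merge into one group.
grouped = [(k, list(g)) for k, g in itertools.groupby(sorted(data))]
print(unsorted)  # [('a', ['a']), ('b', ['b']), ('a', ['a'])]
print(grouped)   # [('a', ['a', 'a']), ('b', ['b'])]
```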
def group_by(seq, key=None):
    items = iter(seq)
    try:
        item = next(items)
    except StopIteration:
        return
    else:
        callable_key = callable(key)
        last = key(item) if callable_key else item
        i = j = 0
        for i, item in enumerate(items, 1):
            current = key(item) if callable_key else item
            if last != current:
                yield last, seq[j:i]
                last = current
                j = i
        if i >= j:
            yield last, seq[j:i + 1]

def regroup_by_gb(items, key=None):
    return dict(group_by(sorted(items, key=key), key))
These can be divided into two categories:

- those based on a dict-like structure: regroup_by_di(), regroup_by_dd(), regroup_by_sd()
- those based on sorting followed by grouping: regroup_by_it(), regroup_by_gb()

The first class of approaches has O(n) computational complexity, while the second class has O(n log n), dominated by the sorting step.
All of the proposed approaches require specifying a key.
For OP's problem, operator.itemgetter(0) or lambda x: x[0] would work. Additionally, to get OP's desired result one should keep only the values of the resulting dict, i.e. list(result.values()), e.g.:
from operator import itemgetter
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]
print(list(regroup_by_di(mylist, key=itemgetter(0)).values()))
# [[['AB001', 22100], ['AB001', 32935], ['AB001', 4439]], [['XC013', 99834], ['XC013', 86701]], [['VD126', 18884]]]
In the timings, the dict-based (first class) solutions consistently come out faster than the groupby-based (second class) solutions.
Within the dict-based solutions, performance depends slightly on the "collision rate", i.e. how often an incoming item finds its key already present in the dict (so that no new group has to be created).
For higher collision rates, regroup_by_di() may be the fastest, while for lower collision rates regroup_by_dd() may be the fastest.
The benchmarks come out as follows:
(More details available here.)
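A minimal sketch of how such timings can be reproduced with timeit; the input size, number of distinct keys, and repetition count below are illustrative choices, not the ones used for the benchmarks above:

```python
import collections
import timeit
from operator import itemgetter


def regroup_by_di(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        if key_value not in result:
            result[key_value] = []
        result[key_value].append(item)
    return result


def regroup_by_dd(items, key=None):
    result = collections.defaultdict(list)
    callable_key = callable(key)
    for item in items:
        result[key(item) if callable_key else item].append(item)
    return dict(result)


# Illustrative input: 10_000 two-item sublists with 100 distinct keys.
data = [[f'K{i % 100:03d}', i] for i in range(10_000)]
key = itemgetter(0)

for func in (regroup_by_di, regroup_by_dd):
    t = timeit.timeit(lambda: func(data, key=key), number=10)
    print(f'{func.__name__}: {t:.4f}s')
```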