Replace list of list with “condensed” list of list while maintaining order


I have a list of lists, as in the code I attached. I want to link sublists that share any common values, and then replace the list of lists with a condensed list.

7 Answers
  •  無奈伤痛
    2020-11-30 08:11

    I tried to write a fast and readable solution. As far as I know, it is never much slower than the other solutions, and it can sometimes be much faster, because it is additionally optimized for longer sublists and for many sublists that are subsets of an already existing group. (This is motivated by the text of the question: "I have a lot of list but not very many different lists.") The code uses memory only for the condensed data, which can be much smaller than the original data. It can therefore work, for example, with a generator collecting data from a real-time process. The estimated complexity is O(n log n). I think that no algorithm that uses sorting can be of linear complexity.

    def condense(lists):
        groups = {}         # items divided into groups {id(the_set): the_set,...}
        members = {}        # mapping from item to group
        positions = {}      # mapping from item to sequential ordering
        iposition = 0       # counter for positions
        for sublist in lists:
            if not sublist or members.get(sublist[0], set()).issuperset(sublist):
                continue    # speed-up condition if all is from one group
            common = {x for x in sublist if x in members}
            if common:      # any common group can be a destination for merge
                dst_group = members[common.pop()]
                common = common - dst_group   # are more groups common for sublist?
                while common:
                    grp = members[common.pop()]
                    if len(grp) > len(dst_group):   # merge shorter into longer grp
                        grp, dst_group = dst_group, grp
                    dst_group.update(grp)
                    for item in grp:
                        members[item] = dst_group
                    del groups[id(grp)]
                    common = common - dst_group
            else:           # or create a new group if nothing common
                dst_group = set()
                groups[id(dst_group)] = dst_group
            newitems = []
            for item in sublist:    # merge also new items
                if item not in positions:
                    positions[item] = iposition
                    iposition += 1
                    newitems.append(item)
                    members[item] = dst_group
            dst_group.update(newitems)
        return [sorted(x, key=positions.get) for x in groups.values()]
    

    It is faster than pillmuncher2 for sublists longer than approximately 8 items, because it can work on more items together. It is also very fast for lists with many similar sublists or many sublists that are subsets of an already existing group. It is 25% faster than pillmuncher2 for lst_OP, but 15% slower for lst_BK.

    An example of test data with long sublists is [list(range(30)) + [-i] for i in range(100)].
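
    To make the shape of this test data concrete, here is a minimal sketch that builds it and condenses it with a small union-find helper. The helper `condense_reference` is hypothetical and is not the algorithm from this answer; it is only an independent way to check that all 100 sublists collapse into a single group, because every sublist shares the items 0..29.

```python
def condense_reference(lists):
    """Hypothetical reference helper (union-find), not the answer's algorithm."""
    parent = {}      # item -> parent item in the union-find forest
    positions = {}   # item -> index of first appearance, used for ordering

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for sublist in lists:
        for item in sublist:
            if item not in parent:
                parent[item] = item
                positions[item] = len(positions)
        for item in sublist[1:]:
            # Link every item of the sublist to the sublist's first item.
            root_a, root_b = find(sublist[0]), find(item)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for item in positions:   # dicts keep insertion order (Python 3.7+)
        groups.setdefault(find(item), []).append(item)
    return list(groups.values())


long_sublists = [list(range(30)) + [-i] for i in range(100)]
result = condense_reference(long_sublists)
print(len(result), len(result[0]))
```

    The single group holds 129 distinct items rather than 130, because `-i` for `i = 0` is `0`, which is already present in `range(30)`.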

    I intentionally wrote "common = common - dst_group" instead of using the set operator -= or "set.difference_update", because the in-place update is not efficient if the set on the right side is much bigger than the one on the left side.
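
    The reasoning above can be illustrated with a short sketch. It assumes the interpreter behaves as the answer describes, i.e. the in-place form iterates over the right-hand operand while `a - b` can iterate over the smaller left-hand set; newer interpreters may optimize this case, so no timings are claimed here. The names `big` and `small_items` and their sizes are made up for the illustration.

```python
import timeit

big = set(range(100_000))           # stands in for a large dst_group
small_items = [5, 10, 2_000_000]    # stands in for a small `common` set


def with_new_set():
    common = set(small_items)
    common = common - big           # builds a new, small set
    return common


def in_place():
    common = set(small_items)
    common -= big                   # same result, but mutates `common`
    return common


result_new = with_new_set()
result_in_place = in_place()

# Timings depend on the Python version and the machine.
print("new set :", timeit.timeit(with_new_set, number=20))
print("in place:", timeit.timeit(in_place, number=20))
```

    Both forms produce the same set; only the cost may differ, and it is worth measuring on the interpreter actually in use.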


    Below is a modified version of pillmuncher's solution for easier readability. The modification is a little slower than the original because a generator is replaced by list.append. It is probably the most readable fast solution.

    # Modified pillmuncher's solution
    from collections import defaultdict
    
    def condense(lists):
        neighbors = defaultdict(set)  # mapping from items to sublists
        positions = {}                # mapping from items to sequential ordering
        position = 0
        for each in lists:
            for item in each:
                neighbors[item].update(each)
                if item not in positions:
                    positions[item] = position
                    position += 1
        seen = set()
        see = seen.add
        for node in neighbors:
            if node not in seen:
                unseen = {node}           # this is a "todo" set
                next_unseen = unseen.pop  # method alias, not called now
                group = []                # collects the output
                while unseen:
                    node = next_unseen()
                    see(node)
                    unseen |= neighbors[node] - seen
                    group.append(node)
                yield sorted(group, key=positions.get)
    
