Replace list of list with “condensed” list of list while maintaining order

前端 未结 7 1525

I have a list of list as in the code I attached. I want to link each sub list if there are any common values. I then want to replace the list of list with a condensed list

相关标签:
7条回答
  • 2020-11-30 08:04

    Here's a brute-force approach (it might be easier to understand):

    from itertools import chain
    
    def condense(*lists):
        # remember original positions
        positions = {}
        for pos, item in enumerate(chain(*lists)):
            if item not in positions:
                positions[item] = pos
    
        # condense disregarding order
        sets = condense_sets(map(set, lists))
    
        # restore order
        result = [sorted(s, key=positions.get) for s in sets]
        return result if len(result) != 1 else result[0]
    
    def condense_sets(sets):
        result = []
        for candidate in sets:
            for current in result:
                if candidate & current:   # found overlap
                    current |= candidate  # combine (merge sets)
    
                    # new items from candidate may create an overlap
                    # between current set and the remaining result sets
                    result = condense_sets(result) # merge such sets
                    break
            else:  # no common elements found (or result is empty)
                result.append(candidate)
        return result
    

    Example

    >>> condense([1,2,3], [10,5], [3,8,5])
    [1, 2, 3, 10, 5, 8]
    >>> a = [1,2,3]
    >>> b = [3,4]
    >>> i = [21,22]
    >>> c = [88,7,8]
    >>> e = [5,4]
    >>> d = [3, 50]
    >>> f = [8,9]
    >>> g=  [9,10]
    >>> h = [20,21]
    >>> condense(*[a,b,c,i,e,d,f,g,h,a,c,i]*1000)
    [[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
    >>> condense([1], [2, 3, 2])
    [[1], [2, 3]]
    

    Performance comparison

    condense_*() functions are from the answers to this question. lst_OP input list from the question (different size), lst_BK - the test list from @Blckknght's answer (different size). See the source.

    Measurements show that solutions based on "disjoint sets" and "connected components of undirected graph" concepts perform similar on both types of input.

    name                       time ratio comment
    condense_hynekcer     5.79 msec  1.00 lst_OP
    condense_hynekcer2     7.4 msec  1.28 lst_OP
    condense_pillmuncher2 11.5 msec  1.99 lst_OP
    condense_blckknght    16.8 msec  2.91 lst_OP
    condense_jfs            26 msec  4.49 lst_OP
    condense_BK           30.5 msec  5.28 lst_OP
    condense_blckknght2   30.9 msec  5.34 lst_OP
    condense_blckknght3   32.5 msec  5.62 lst_OP
    
    
    name                       time  ratio comment
    condense_blckknght     964 usec   1.00 lst_BK
    condense_blckknght2   1.41 msec   1.47 lst_BK
    condense_blckknght3   1.46 msec   1.51 lst_BK
    condense_hynekcer2    1.95 msec   2.03 lst_BK
    condense_pillmuncher2  2.1 msec   2.18 lst_BK
    condense_hynekcer     2.12 msec   2.20 lst_BK
    condense_BK           3.39 msec   3.52 lst_BK
    condense_jfs           544 msec 563.66 lst_BK
    
    
    name                       time ratio comment
    condense_hynekcer     6.86 msec  1.00 lst_rnd
    condense_jfs          16.8 msec  2.44 lst_rnd
    condense_blckknght    28.6 msec  4.16 lst_rnd
    condense_blckknght2   56.1 msec  8.18 lst_rnd
    condense_blckknght3   56.3 msec  8.21 lst_rnd
    condense_BK           70.2 msec 10.23 lst_rnd
    condense_pillmuncher2  324 msec 47.22 lst_rnd
    condense_hynekcer2     334 msec 48.61 lst_rnd
    

    To reproduce results clone gist and run measure-performance-condense-lists.py

    0 讨论(0)
  • 2020-11-30 08:09
    class List(list): pass
    
    rv = dict()
    
    def condense_step():
        """find and merge one overlapping pair in rv"""
        for i, iv in rv.items():
            for j, jv in rv.items():
                if i != j and i.intersection(j):
                    m = i.union(j)
                    del rv[i]
                    del rv[j]
                    rv.setdefault(m, [])
                    rv[m] += iv
                    rv[m] += jv
                    return True
    
    def unique(l):
        """flatten list-of-lists excluding duplicates"""
        seen = set()
        for i in sum(l, []):
            if i not in seen:
                seen.add(i)
                yield i
    
    def condense(inp):
        rv.clear()
        inp = map(List, inp)
    
        for i in range(len(inp)):
            inp[i].order = i
            rv.setdefault(frozenset(inp[i]), [])
            rv[frozenset(inp[i])].append(inp[i])
    
        while condense_step():
            pass
    
        for v in rv.values():
            v.sort(key=lambda x: x.order)
    
        return [list(unique(i)) for i in rv.values()]
    
    0 讨论(0)
  • 2020-11-30 08:11

    I tried to write a fast and readable solution. It is never much slower than other solutions, if I know, but can be sometimes much faster because it is additionally optimized for longer sublist or for many sublists that are subset of any yet existing group. (This is motivated by text of the question "I have a lot of list but not very many different lists.") The code uses less memory only for condensed data that can be much less than the original data. It can work e.g. with a generator collecting data from a realtime process. The estimate of complexity is O(n log n). I think that no algorithm that uses sorting can be of linear complexity.

    def condense(lists):
        groups = {}         # items divided into groups {id(the_set): the_set,...}
        members = {}        # mapping from item to group
        positions = {}      # mapping from item to sequential ordering
        iposition = 0       # counter for positions
        for sublist in lists:
            if not sublist or members.get(sublist[0], set()).issuperset(sublist):
                continue    # speed-up condition if all is from one group
            common = set([x for x in sublist if x in members])
            if common:      # any common group can be a destination for merge
                dst_group = members[common.pop()]
                common = common - dst_group   # are more groups common for sublist?
                while common:
                    grp = members[common.pop()]
                    if len(grp) > len(dst_group):   # merge shorter into longer grp
                        grp, dst_group = dst_group, grp
                    dst_group.update(grp)
                    for item in grp:
                        members[item] = dst_group
                    del groups[id(grp)]
                    common = common - dst_group
            else:           # or create a new group if nothing common
                dst_group = set()
                groups[id(dst_group)] = dst_group
            newitems = []
            for item in sublist:    # merge also new items
                if item not in positions:
                    positions[item] = iposition
                    iposition += 1
                    newitems.append(item)
                    members[item] = dst_group
            dst_group.update(newitems)
        return [sorted(x, key=positions.get) for x in groups.values()]
    

    It is faster than pillmuncher2 for subslists longer than approximately 8 items because it can work on more items together. It is also very fast for lists with many similar sublists or many sublists that are subset of any yet existing group. It is faster by 25% over pillmuncher2 for lst_OP, however slower by 15% for lst_BK.

    An example of test data with long sublists is [list(range(30)) + [-i] for i in range(100)].

    I intentionally wrote "common = common - dst_group" instead of using the set operator -= or "set.difference_update", because the updade in-place is not effective if the set on the right side is much bigger then on the left side.


    Modified pillmuncher's solution for easier readability. The modification is a little slower than the original due to replacing a generator by list.append. Probably the most readable fast solution.

    # Modified pillmuncher's solution
    from collections import defaultdict
    
    def condense(lists):
        neighbors = defaultdict(set)  # mapping from items to sublists
        positions = {}                # mapping from items to sequential ordering
        position = 0
        for each in lists:
            for item in each:
                neighbors[item].update(each)
                if item not in positions:
                    positions[item] = position
                    position += 1
        seen = set()
        see = seen.add
        for node in neighbors:
            if node not in seen:
                unseen = set([node])      # this is a "todo" set
                next_unseen = unseen.pop  # method alias, not called now
                group = []                # collects the output
                while unseen:
                    node = next_unseen()
                    see(node)
                    unseen |= neighbors[node] - seen
                    group.append(node)
                yield sorted(group, key=positions.get)
    
    0 讨论(0)
  • 2020-11-30 08:19

    Here's my approach. It uses the concept of a "disjoint set" to first identify which sublists overlap with each other, then it joins them together, eliminating duplicates.

    from collections import OrderedDict
    def disjoint_set_find(djs, node): # disjoint set find, with path compression
        if node not in djs: # base case, node is a root of a set
            return node
        djs[node] = disjoint_set_find(djs, djs[node]) # recurse, caching results
        return djs[node]
    
    def disjoint_set_union(djs, first, second):
        first = disjoint_set_find(djs, first)   # find root of first set
        second = disjoint_set_find(djs, second) # and of second set
        if first < second: # make smaller root the root of the new combined set
            djs[second] = first
        elif second < first:
            djs[first] = second
        # deliberately ignore the case where first == second (same set)
    
    def condenseBK(*master_list):
        values = OrderedDict() # maps values to the first sublist containing them
        overlaps = {}  # maps sublist indexes to each other to form a disjoint set
        for i, sublist  in enumerate(master_list):
            for v in sublist:
                if v not in values: # no overlap, so just store value
                    values[v] = i
                else: # overlap detected, do a disjoint set union
                    disjoint_set_union(overlaps, values[v], i)
        output = [] # results
        output_map = {} # map from original indexes to output indexes
        for v, i, in values.items(): # iterate over values in order
            root = disjoint_set_find(overlaps, i)
            if root not in output_map:
                output_map[i] = len(output)
                output.append([]) # create new output sublists as necessary
            output[output_map[root]].append(v)
        return output
    

    Sample output:

    >>> a = [1,2,3]
    >>> b = [3,4]
    >>> c = [88,7,8]
    >>> d = [3, 50]
    >>> e = [5,4]
    >>> f = [8,9]
    >>> g = [9,10]
    >>> h = [20,21]
    >>> i = [21,22]
    >>> lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
    >>> condenseBK(*lst)
    [[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
    

    An explanation of the algorithm:

    By request, here's an explanation for how my code works.

    The first two functions implement the find and union operations of a disjoint set. The data structure is implemented with a dictionary mapping nodes to their parent nodes. Any node that is not a key of the dictionary is the root of a set. The find operation returns the root node of the set containing a given node. To help performance a bit, I've implemented "path compression", which reduces the number of recursive steps needed over time. The union operation combines the sets containing its arguments first and second.

    The main condense function has two parts. First, it sets up a couple of data structures, then it uses them to build the output.

    values is an OrderedDictionary that maps from each value to the index of the first sublist it is contained in. The order each value is added is used to produce the output in the correct order.

    overlaps is the dictionary used as for the disjoint set. Its nodes are the integer indexes of overlapping sublists.

    The first loops fill up those two data structures. They loop over the sublists and their contents. If a value has not been seen before, it is added to the values dictionary. If it has been seen, the current sublist is overlapping a previous sublist containing that value.

    To resolve the overlap, the code does a union of the disjoint sets that contain the two sublists.

    The output is built in the output list. Because there are likely to be fewer output sublists than there were in the input, we need an additional dictionary to map between the old indexes (from the input) to the new indexes that apply to the output list.

    To fill up the output list, we iterate over the values, which happens in the order they were added thanks to using the OrderedDict class. Using the disjoint set, it can combine the overlapping lists correctly.

    This algorithm has very good performance when there are a lot of lists to be processed that don't overlap immediately. For instance, this set of 200 three-element lists ultimately all overlaps, but you only start seeing the overlaps appear after the first 100 have been processed:

    lst2 = [list(range(4*i,   4*(i+1)-1)) for i in range(100)] + \
           [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]
    
    0 讨论(0)
  • 2020-11-30 08:22

    This solution uses only an Ordered Dictionary.
    deepcopy() is necessary if one wants the original copy to remain unchanged.

    from collections import OrderedDict
    from copy import deepcopy
    
    def treat(passed_list):
        L = deepcopy(passed_list)
    
        dic = OrderedDict()
        for subl in L:
            for x in subl:
                if x not in dic:
                    dic[x] =  subl
        print 'dic at start'
        print '\n'.join('%-3s : %s' % (a,dic[a])
                        for a in dic) + '\n'
    
        for sublist in L:
            short = []
            short.extend(el for el in sublist
                         if el not in short)
            seen  = []
            for k,val in dic.iteritems():
                if val is sublist:
                    break
                if k in short:
                    if val not in seen:
                        seen.append(val)
    
            sumseen = []
            for elseen in seen:
                for y in elseen:
                    sumseen.append(y)
                    dic[y] = sumseen
    
            if seen:
                for el in sublist:
                    if el not in sumseen:
                        sumseen.append(el)
                        dic[el] = sumseen
    
            sublist[:] = short
    
        cumul = []
        cumul.extend(lu for lu in dic.itervalues()
                     if lu not in cumul)
        return cumul
    
    
    plus = [[1,2,3,2,1],[10,5,5,5,10],
            [8,5,3,3,5],[45,50,12,45,40,12]]
    
    lst = [[1,2,3], [10,5], [3,8,5]]
    
    for one_list in (plus,lst):
        print 'one_list before == %r\n' % one_list
        print 'treat(one_list) == %r\n' % treat(one_list)
        print 'one_list after  == %r\n' % one_list
        print '===================================='
    

    result

    one_list before == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]
    
    dic at start
    1   : [1, 2, 3, 2, 1]
    2   : [1, 2, 3, 2, 1]
    3   : [1, 2, 3, 2, 1]
    10  : [10, 5, 5, 5, 10]
    5   : [10, 5, 5, 5, 10]
    8   : [8, 5, 3, 3, 5]
    45  : [45, 50, 12, 45, 40, 12]
    50  : [45, 50, 12, 45, 40, 12]
    12  : [45, 50, 12, 45, 40, 12]
    40  : [45, 50, 12, 45, 40, 12]
    
    treat(one_list) == [[1, 2, 3, 10, 5, 8], [45, 50, 12, 40]]
    
    one_list after  == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]
    
    ====================================
    one_list before == [[1, 2, 3], [10, 5], [3, 8, 5]]
    
    dic at start
    1   : [1, 2, 3]
    2   : [1, 2, 3]
    3   : [1, 2, 3]
    10  : [10, 5]
    5   : [10, 5]
    8   : [3, 8, 5]
    
    treat(one_list) == [[1, 2, 3, 10, 5, 8]]
    
    one_list after  == [[1, 2, 3], [10, 5], [3, 8, 5]]
    
    ====================================
    

    This solution has an inconvenience: it is 2 to 3 times slower than the J.F. Sebastian's solution.

    0 讨论(0)
  • 2020-11-30 08:27

    I am sure there is a cleaner way to do this but I started down a certain path and did what I had to to make it work without any refactoring.

    lookup = {}
    out = []
    index = 0
    for grp in lst:
        keys = [lookup.get(num, None) for num in grp]
        keys = [key for key in keys if key is not None]
        if len(keys):
            if len(set(keys)) != 1:
                for num in grp:
                    out[keys[0]].append(num)
                seen = set()
                keys = [key for key in keys if key not in seen and not seen.add(key)]
                for key in keys[1:]:
                    out[keys[0]].extend(out[key])
                    del out[key]
                seen = set()
                out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
            else:
                for num in grp:
                    lookup[num] = keys[0]
                out[keys[0]].extend(grp)
                seen = set()
                out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
        else:
            out.append(grp)
            for num in grp:
                lookup[num] = index
            index += 1
    
    print out
    

    Thanks to @Steven for the list reduction technique with the set.

    OUTPUT

    [[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
    
    0 讨论(0)
提交回复
热议问题