Replace list of list with “condensed” list of list while maintaining order


不要未来只要你来 2020-11-30 07:41

I have a list of lists, as in the code I attached. I want to link the sublists that share any common values, and then replace the list of lists with a condensed list of lists.

7 answers

孤街浪徒 (original poster)

2020-11-30 08:19
Here's my approach. It uses the concept of a "disjoint set" to first identify which sublists overlap with each other, then it joins them together, eliminating duplicates.
```
from collections import OrderedDict
def disjoint_set_find(djs, node): # disjoint set find, with path compression
    if node not in djs: # base case, node is a root of a set
        return node
    djs[node] = disjoint_set_find(djs, djs[node]) # recurse, caching results
    return djs[node]

def disjoint_set_union(djs, first, second):
    first = disjoint_set_find(djs, first)   # find root of first set
    second = disjoint_set_find(djs, second) # and of second set
    if first < second: # make smaller root the root of the new combined set
        djs[second] = first
    elif second < first:
        djs[first] = second
    # deliberately ignore the case where first == second (same set)

def condenseBK(*master_list):
    values = OrderedDict() # maps values to the first sublist containing them
    overlaps = {}  # maps sublist indexes to each other to form a disjoint set
    for i, sublist in enumerate(master_list):
        for v in sublist:
            if v not in values: # no overlap, so just store value
                values[v] = i
            else: # overlap detected, do a disjoint set union
                disjoint_set_union(overlaps, values[v], i)
    output = [] # results
    output_map = {} # map from original indexes to output indexes
    for v, i in values.items(): # iterate over values in order
        root = disjoint_set_find(overlaps, i)
        if root not in output_map:
            output_map[root] = len(output) # key by root, matching the lookup below
            output.append([]) # create new output sublists as necessary
        output[output_map[root]].append(v)
    return output
```
Sample output:
```
>>> a = [1,2,3]
>>> b = [3,4]
>>> c = [88,7,8]
>>> d = [3, 50]
>>> e = [5,4]
>>> f = [8,9]
>>> g = [9,10]
>>> h = [20,21]
>>> i = [21,22]
>>> lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
>>> condenseBK(*lst)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
```
An explanation of the algorithm:

By request, here's an explanation for how my code works.

The first two functions implement the find and union operations of a disjoint set. The data structure is implemented with a dictionary mapping nodes to their parent nodes. Any node that is not a key of the dictionary is the root of a set. The find operation returns the root node of the set containing a given node. To help performance a bit, I've implemented "path compression", which reduces the number of recursive steps needed over time. The union operation combines the sets containing its arguments first and second.
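To make the dictionary representation concrete, here is a small sketch that repeats the two functions (so the snippet runs on its own) and applies them to a few integer nodes. The node numbers and the `parents` name are arbitrary illustrations, not part of the original code.

```python
def disjoint_set_find(djs, node):
    """Return the root of the set containing node, compressing the path."""
    if node not in djs:  # a node with no parent entry is a root
        return node
    djs[node] = disjoint_set_find(djs, djs[node])  # point directly at the root
    return djs[node]

def disjoint_set_union(djs, first, second):
    """Merge the sets containing first and second; the smaller root wins."""
    first = disjoint_set_find(djs, first)
    second = disjoint_set_find(djs, second)
    if first < second:
        djs[second] = first
    elif second < first:
        djs[first] = second

parents = {}                       # empty dict: every node is its own root
disjoint_set_union(parents, 3, 4)  # 4 now points at 3
disjoint_set_union(parents, 1, 3)  # 3 now points at 1
print(parents)                     # {4: 3, 3: 1}

root = disjoint_set_find(parents, 4)  # walks 4 -> 3 -> 1, then compresses
print(root)                           # 1
print(parents)                        # {4: 1, 3: 1}
print(disjoint_set_find(parents, 0))  # 0 (never unioned, so it is a root)
```

After the `find` call, node 4 points straight at the root 1, so later lookups for it take a single step.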

The main function, condenseBK, has two parts. First, it sets up a couple of data structures, then it uses them to build the output.

values is an OrderedDict that maps each value to the index of the first sublist that contains it. The order in which values are added determines the order of the output.

overlaps is the dictionary used for the disjoint set. Its nodes are the integer indexes of overlapping sublists.

The first loops fill up those two data structures. They loop over the sublists and their contents. If a value has not been seen before, it is added to the values dictionary. If it has been seen, the current sublist is overlapping a previous sublist containing that value.

To resolve the overlap, the code does a union of the disjoint sets that contain the two sublists.

The output is built in the output list. Because there are likely to be fewer output sublists than there were in the input, we need an additional dictionary to map from the old indexes (from the input) to the new indexes that apply to the output list.

To fill up the output list, we iterate over the values, which happens in the order they were added thanks to the OrderedDict class. For each value, the disjoint set gives the root of the set containing its sublist, so values from overlapping sublists all land in the same output sublist.
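The two passes can be traced on a small input. The sketch below is a variant written for illustration only: the name `condense_trace` and the short helper names are hypothetical, it returns the intermediate dictionaries alongside the result, and it keys `output_map` by the root index.

```python
from collections import OrderedDict

def find(djs, node):
    # root lookup with path compression, as in the answer's code
    if node not in djs:
        return node
    djs[node] = find(djs, djs[node])
    return djs[node]

def union(djs, first, second):
    # merge two sets, keeping the smaller root as the combined root
    first, second = find(djs, first), find(djs, second)
    if first != second:
        djs[max(first, second)] = min(first, second)

def condense_trace(*master_list):
    values = OrderedDict()  # value -> index of first sublist containing it
    overlaps = {}           # disjoint set over sublist indexes
    for i, sublist in enumerate(master_list):
        for v in sublist:
            if v not in values:
                values[v] = i
            else:
                union(overlaps, values[v], i)
    output, output_map = [], {}  # output_map: root index -> output index
    for v, i in values.items():
        root = find(overlaps, i)
        if root not in output_map:
            output_map[root] = len(output)
            output.append([])
        output[output_map[root]].append(v)
    return values, overlaps, output_map, output

values, overlaps, output_map, output = condense_trace(
    [1, 2, 3], [3, 4], [88, 7, 8], [5, 4])
print(dict(values))  # {1: 0, 2: 0, 3: 0, 4: 1, 88: 2, 7: 2, 8: 2, 5: 3}
print(overlaps)      # {1: 0, 3: 0}
print(output_map)    # {0: 0, 2: 1}
print(output)        # [[1, 2, 3, 4, 5], [88, 7, 8]]
```

Sublists 1 and 3 both end up under root 0, so the values first seen in them (4 and 5) are appended to the first output sublist, in the order they were first encountered.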

This algorithm has very good performance when there are a lot of lists to be processed that don't overlap immediately. For instance, these 200 three-element lists ultimately all overlap, but you only start seeing the overlaps appear after the first 100 have been processed:
```
lst2 = [list(range(4*i,   4*(i+1)-1)) for i in range(100)] + \
       [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]
```
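As a quick check that these lists really do collapse into one, here is a compact, self-contained sketch of the same approach. The names `find` and `condense` are my own, the find is written iteratively rather than recursively, and a plain dict stands in for OrderedDict (this assumes Python 3.7 or later, where dicts keep insertion order).

```python
def find(djs, node):
    # iterative root lookup, followed by path compression
    root = node
    while root in djs:
        root = djs[root]
    while node in djs:
        djs[node], node = root, djs[node]
    return root

def condense(lists):
    values, overlaps = {}, {}  # same roles as in the answer's code
    for i, sublist in enumerate(lists):
        for v in sublist:
            if v not in values:
                values[v] = i
            else:
                a, b = find(overlaps, values[v]), find(overlaps, i)
                if a != b:
                    overlaps[max(a, b)] = min(a, b)  # smaller root wins
    output, output_map = [], {}
    for v, i in values.items():
        root = find(overlaps, i)
        if root not in output_map:
            output_map[root] = len(output)
            output.append([])
        output[output_map[root]].append(v)
    return output

lst2 = [list(range(4*i,   4*(i+1)-1)) for i in range(100)] + \
       [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]

result = condense(lst2)
print(len(lst2), len(result), len(result[0]))  # 200 1 401
print(result[0][:6])   # [0, 1, 2, 4, 5, 6]
print(result[0][-3:])  # [395, 399, 400]
```

The single output sublist holds every integer from 0 to 400, ordered by first appearance: the values from the first 100 lists come first, followed by the values that only the second 100 lists introduce.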