I have a list of lists, as in the code I attached. I want to link the sublists that share any common values. I then want to replace the list of lists with a condensed list.
Your problem is essentially a graph-theoretic one: the problem of finding connected components, with an added requirement regarding the order of the elements of each component.
In your program, the set of all lists forms an undirected graph, where each list is a node in the graph. We say two lists are connected directly if they have common elements, and connected indirectly if there exists a third list to which both are connected, either directly or indirectly. For example, given three lists [a, b], [b, c] and [c, d], the lists [a, b] and [b, c] are connected directly, as are [b, c] and [c, d], but [a, b] and [c, d] are connected only indirectly: although they share no elements, both share elements with the same list [b, c].
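As a small illustration of the "connected directly" relation, here is a minimal sketch for the three example lists. The helper name `directly_connected` is made up for this illustration and is not part of the solution that follows.

```python
from itertools import combinations

lists = [["a", "b"], ["b", "c"], ["c", "d"]]

def directly_connected(x, y):
    """Two lists are directly connected if they share at least one element."""
    return not set(x).isdisjoint(y)

# Check every unordered pair of distinct lists.
direct_pairs = [(x, y) for x, y in combinations(lists, 2)
                if directly_connected(x, y)]
print(direct_pairs)
# [(['a', 'b'], ['b', 'c']), (['b', 'c'], ['c', 'd'])]
```

The pair [a, b] and [c, d] is missing from the output because those two lists are linked only through [b, c].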
A group of nodes is a connected component if all nodes in the group are connected (directly or indirectly) and no other node in the graph is connected to any node in the group.
There is a fairly simple algorithm that finds all connected components of an undirected graph in time linear in the number of nodes and edges. Using it, we can define a function that generates the condensed, disjoint lists while keeping the order of their elements:
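
    # Python 3 version (the original used Python 2's imap and print statement).
    from itertools import combinations_with_replacement
    from collections import defaultdict

    def connected_components(edges):
        # Build an adjacency map: node -> set of directly connected nodes.
        neighbors = defaultdict(set)
        for a, b in edges:
            neighbors[a].add(b)
            neighbors[b].add(a)
        seen = set()

        def component(node, neighbors=neighbors, seen=seen, see=seen.add):
            # Traverse everything reachable from node, yielding each node once.
            unseen = set([node])
            next_unseen = unseen.pop
            while unseen:
                node = next_unseen()
                see(node)
                unseen |= neighbors[node] - seen
                yield node

        # Each component generator fills `seen` lazily, so it must be
        # consumed before the next node is examined.
        for node in neighbors:
            if node not in seen:
                yield component(node)

    def condense(lists):
        # Pair every (index, list) with itself and with every later list; the
        # self-pairs make isolated (non-empty) lists show up as nodes too.
        tuples = combinations_with_replacement(enumerate(map(tuple, lists)), 2)
        overlapping = ((a, b) for a, b in tuples
                       if not set(a[1]).isdisjoint(b[1]))
        seen = set()
        see = seen.add
        for component in connected_components(overlapping):
            # Sort by original list index, then keep each item's first occurrence.
            yield [item for each in sorted(component)
                   for item in each[1]
                   if not (item in seen or see(item))]

    print(list(condense([[1, 2, 3], [10, 5], [3, 8, 5], [9]])))
    print(list(condense([[1, 2, 3], [5, 6], [3, 4], [6, 7]])))

Result (on Python 3.7+, where dicts keep insertion order; on older versions the order of the components may differ):

    [[1, 2, 3, 10, 5, 8], [9]]
    [[1, 2, 3, 4], [5, 6, 7]]
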
The time complexity of condense() is quadratic in the number of lists, since every list must be tested against every other list to find out whether they have common elements. It therefore scales poorly to large inputs. Can we improve it somehow? Yes.
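Before turning to the faster approach, here is a rough sketch of where the quadratic cost comes from. It only counts the pairs that `combinations_with_replacement` produces for n lists, which is n(n + 1)/2; the helper name `pair_count` is made up for this illustration.

```python
from itertools import combinations_with_replacement

def pair_count(n):
    """Number of (list, list) pairs examined when condensing n lists pairwise."""
    return sum(1 for _ in combinations_with_replacement(range(n), 2))

for n in (4, 100, 1000):
    print(n, pair_count(n))
# 4 10
# 100 5050
# 1000 500500
```

Each of those pairs also costs a set construction and an overlap test, so the count is a lower bound on the work done.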
Two lists are connected directly if they have common elements. We can turn this relationship around: two elements are connected directly if they belong to the same list, and connected indirectly if there exists a third element that is connected (directly or indirectly) to both of them. For example, given two lists [a, b] and [b, c], a and b are connected directly, as are b and c, and therefore a and c are connected indirectly. If we also adapt J.F. Sebastian's idea of storing the position of each element's first occurrence, we can implement it like so:
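
    # Relies on `from collections import defaultdict` from the first listing.
    def condense(lists):
        # neighbors: element -> all elements that share a list with it.
        # positions: element -> index of its first occurrence in the input.
        neighbors = defaultdict(set)
        positions = {}
        position = 0
        for each in lists:
            for item in each:
                neighbors[item].update(each)
                if item not in positions:
                    positions[item] = position
                    position += 1
        seen = set()

        def component(node, neighbors=neighbors, seen=seen, see=seen.add):
            # Traverse everything reachable from node, yielding each node once.
            unseen = set([node])
            next_unseen = unseen.pop
            while unseen:
                node = next_unseen()
                see(node)
                unseen |= neighbors[node] - seen
                yield node

        for node in neighbors:
            if node not in seen:
                # Restore input order within each component.
                yield sorted(component(node), key=positions.get)
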
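It still uses the connected components algorithm, but this time we treat the elements as connected, not the lists. The resulting groups are the same as before, but because the work now grows roughly linearly with the total number of elements (as long as the individual lists are short) rather than quadratically with the number of lists, it runs much faster.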
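To make the element-level view more concrete, the following sketch builds the same two lookup tables for the first sample input and inspects them. It is only an illustration: it uses `positions.setdefault(item, len(positions))` as a shorthand for the explicit `position` counter, which is an assumption of this sketch rather than part of the code above.

```python
from collections import defaultdict

lists = [[1, 2, 3], [10, 5], [3, 8, 5], [9]]

neighbors = defaultdict(set)   # element -> elements sharing a list with it
positions = {}                 # element -> index of its first occurrence

for each in lists:
    for item in each:
        neighbors[item].update(each)
        # setdefault stores a value only the first time an item is seen;
        # len(positions) then plays the role of the running position counter.
        positions.setdefault(item, len(positions))

print(sorted(neighbors[3]))   # [1, 2, 3, 5, 8]
print(sorted(neighbors[9]))   # [9]
print(positions)              # {1: 0, 2: 1, 3: 2, 10: 3, 5: 4, 8: 5, 9: 6}

# Ordering one component by first occurrence restores the input order.
component = {1, 2, 3, 5, 8, 10}
ordered = sorted(component, key=positions.get)
print(ordered)                # [1, 2, 3, 10, 5, 8]
```

Element 3 appears in both [1, 2, 3] and [3, 8, 5], so its neighbor set bridges the two lists, while 9 is connected only to itself and ends up in a component of its own.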