Question
I've successfully used the following algorithm to generate all-paths data up to a path length of 10 on graphs of ~900 nodes. However, I want to scale it up to larger graphs, and I'm wondering whether there are further optimizations I can make. So far I have:
- After a node's DFS has completed, its paths are saved to a hash table. Should that node be encountered again, the paths from the hash table are appended so the work is not repeated.
- Nodes are sorted by their degree (highest first), so the nodes most likely to be encountered will already be in the hash table.
Specifically, the algorithm builds a network of chemicals (nodes) and reactions (edges) and enumerates paths so that the network can later be searched much faster for theoretical pathways, which can then be tested experimentally.
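For concreteness, the network is built along these lines (a hypothetical sketch; the chemical names and the reaction attribute are illustrative, not my real data):

    import networkx as nx

    # Chemicals are nodes; each reaction is a directed edge between them
    DG = nx.DiGraph()
    DG.add_edge('glucose', 'glucose-6-phosphate', reaction='hexokinase')
    DG.add_edge('glucose-6-phosphate', 'fructose-6-phosphate', reaction='isomerase')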
The algorithm is based on the networkx all_simple_paths algorithm:
import csv
from operator import itemgetter

def _all_simple_paths_graph(DG, cutoff):
    memorizedPaths = {}
    # Sort nodes into highest -> lowest degree order; degree_iter() yields
    # (node, degree) pairs, which can be unpacked directly rather than
    # parsed back out of their string representation
    degreelist = sorted(DG.degree_iter(), key=itemgetter(1), reverse=True)
    nlist = [node for node, degree in degreelist]
    with open('PythonOutput.txt', "wb") as csvfile:
        writer = csv.writer(csvfile, delimiter=' ', quotechar='|')
        numberdone = 0
        # Start a fresh DFS from each node
        for source in nlist:
            print source
            print numberdone
            numberdone += 1
            uniqueTreePaths = []
            if cutoff < 1:
                return
            visited = [source]
            stack = [iter(DG[source])]
            while stack:
                children = stack[-1]
                child = next(children, None)
                if child is None:
                    # This node's children are exhausted; backtrack
                    stack.pop()
                    visited.pop()
                # If a node has been searched before, splice in its cached paths
                elif child in memorizedPaths:
                    for path in memorizedPaths[child]:
                        newPath = tuple(visited) + tuple(path)
                        # Keep the spliced path only if it fits the cutoff and
                        # shares no nodes with the current prefix
                        if len(newPath) <= cutoff and len(set(visited) & set(path)) == 0:
                            uniqueTreePaths.append(newPath)
                    continue
                elif len(visited) < cutoff:
                    if child not in visited:
                        visited.append(child)
                        stack.append(iter(DG[child]))
                        if visited not in uniqueTreePaths:
                            uniqueTreePaths.append(tuple(visited))
                else:  # len(visited) == cutoff
                    if (visited not in uniqueTreePaths) and (child not in visited):
                        uniqueTreePaths.append(tuple(visited + [child]))
                    stack.pop()
                    visited.pop()
            # Write each node's paths to disk to save RAM
            for path in uniqueTreePaths:
                writer.writerow(path)
            # Add this node's paths to the hash table
            memorizedPaths[source] = uniqueTreePaths
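For reference, I invoke it roughly like this (the toy graph is a stand-in for my real ~900-node network):

    import networkx as nx

    DG = nx.DiGraph()
    DG.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])
    # Writes all simple paths up to length 10 to PythonOutput.txt
    _all_simple_paths_graph(DG, 10)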
If anyone has any suggestions for further optimizing the algorithm, it would be greatly appreciated.
Answer 1:
First of all, measurement is your best friend. If you are not collecting information about how long the algorithm takes, you have no real way of knowing whether a change helps. Your idea to cache your results is clever, but you should check timings with and without the caching to make sure it actually helps.
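For example, wrapping the whole call in a timer gives you a baseline to compare against (a minimal sketch; run it once with your caching enabled and once with the memoizedPaths lookup disabled):

    import time

    start = time.time()
    _all_simple_paths_graph(DG, 10)
    print "run took %.1f seconds" % (time.time() - start)

Running the script under python -m cProfile will also tell you which functions the time is actually being spent in.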
One part of your code in particular where I can see room for improvement is if (visited not in uniqueTreePaths).... You are checking whether a list is contained in a list of lists. I'm not sure what the best way to fix this would be (again, collect timing data from your code), but one possibility would be to represent the paths as strings instead of lists, allowing them to be looked up by hash.
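A variant of that idea: since your code already builds tuples, which are also hashable, a parallel set turns the membership test into an average O(1) hash lookup instead of a linear scan. A sketch (addPath and seenPaths are names I'm making up, not from your code):

    # Dedupe paths with a set of tuples kept alongside the output list
    seenPaths = set()
    uniqueTreePaths = []

    def addPath(visited):
        key = tuple(visited)          # tuples are hashable, lists are not
        if key not in seenPaths:      # O(1) average vs. O(n) list scan
            seenPaths.add(key)
            uniqueTreePaths.append(key)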
Source: https://stackoverflow.com/questions/18624660/optimization-of-an-all-paths-algorithm