I have a collection of 15M (Million) DAGs (directed acyclic graphs - directed hypercubes actually) that I would like to remove isomorphisms from. What is the common algorith
since you mentioned that testing smaller groups of ~300k graphs can be checked for isomorphy I would try to split the 15M graphs into groups of ~300k nodes and run the test for isomorphy on each group
say: each graph Gi := VixEi (Vertices x Edges)
(1) create buckets of graphs such that the n-th bucket contains only graphs with |V|=n
(2) for each bucket created in (1) create subbuckets such that the (n,m)-th subbucket contains only graphs such that |V|=n and |E|=m
(3) if the groups are still too large, sort the nodes within each graph by their degrees (meaning the nr of edges connected to the node), create a vector from it and distribute the graphs by this vector
example for (3):
assume 4 nodes V = {v1, v2, v3, v4}. Let d(v) be v's degree with d(v1)=3, d(v2)=1, d(v3)=5, d(v4)=4, then find < := transitive hull ( { (v2,v1), (v1,v4), (v4,v3) } ) and create a vector depening on the degrees and the order which leaves you with
(1,3,4,5) = (d(v2), d(v1), d(v4), d(v3)) = d( {v2, v1, v4, v3} ) = d(<)
now you have divided the 15M graphs into buckets where each bucket has the following characteristics:
I assume this to be fine grained enough if you are expecting not to find too many isomorphisms
cost so far: O(n) + O(n) + O(n*log(n))
(4) now, you can assume that members inside each bucket are likely to be isomophic. you can run your isomorphic-check on the bucket and only need to compare the currently tested graph against all representants you have already found within this bucket. by assumption there shouldn't be too many, so I assume this to be quite cheap.
at step 4 you also can happily distribute the computation to several compute nodes, which should really speed up the process