问题
Given a list of objects with multiple attributes I need to find the list of sets created by a union of all intersecting subsets.
Specifically these are Person objects, each with many attributes. I need to create a list of 'master' sets based on a handful of unique identifiers such as SSN, DLN, etc.
For instance, if Person A and Person B have the same SSN they create a set i. Then if Person B and C have the same DLN, they create a set ii. Person D and E have the same SSN but it (and all other identifiers) does not match any of the identifiers of Persons A, B or C. After merging all intersecting subsets I would end up with one set with Persons A,B,C and another set with Persons D, E.
Here is the psuedo-code for my solution. I am curious if anyone has already come up with a more efficient way of merging all possible intersecting sets. Keep in mind that the links between sets could be X Persons long (i.e. A matches B by SSN and B matches C by DLN and C matches D by SSN and D matches E by some other identifier would result in Persons A-E in one set). Also assume that the language this will be implemented in supports set operations.
bigSetList = array of all of the uniq Sets
fullyTested = false
while (bigSetList.size() > 1) or (fullyTested is false)
    foreach thisSet in bigSetList  order by size desc
        if count(sets that intersect with thisSet) > 0
            newThisSet = thisSet
            intersectingSets = []
            bigSetList.delete(thisSet)
            foreach testSet in bigSetList
                if thisSet.intersects(testSet)
                    newThisSet.addAll(testSet)
                    intersectingSets.push(testSetID)
                end if
            end
            bigSetList.delete(intersectingSets)
            bigSetList.push(newThisSet)
            bigSetList.sort()
            break
        end if
    end foreach
    fullyTested = true  // have looped through every set in the list and found 0 intersect partners
end
回答1:
To expand on my comment in the original post, you want to create a list of sets where each member of a given set shares at least one attribute with at least one other member of that set.
Naively, this can be solved either by finding all pairs that share an attribute and merging pairs together that have the same partner iteratively. This would be O(N^3) (N^2 for iterating over pairs, and up to N separate sets to determine membership).
You can also think of this problem as determining the connected component of a graph, where every object and every unique attribute value is a node; each object would be connected to each of its attribute values. Setting up that graph would take linear time, and you could determine the connected components in linear time with a breadth or depth first search.
回答2:
I would guess that you have a relatively small set of attributes for the Person object (as compared to the number of Person objects you're considering). If you want to reduce traversing the list of Person objects multiple times, you can take a Person, put its attributes into a list of known possible connections and then move on to the next Person. With each successive Person, you see if it is connected to any prior connection. If so, then you add its unique attributes to the possible connections. You should be able to process all Person objects in one pass. It's possible that you'll have some disconnected sets in the results, so it may be worth examining the unconnected Person objects after you've created the first graph.
回答3:
while (!people.isEmpty()) {
    Person first = people.get(0);
    people.remove(first);
    Set<Person> set = makeSet(first);
    for (Person person : people) {
        for (Person other : set) {
            if (person.isRelatedTo(other)) {
                set.add(person);
                people.remove(person);
            }
        }
    }
    sets.add(set);
}
for (Set<Person> a : sets) {
    for (Set<Person> b : sets.except(a)) {
        for (Person person : a)
            for (Person other : b) {
                if (person.isRelatedTo(other)) {
                    a.addAll(b);
                    b.clear();
                    sets.remove(b);
                    break;
                }
            }
    }
}
回答4:
First, is there some inherent hierarchy in identifiers, and do contradicting identifiers of a higher sort cancel out the same identifier of a lower sort? For example, if A and B have the same SSN, B and C have the same DLN, and C and D have the same SSN which does not match A and B's SSN, does that mean that there are two groups or one?
Assuming contradictions don't matter, you are dealing with equivalence classes, as user 57368 (unknown Google) states.  For equivalence classes, people often turn to the Union-find structure.  As for how to perform these unions, it's not immediately trivial because I assume you don't have the direct link A-B when both A and B have the same SSN.  Instead, our sets will consist of two kinds of elements.  Each (attribute type, attribute value) = attribute pair is an element.  You also have elements corresponding to objects.  When you iterate through the list of attributes for an object, perform the union (object, attribute).  
One of the important features of the Union-find data structure is that the resulting structure represents the set. It lets you query "What set is A in?" If this is not enough, let us know and we can improve the result.
But the most important feature is that the algorithm has something which resembles constant-time behavior for each union and query operation.
回答5:
So your collection example could look like this:
A { ss |-> 42, dl |-> 123 }
B { ss |-> 42, dl |-> 456 }
C { ss |-> 23, dl |-> 456 }
D { ss |-> 89, dl |-> 789 }
E { ss |-> 89, dl |-> 432 }
Then I would suggest to use an algorithm where you build up multi-collections by incrementally merging or inserting each collection into the multi-collections:
Iteration 1. The first collection becomes the only multi-collection:
{A} { ss |-> [42], dl |-> [123] }
Iteration 2. Merge the next collection into the first since SSN is already present:
{A,B} { ss |-> [42], dl |-> [123,456] }
Iteration 3. Merge again, since the DLN is already there:
{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
Iteration 4. Insert a new multi-collection since there is no match:
{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
{D}     { ss |-> [89],    dl |-> [789]     }
Iteration 5. Merge with second multi collection, since the SSN is there:
{A,B,C} { ss |-> [23,42], dl |-> [123,456] }
{D,E}   { ss |-> [89],    dl |-> [432,789] }
So in each iteration (one for each collection), you must identify all multi-collections that have values in common with the collection you are processing, and merge all these together.
In general, if there are n collections each with a constant k number of attributes, then this algorithm will run in time O(nnk) = O(n2). The worst-case behaviour is exibited if all attribute values are distinct. When there is more sharing between attribute values, the time that it takes to insert and determine membership in the attribute value sets (like [23,42]) gets to be the dominant factor, so the attribute value sets should be efficient.
If you use optimal disjoint sets, then each Find or Merge operation will run in amortized time O(α(n)).
Thus, for each iteration there will be at most n multi-collections (the situation when no multi-collections have been merged so far). To integrate the new collection into the multi-collections, you will have to perform a Find operation on each of the multi-collections k sets to identify all multi-collections to be merged, which takes time bounded by O(nkα(n)). To merge the at most k multi-collections found this way takes O(k2α(n)).
So for all iteration the time is bounded by O(n(nkα(n)+k2α(n))) = O(n(nkα(n))) = O(n2kα(n)) = O(n2α(n)) since k is a constant.
Because α(n) for all practical purposes is also a constant, the total time is bounded by O(n2).
来源:https://stackoverflow.com/questions/967064/union-of-all-intersecting-sets