I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
The numbers are unique and all less than 50.
I then have several data items:
1 {1,
You can build a reverse index that maps each element to the "haystack" sets that contain it:
std::set<int> needle; // {4, 7, 12, 18}
std::vector<std::set<int>> haystacks;
// A list of each of your data sets:
// 1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
// 2 {3, 4, 6, 7, 15, 23, 34, 38}
// 3 {4, 7, 12, 18}
// 4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
// 5 {2, 4, 6, 7, 13,
std::unordered_map<int, std::set<int>> element_haystacks;
// element_haystacks maps each integer to the haystacks that contain it
// (each key is an integer that occurs in some haystack, and each value is
// the set of indices into the 'haystacks' vector, numbered as in the list above):
// 1 -> {1, 4} Element 1 is in sets 1 and 4.
// 2 -> {1, 5} Element 2 is in sets 1 and 5.
// 3 -> {2} Element 3 is in set 2.
// 4 -> {1, 2, 3, 4, 5} Element 4 is in sets 1 through 5.
std::set<int> answer_sets; // The haystack ids that contain every needle element.
bool first = true;
for (std::set<int>::const_iterator it = needle.begin(); it != needle.end(); ++it) {
    const std::set<int> &new_answer = element_haystacks[*it];
    if (first) {
        // Seed with every haystack that contains the first needle element;
        // intersecting with an empty answer_sets would always give nothing.
        answer_sets = new_answer;
        first = false;
    } else {
        std::set<int> existing_answer;
        std::swap(existing_answer, answer_sets);
        // Remove all answers that don't occur in the new element list.
        std::set_intersection(existing_answer.begin(), existing_answer.end(),
                              new_answer.begin(), new_answer.end(),
                              std::inserter(answer_sets, answer_sets.begin()));
    }
    if (answer_sets.empty()) break; // No matches :(
}
// answer_sets now lists the haystack_ids that include all your needle elements.
for (std::set<int>::const_iterator it = answer_sets.begin(); it != answer_sets.end(); ++it) {
    std::cout << "set: " << *it << "\n";
}
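The lookup above assumes element_haystacks has already been filled in. A minimal sketch of how it might be built from the haystacks vector follows; the function name build_element_haystacks is made up for illustration, and the ids it stores are 0-based vector indices rather than the 1-based numbering used in the comments.

```cpp
#include <cstddef>
#include <set>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the original answer): maps each integer to
// the 0-based indices of the haystacks that contain it.
std::unordered_map<int, std::set<int>> build_element_haystacks(
    const std::vector<std::set<int>>& haystacks) {
    std::unordered_map<int, std::set<int>> element_haystacks;
    for (std::size_t id = 0; id < haystacks.size(); ++id) {
        for (int element : haystacks[id]) {
            // operator[] creates an empty set the first time an element is seen.
            element_haystacks[element].insert(static_cast<int>(id));
        }
    }
    return element_haystacks;
}
```

Building the index touches every element of every haystack once, so it is a one-off cost that pays for itself only if you run many needle lookups against the same data.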
If I'm not mistaken, this will have a max runtime of O(k*m), where k is the avg number of sets that an integer belongs to and m is the avg size of the needle set (<50). Unfortunately, it'll have a significant memory overhead due to building the reverse mapping (element_haystacks).
I'm sure you could improve this a bit if you stored sorted vectors instead of sets, and element_haystacks could be a 50 element vector instead of a hash_map.
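Here is a rough sketch of that variant, assuming every value lies in [0, 50) and no haystack repeats a value. The names kMaxValue, build_index and find_supersets are invented for this example, and haystack ids are 0-based vector indices.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Assumption taken from the question: every value is in [0, 50).
const int kMaxValue = 50;

// Hypothetical helper: result[v] is the ascending list of haystack indices
// (0-based) that contain v. Assumes each haystack holds unique values.
std::vector<std::vector<int>> build_index(
    const std::vector<std::vector<int>>& haystacks) {
    std::vector<std::vector<int>> element_haystacks(kMaxValue);
    for (std::size_t id = 0; id < haystacks.size(); ++id) {
        for (int v : haystacks[id]) {
            // Ids are appended in increasing order, so each list stays sorted.
            // at() throws std::out_of_range if v is outside [0, kMaxValue).
            element_haystacks.at(v).push_back(static_cast<int>(id));
        }
    }
    return element_haystacks;
}

// Hypothetical helper: returns the indices of the haystacks that contain
// every element of needle. An empty needle yields an empty result here.
std::vector<int> find_supersets(
    const std::vector<int>& needle,
    const std::vector<std::vector<int>>& element_haystacks) {
    std::vector<int> answer;
    bool first = true;
    for (int v : needle) {
        const std::vector<int>& candidates = element_haystacks.at(v);
        if (first) {
            answer = candidates;  // Seed with the first element's haystacks.
            first = false;
        } else {
            std::vector<int> narrowed;
            std::set_intersection(answer.begin(), answer.end(),
                                  candidates.begin(), candidates.end(),
                                  std::back_inserter(narrowed));
            answer.swap(narrowed);
        }
        if (answer.empty()) break;  // No haystack can match any more.
    }
    return answer;
}
```

Sorted vectors keep the ids contiguous in memory, which tends to make std::set_intersection faster than walking tree-based sets, and indexing a fixed 50-slot table avoids hashing altogether.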