Data structure for matching sets

前端未结

关注

 13  1241

有刺的猬 2021-02-02 00:14

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
unique numbers and all less than 50.

I then have several data items:
1 {1,

13条回答

慢半拍i (楼主)

2021-02-02 00:31

You can build a reverse index of "haystack" lists that contain each element:

std::set needle;  // {4, 7, 12, 18}
std::vector> haystacks;
// A list of your each of your data sets:
// 1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
// 2 {3, 4, 6, 7, 15, 23, 34, 38}
// 3 {4, 7, 12, 18}
// 4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
// 5 {2, 4, 6, 7, 13, 

std::hash_map[int, set>  element_haystacks;
// element_haystacks maps each integer to the sets that contain it
// (the key is the integers from the haystacks sets, and 
// the set values are the index into the 'haystacks' vector):
// 1 -> {1, 4}  Element 1 is in sets 1 and 4.
// 2 -> {1, 5}  Element 2 is in sets 2 and 4.
// 3 -> {2}  Element 3 is in set 3.
// 4 -> {1, 2, 3, 4, 5}  Element 4 is in sets 1 through 5.  
std::set answer_sets;  // The list of haystack sets that contain your set.
for (set::const_iterator it = needle.begin(); it != needle.end(); ++it) {
  const std::set &new_answer = element_haystacks[i];
  std::set existing_answer;
  std::swap(existing_answer, answer_sets);
  // Remove all answers that don't occur in the new element list.
  std::set_intersection(existing_answer.begin(), existing_answer.end(),
                        new_answer.begin(), new_answer.end(),
                        inserter(answer_sets, answer_sets.begin()));
  if (answer_sets.empty()) break;  // No matches :(
}

// answer_sets now lists the haystack_ids that include all your needle elements.
for (int i = 0; i < answer_sets.size(); ++i) {
  cout << "set: " << element_haystacks[answer_sets[i]];
}

If I'm not mistaken, this will have a max runtime of O(k*m), where is the avg number of sets that an integer belongs to and m is the avg size of the needle set (<50). Unfortunately, it'll have a significant memory overhead due to building the reverse mapping (element_haystacks).

I'm sure you could improve this a bit if you stored sorted vectors instead of sets and element_haystacks could be a 50 element vector instead of a hash_map.

0 讨论(0)

查看其它13个回答