Data structure for matching sets

有刺的猬 2021-02-02 00:14

I have an application that works with a number of sets. A set might be
{4, 7, 12, 18}
where the numbers are unique and all less than 50.

I then have several data items:
1 {1,

13 answers
  •  慢半拍i (original poster)
    2021-02-02 00:31

    You can build a reverse index of "haystack" lists that contain each element:

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <set>
    #include <unordered_map>
    #include <vector>

    std::set<int> needle;  // {4, 7, 12, 18}
    std::vector<std::set<int>> haystacks;
    // A list of each of your data sets:
    // 1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
    // 2 {3, 4, 6, 7, 15, 23, 34, 38}
    // 3 {4, 7, 12, 18}
    // 4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
    // 5 {2, 4, 6, 7, 13, 

    std::unordered_map<int, std::set<int>> element_haystacks;
    // element_haystacks maps each integer to the sets that contain it
    // (the keys are the integers from the haystack sets, and
    // the set values are indexes into the 'haystacks' vector):
    // 1 -> {1, 4}  Element 1 is in sets 1 and 4.
    // 2 -> {1, 5}  Element 2 is in sets 1 and 5.
    // 3 -> {2}  Element 3 is in set 2.
    // 4 -> {1, 2, 3, 4, 5}  Element 4 is in sets 1 through 5.
    std::set<int> answer_sets;  // The list of haystack sets that contain your set.
    bool first = true;
    for (int element : needle) {
      const std::set<int> &new_answer = element_haystacks[element];
      if (first) {
        // Seed the candidates from the first needle element; intersecting
        // with an empty answer_sets would always give an empty result.
        answer_sets = new_answer;
        first = false;
      } else {
        std::set<int> existing_answer;
        std::swap(existing_answer, answer_sets);
        // Remove all answers that don't occur in the new element list.
        std::set_intersection(existing_answer.begin(), existing_answer.end(),
                              new_answer.begin(), new_answer.end(),
                              std::inserter(answer_sets, answer_sets.begin()));
      }
      if (answer_sets.empty()) break;  // No matches :(
    }

    // answer_sets now lists the haystack ids that include all your needle elements.
    for (int haystack_id : answer_sets) {
      std::cout << "set: " << haystack_id << "\n";
    }
    

    If I'm not mistaken, this will have a max runtime of O(k*m), where k is the avg number of sets that an integer belongs to and m is the size of the needle set (<50). Unfortunately, it'll have a significant memory overhead due to building the reverse mapping (element_haystacks).
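
    The snippet above takes element_haystacks as given. One way the reverse mapping might be built is sketched below. The helper name build_element_haystacks is hypothetical (it is not part of the answer), and the sketch assumes haystack ids are 1-based, matching the comments in the snippet.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not in the original answer): builds the reverse index
// from each element value to the 1-based ids of the haystacks containing it.
std::unordered_map<int, std::set<int>> build_element_haystacks(
    const std::vector<std::set<int>>& haystacks) {
  std::unordered_map<int, std::set<int>> element_haystacks;
  for (std::size_t i = 0; i < haystacks.size(); ++i) {
    const int id = static_cast<int>(i) + 1;  // assumption: ids start at 1
    for (int element : haystacks[i]) {
      element_haystacks[element].insert(id);
    }
  }
  return element_haystacks;
}
```

    Every element of every haystack is stored once more as an id in the index, which is where the memory overhead comes from.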

    I'm sure you could improve this a bit by storing sorted vectors instead of sets, and by making element_haystacks a 50-element vector instead of a hash map.
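
    A minimal sketch of that variant follows, assuming every value lies in the range 0 to 49 and that no haystack repeats a value. The names kMaxValue, ReverseIndex, build_index and find_supersets are made up for illustration, and treating an empty needle as matching nothing is also an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

constexpr int kMaxValue = 50;  // assumption: every element is in [0, 50)

// index[v] holds the sorted, 1-based ids of the haystacks that contain v.
using ReverseIndex = std::array<std::vector<int>, kMaxValue>;

ReverseIndex build_index(const std::vector<std::vector<int>>& haystacks) {
  ReverseIndex index;
  for (std::size_t i = 0; i < haystacks.size(); ++i) {
    for (int v : haystacks[i]) {
      // Ids are appended in increasing order, so each list stays sorted
      // as long as a haystack never repeats a value.
      index[v].push_back(static_cast<int>(i) + 1);
    }
  }
  return index;
}

// Returns the ids of all haystacks that contain every element of needle.
std::vector<int> find_supersets(const ReverseIndex& index,
                                const std::vector<int>& needle) {
  if (needle.empty()) return {};  // assumption: an empty needle matches nothing
  std::vector<int> candidates = index[needle.front()];
  for (std::size_t i = 1; i < needle.size() && !candidates.empty(); ++i) {
    const std::vector<int>& ids = index[needle[i]];
    std::vector<int> kept;
    // Both inputs are sorted, so the intersection is a single linear merge.
    std::set_intersection(candidates.begin(), candidates.end(),
                          ids.begin(), ids.end(), std::back_inserter(kept));
    candidates.swap(kept);
  }
  return candidates;
}
```

    With the first four haystacks listed in the comments above, the needle {4, 7, 12, 18} yields ids 1, 3 and 4. The intersection logic is the same as before; only the containers change, which avoids hashing and per-node allocations.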
