C++ randomly sample k numbers from range 0:n-1 (n > k) without replacement

死守一世寂寞 2021-01-02 13:11

I'm working on porting a MATLAB simulation into C++. To do this, I am trying to replicate MATLAB's randsample() function. I haven't figured out an efficient way to do thi

5 answers
  •  心在旅途
    2021-01-02 13:53

    Here is a solution I came up with that generates the samples in random order, rather than in a deterministic order that would need to be shuffled afterwards:

    #include <cstdlib>  // rand
    #include <vector>

    // Returns `samples` distinct values from [0, range), in the order drawn.
    // Requires 0 <= samples <= range. Note that rand() % m is slightly biased.
    std::vector<int> GenerateRandomSample(int range, int samples) {
      std::vector<int> solution; // Populated in the order that the numbers are generated in.
      std::vector<int> to_exclude; // Inserted into in sorted order.
      for(int i = 0; i < samples; ++i) {
        // Index among the values that have not been drawn yet.
        int raw_rand = rand() % (range - static_cast<int>(to_exclude.size()));
        // Skip past every already-drawn value at or below the candidate.
        // This part can be optimized as a binary search.
        int offset = 0;
        while(offset < static_cast<int>(to_exclude.size()) &&
            (raw_rand+offset) >= to_exclude[offset]) {
          ++offset;
        }
        // Alternatively substitute Binary Search to avoid linearly
        // searching for where to put the new element. Arguably not
        // actually a benefit.
        // int offset = ModifiedBinarySearch(to_exclude, raw_rand);
    
        int to_insert = (raw_rand + offset);
        to_exclude.insert(to_exclude.begin() + offset, to_insert);
        solution.push_back(to_insert);
      }
      return solution;
    }
    

    I added an optional binary search for the position at which to insert each newly generated value. When I first benchmarked it over large ranges (N) and sample sizes (K) on codeinterview.io, I found no significant benefit over a linear scan that exits early.

    EDIT: After more extensive testing, I found that with sufficiently large parameters (e.g. N = 1000, K = 500, TRIALS = 10000) the binary search does in fact offer a considerable improvement. For those parameters: binary search ~2.7 seconds; linear search ~5.1 seconds; deterministic method (without the shuffle, as proposed by Barry in the accepted answer based on Robert Floyd) ~3.8 seconds.

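    #include <vector>

    // Returns the same offset as the linear scan in GenerateRandomSample: the
    // smallest index i with raw_rand + i < collection[i], or collection.size()
    // if there is none. `collection` must be sorted and free of duplicates.
    int ModifiedBinarySearch(const std::vector<int>& collection, int raw_rand) {
      const int size = static_cast<int>(collection.size());
      int offset = 0;
      int beg = 0, end = size - 1;
      bool upper_range = false;
      while (beg <= end) {
        offset = (beg + end) / 2;
        int left = collection[offset];
        // Past the last element, compare against the last element itself.
        int right = (offset+1 < size ?
            collection[offset+1] :
            collection[size - 1]);
        if ((raw_rand+offset) < left) {
          // The candidate is below collection[offset]: search the lower half.
          upper_range = false;
          end = offset - 1;
        } else if ((raw_rand+offset+1) >= right) {
          // The candidate also skips the next element: search the upper half.
          upper_range = true;
          beg = offset + 1;
        } else {
          // The candidate falls between collection[offset] and its successor.
          upper_range = true;
          break;
        }
      }
      // Relies on integer division truncating toward zero when end == -1.
      offset = ((beg + end) / 2)  + (upper_range ? 1 : 0);
      return offset;
    }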
    
