I'm working on porting a MATLAB simulation into C++. To do this, I am trying to replicate MATLAB's randsample() function. I haven't figured out an efficient way to do this.
Here's an approach that doesn't require generating and shuffling a huge list, in case N is huge but k is not:
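#include <algorithm>
#include <random>
#include <unordered_set>
#include <vector>

// Defined below.
std::unordered_set<int> pickSet(int N, int k, std::mt19937& gen);

std::vector<int> pick(int N, int k) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::unordered_set<int> elems = pickSet(N, k, gen);
    // ok, now we have a set of k elements. but now
    // it's in an unspecified but deterministic order,
    // so we have to shuffle it:
    std::vector<int> result(elems.begin(), elems.end());
    std::shuffle(result.begin(), result.end(), gen);
    return result;
}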
The naive way to implement pickSet is:
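std::unordered_set<int> pickSet(int N, int k, std::mt19937& gen)
{
    std::uniform_int_distribution<> dis(1, N);
    std::unordered_set<int> elems;
    // keep drawing until we have k distinct values (requires k <= N)
    while (elems.size() < static_cast<std::size_t>(k)) {
        elems.insert(dis(gen));
    }
    return elems;
}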
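To put a rough number on the cost of the rejection loop above (my own back-of-the-envelope estimate, assuming the draws are independent and uniform on 1..N): once i distinct values are in the set, a draw is new with probability (N - i)/N, so the expected number of draws needed to collect k values is

```latex
\mathbb{E}[\text{draws}]
  \;=\; \sum_{i=0}^{k-1} \frac{N}{N-i}
  \;=\; N\,\bigl(H_N - H_{N-k}\bigr),
\qquad H_n = \sum_{j=1}^{n} \frac{1}{j}.
```

For k much smaller than N this is close to k. For k = N/2 it is about N ln 2, roughly 1.39 draws per element, and for k = N it grows like N ln N.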
But if k is large relative to N, this loop runs into many collisions and can be pretty slow. We can do better by guaranteeing that every iteration adds exactly one new element (an algorithm due to Robert Floyd):
std::unordered_set<int> pickSet(int N, int k, std::mt19937& gen)
{
    std::unordered_set<int> elems;
    // r runs over N-k+1 .. N so that every value in 1..N can be chosen
    for (int r = N - k + 1; r <= N; ++r) {
        int v = std::uniform_int_distribution<>(1, r)(gen);
        // there are two cases.
        // v is not in elems ==> add it
        // v is in elems ==> well, r is definitely not, because
        // this is the first iteration in the loop that we could've
        // picked something that big.
        if (!elems.insert(v).second) {
            elems.insert(r);
        }
    }
    return elems;
}
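For anyone who wants to try this out, here is a minimal self-contained sketch that puts the pieces together. The names floydSample and pickSeeded are made up for this example; the only real difference from pick() above is that the generator is passed in, which is an assumption on my part to make runs reproducible with a fixed seed. It assumes 0 <= k <= N and returns values in 1..N, which I believe matches what randsample(N, k) gives without replacement.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <unordered_set>
#include <vector>

// Floyd's sampling: k distinct integers drawn uniformly from 1..N.
// (floydSample is a hypothetical name for this sketch.)
std::unordered_set<int> floydSample(int N, int k, std::mt19937& gen)
{
    assert(0 <= k && k <= N);
    std::unordered_set<int> elems;
    for (int r = N - k + 1; r <= N; ++r) {
        int v = std::uniform_int_distribution<>(1, r)(gen);
        // If v was already chosen, take r instead; r cannot be in the
        // set yet because earlier iterations only drew values below r.
        if (!elems.insert(v).second) {
            elems.insert(r);
        }
    }
    return elems;
}

// Same idea as pick(), but the caller supplies the generator.
// (pickSeeded is a hypothetical name for this sketch.)
std::vector<int> pickSeeded(int N, int k, std::mt19937& gen)
{
    std::unordered_set<int> elems = floydSample(N, k, gen);
    std::vector<int> result(elems.begin(), elems.end());
    // The set's iteration order is not random, so shuffle the result.
    std::shuffle(result.begin(), result.end(), gen);
    return result;
}
```

Calling it looks like `std::mt19937 gen(12345); auto sample = pickSeeded(1000000, 5, gen);`. The loop does exactly k iterations no matter how close k is to N, so even k == N works and just yields a random permutation of 1..N.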