Question
I have a std::collections::HashSet, and I want to sample and remove a uniformly random element.
Currently, I sample a random index using Rng::gen_range from the rand crate, iterate over the HashSet up to that index to get the element, and then remove the selected element. This works, but it's not efficient. Is there an efficient way to randomly sample an element?
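Here's a stripped-down version of what my code looks like:

    extern crate rand; // only needed on the 2015 edition

    use std::collections::HashSet;
    use rand::thread_rng;
    use rand::Rng;

    let mut hash_set = HashSet::new();
    // ... Fill up hash_set ...

    // Pick a uniformly random position. This is the two-argument gen_range of
    // rand 0.7 and earlier; it panics if the set is empty.
    let index = thread_rng().gen_range(0, hash_set.len());
    // O(n): walk the iterator up to that position, then clone the element so
    // the borrow of the set ends before the removal.
    let element = hash_set.iter().nth(index).unwrap().clone();
    hash_set.remove(&element);
    // ... Use element ...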
Answer 1:
The only data structures that allow uniform sampling in constant time are those with constant-time index access. HashSet does not provide indexing, so you can't generate random samples from it in constant time.
I suggest converting your hash set to a Vec first and then sampling from the vector. To remove an element, simply move the last element into its place; the order of the elements in the vector is immaterial anyway.
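As a minimal sketch of the "move the last element into its place" step: the standard library's Vec::swap_remove does exactly this. The helper name remove_unordered below is made up for illustration, and the index is passed in by the caller rather than drawn at random.

```rust
/// Removes the element at `index` in O(1) without preserving order.
/// `Vec::swap_remove` overwrites the slot with the last element and then
/// shrinks the vector by one. Returns `None` if `index` is out of bounds
/// (plain `swap_remove` would panic in that case).
fn remove_unordered<T>(items: &mut Vec<T>, index: usize) -> Option<T> {
    if index < items.len() {
        Some(items.swap_remove(index))
    } else {
        None
    }
}
```

For example, removing index 1 from [10, 20, 30, 40] returns 20 and leaves [10, 40, 30].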
If you want to consume all elements from the set in random order, you can also shuffle the vector once and then iterate over it.
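The shuffle-once approach could look roughly like the sketch below. To keep it free of external crates, it uses a hand-written Fisher-Yates shuffle and a small xorshift generator; both XorShift64 and into_shuffled_vec are hypothetical names, and the generator is only a stand-in for a real RNG. With the rand crate you would more likely call its slice shuffle method instead.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a real RNG such as `rand::thread_rng()`.
/// The seed must be non-zero, otherwise xorshift only ever yields zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Returns an index in `0..bound` (`bound` must be non-zero). The modulo
    /// introduces a slight bias that a real RNG's range sampling avoids.
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Moves the set's elements into a vector and shuffles it in place with
/// Fisher-Yates. Iterating over the result visits every element exactly
/// once, in random order, at O(n) total cost.
fn into_shuffled_vec<T>(set: HashSet<T>, rng: &mut XorShift64) -> Vec<T> {
    let mut items: Vec<T> = set.into_iter().collect();
    for i in (1..items.len()).rev() {
        let j = rng.below(i + 1);
        items.swap(i, j);
    }
    items
}
```

This pays the conversion cost once up front, so it only fits the case where the set is not modified while its elements are being consumed.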
Here is an example implementation for removing a random element from a Vec in constant time:
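
    use rand::Rng;

    pub trait RemoveRandom {
        type Item;

        /// Removes and returns a uniformly random element, or None if empty.
        /// Callers pass any RNG, e.g. &mut rand::thread_rng().
        fn remove_random<R: Rng>(&mut self, rng: &mut R) -> Option<Self::Item>;
    }

    impl<T> RemoveRandom for Vec<T> {
        type Item = T;

        fn remove_random<R: Rng>(&mut self, rng: &mut R) -> Option<Self::Item> {
            if self.is_empty() {
                None
            } else {
                // Two-argument gen_range is the form used by rand 0.7 and earlier.
                let index = rng.gen_range(0, self.len());
                // swap_remove fills the hole with the last element: O(1).
                Some(self.swap_remove(index))
            }
        }
    }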
Answer 2:
Thinking about Sven Marnach's answer, I wanted to use a vector, but I also needed constant-time insertion without duplicates. Then I realized that I can maintain both a vector and a set and ensure that they hold the same elements at all times. This allows both constant-time insertion with deduplication and constant-time random removal.
Here's the implementation I ended up with:
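
    use std::collections::HashSet;
    use rand::{thread_rng, Rng};

    /// Invariant: `set` and `vec` always contain exactly the same elements.
    struct VecSet<T> {
        set: HashSet<T>,
        vec: Vec<T>,
    }

    impl<T> VecSet<T>
    where
        T: Clone + Eq + std::hash::Hash,
    {
        fn new() -> Self {
            Self {
                set: HashSet::new(),
                vec: Vec::new(),
            }
        }

        /// Inserts `elem` unless it is already present.
        fn insert(&mut self, elem: T) {
            assert_eq!(self.set.len(), self.vec.len());
            // The set decides whether the element is new; a clone goes into
            // the set and the original into the vector.
            let was_new = self.set.insert(elem.clone());
            if was_new {
                self.vec.push(elem);
            }
        }

        /// Removes and returns a uniformly random element.
        /// Panics if the collection is empty, so check is_empty() first.
        fn remove_random(&mut self) -> T {
            assert_eq!(self.set.len(), self.vec.len());
            // Two-argument gen_range is the form used by rand 0.7 and earlier.
            let index = thread_rng().gen_range(0, self.vec.len());
            let elem = self.vec.swap_remove(index);
            let was_present = self.set.remove(&elem);
            assert!(was_present);
            elem
        }

        fn is_empty(&self) -> bool {
            assert_eq!(self.set.len(), self.vec.len());
            self.vec.is_empty()
        }
    }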
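To see the set-plus-vector invariant at work without depending on the rand crate, here is a condensed variation, not the code above: remove_with is a hypothetical method that takes the index choice as a closure and returns None on an empty collection instead of panicking. With rand 0.7 the closure might be |n| thread_rng().gen_range(0, n).

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Condensed variant: `set` deduplicates, `vec` provides O(1) indexing.
struct VecSet<T> {
    set: HashSet<T>,
    vec: Vec<T>,
}

impl<T: Clone + Eq + Hash> VecSet<T> {
    fn new() -> Self {
        Self { set: HashSet::new(), vec: Vec::new() }
    }

    /// Returns `true` if `elem` was not already present.
    fn insert(&mut self, elem: T) -> bool {
        let was_new = self.set.insert(elem.clone());
        if was_new {
            self.vec.push(elem);
        }
        was_new
    }

    /// Removes the element at the index chosen by `pick`. The closure gets
    /// the current length and must return a value below it (hypothetical
    /// hook standing in for the random draw). Returns `None` when empty.
    fn remove_with<F: FnOnce(usize) -> usize>(&mut self, pick: F) -> Option<T> {
        if self.vec.is_empty() {
            return None;
        }
        let index = pick(self.vec.len());
        let elem = self.vec.swap_remove(index);
        self.set.remove(&elem);
        Some(elem)
    }

    fn len(&self) -> usize {
        self.vec.len()
    }
}
```

Inserting "a", "b", and "a" again leaves two elements, and repeated calls to remove_with drain them until it returns None. Passing the index choice in also makes the removal logic testable with fixed indices.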
Source: https://stackoverflow.com/questions/53755017/can-i-randomly-sample-from-a-hashset-efficiently