I have a long Python generator that I want to "thin out" by randomly selecting a subset of values. Unfortunately, random.sample() will not work with arbitrary iterables.
Use the O(n) Algorithm R (https://en.wikipedia.org/wiki/Reservoir_sampling) to select k random elements from an iterable:
import itertools
import random

def reservoir_sample(iterable, k):
    it = iter(iterable)
    if not (k > 0):
        raise ValueError("sample size must be positive")

    sample = list(itertools.islice(it, k))  # fill the reservoir
    random.shuffle(sample)  # if the number of items is less than *k* then
                            # return all items in random order.
    for i, item in enumerate(it, start=k+1):
        j = random.randrange(i)  # random [0..i)
        if j < k:
            sample[j] = item  # replace item with gradually decreasing probability
    return sample
Example (the output is random, so it will differ between runs):
>>> reservoir_sample(iter('abcdefghijklmnopqrstuvwxyz'), 5)
['w', 'i', 't', 'b', 'e']
The reservoir_sample() code is from this answer.
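To show this on an actual generator rather than a string, here is a minimal sketch. The function is repeated so the block runs on its own; the `squares` generator and the frequency check are illustrative assumptions, not part of the original answer. The check draws a size-3 sample from 10 items many times; if the sampling is uniform, each item should appear in roughly 30% of the samples.

```python
import collections
import itertools
import random


def reservoir_sample(iterable, k):
    """Return k items chosen uniformly at random from iterable (Algorithm R)."""
    it = iter(iterable)
    if not (k > 0):
        raise ValueError("sample size must be positive")
    sample = list(itertools.islice(it, k))  # fill the reservoir
    random.shuffle(sample)  # short inputs come back whole, in random order
    for i, item in enumerate(it, start=k + 1):
        j = random.randrange(i)  # uniform in [0, i)
        if j < k:
            sample[j] = item  # item i is kept with probability k / i
    return sample


def squares(n):
    """Hypothetical generator standing in for a long stream of values."""
    for x in range(n):
        yield x * x


random.seed(0)  # seeded only so that repeated runs give the same sample
picked = reservoir_sample(squares(1_000_000), 5)
print(picked)

# Empirical uniformity check (illustrative): count how often each of
# 10 items lands in a size-3 sample over many independent trials.
trials = 20_000
counts = collections.Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(10), 3))
freqs = {item: counts[item] / trials for item in range(10)}
print(freqs)
```

The generator is consumed in a single pass and only k items are held in memory at any time, which is why this works where random.sample() does not.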