Python has my_sample = random.sample(range(100), 10) to randomly sample without replacement from [0, 100).
Suppose I have sampled n such numbers and now I want to sample one more without replacement (i.e., without duplicating any previously sampled number). How do I do that efficiently?
It's surprising this is not already implemented in one of the core functions, but here is a clean version that returns the sampled values and the list with those values removed:
import random

def sample_n_points_without_replacement(n, set_of_points):
    sampled_point_indices = random.sample(range(len(set_of_points)), n)
    # delete from the highest index down, so earlier deletions
    # do not shift the positions of later ones
    sampled_point_indices.sort(reverse=True)
    sampled_points = [set_of_points[i] for i in sampled_point_indices]
    for i in sampled_point_indices:
        del set_of_points[i]
    return sampled_points, set_of_points
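For instance, a quick usage sketch (the point list here is just made up for illustration):

points = list(range(10))
sampled, remaining = sample_n_points_without_replacement(3, points)
print(sampled)    # e.g. [7, 4, 1]
print(remaining)  # the 7 values that were not drawn, original order preserved

Note that the input list is mutated in place and also returned for convenience.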
This is a side note: suppose you want to solve exactly the same problem of sampling without replacement on a list (which I'll call sample_space), but instead of sampling uniformly over the set of elements you have not sampled already, you are given an initial probability distribution p that tells you the probability of sampling the i-th element, were you to sample over the whole space.
Then the following implementation using numpy is numerically stable:
import numpy as np

def iterative_sampler(sample_space, p=None):
    """
    Samples elements from a sample space (a list)
    with a given probability distribution p (a NumPy array)
    without replacement. If called until StopIteration is raised,
    effectively produces a permutation of the sample space.
    """
    if p is None:
        p = np.array([1 / len(sample_space) for _ in sample_space])
    if not isinstance(sample_space, list) or not isinstance(p, np.ndarray):
        raise TypeError("Required types:\nsample_space: list\np: np.ndarray")
    # Main loop
    n = len(sample_space)
    idxs_left = list(range(n))
    for i in range(n):
        # renormalize the probabilities of the remaining elements at
        # every step; this is what keeps the draw numerically stable
        idx = np.random.choice(
            range(n - i),
            p=p[idxs_left] / p[idxs_left].sum()
        )
        yield sample_space[idxs_left[idx]]
        del idxs_left[idx]
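A quick usage sketch, continuing from the code above (the weights are made up for illustration):

sampler = iterative_sampler(["a", "b", "c"], p=np.array([0.7, 0.2, 0.1]))
print(next(sampler))  # "a" with probability 0.7 on the first draw
print(list(sampler))  # the remaining two elements, in random order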
It's short and concise; I like it. Let me know what you guys think!
Ok, here we go. This should be the fastest possible non-probabilistic algorithm. It has runtime O(k⋅log²(s) + f⋅log(f)) ⊂ O(k⋅log²(f+k) + f⋅log(f)) and space O(k+f). Here f is the number of forbidden numbers and s is the length of the longest streak of forbidden numbers. The expectation for s is more complicated, but obviously bounded by f. If you assume that s^log₂(s) is bigger than f, or are just unhappy about the fact that s is once again probabilistic, you can change the log part to a bisection search in forbidden[pos:] to get O(k⋅log(f+k) + f⋅log(f)).
The actual implementation here is O(k⋅(k+f) + f⋅log(f)), as insertion into the list forbidden is O(n). This is easy to fix by replacing that list with a blist sortedlist.
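For example, here is a sketch of that swap using sortedcontainers.SortedList (a maintained alternative to blist; my suggestion, not part of the original code):

from sortedcontainers import SortedList  # third-party: pip install sortedcontainers

forbidden = SortedList([3, 7, 7, 11])  # built in O(f⋅log(f))
pos = forbidden.bisect_right(7)        # O(log(f)), same semantics as bisect.bisect
forbidden.add(9)                       # O(log(f)) insertion, replacing bisect.insort
print(pos, list(forbidden))            # 3 [3, 7, 7, 9, 11]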
I also added some comments, because this algorithm is ridiculously complex. The linear part does the same as the log part, but needs s instead of log²(s) time.
import bisect
import random

def sample(k, end, forbid):
    forbidden = sorted(forbid)
    out = []
    # remove the last block from forbidden if it touches end
    for end in reversed(range(end + 1)):
        if len(forbidden) > 0 and forbidden[-1] == end:
            del forbidden[-1]
        else:
            break
    for i in range(k):
        v = random.randrange(end - len(forbidden) + 1)
        # increase v by the number of values < v
        pos = bisect.bisect(forbidden, v)
        v += pos
        # this number might also be already taken, find the
        # first free spot
        ##### linear
        #while pos < len(forbidden) and forbidden[pos] <= v:
        #    pos += 1
        #    v += 1
        ##### log
        while pos < len(forbidden) and forbidden[pos] <= v:
            step = 2
            # when this is finished, we know that:
            # • forbidden[pos + step/2] <= v + step/2
            # • forbidden[pos + step] > v + step
            # so repeat until (checked by outer loop):
            #   forbidden[pos + step/2] == v + step/2
            while (pos + step <= len(forbidden)) and \
                    (forbidden[pos + step - 1] <= v + step - 1):
                step = step << 1
            pos += step >> 1
            v += step >> 1
        if v == end:
            end -= 1
        else:
            bisect.insort(forbidden, v)
        out.append(v)
    return out
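A quick usage sketch (my numbers; note that end is inclusive here, as the randrange(end - len(forbidden) + 1) call implies):

already_drawn = [2, 3, 5]
new_draws = sample(4, 99, already_drawn)        # 4 fresh values from [0, 99]
assert not set(new_draws) & set(already_drawn)  # avoids the forbidden values
assert len(set(new_draws)) == 4                 # and never repeats itself
print(new_draws)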
Now to compare that to the “hack” (and the default implementation in Python) that Veedrac proposed, which has space O(f+k) and time O(n/(n-(f+k))) per draw, since n/(n-(f+k)) is the expected number of “guesses”:
I just plotted this for k=10 and a reasonably big n=10000 (it only gets more extreme for bigger n). And I have to say: I only implemented this because it seemed like a fun challenge, but even I am surprised by how extreme this is:
Let’s zoom in to see what’s going on:
Yes – the guesses are even faster for the 9998th number you generate. Note that, as you can see in the first plot, even my one-liner is probably faster for bigger f/n (but still has rather horrible space requirements for big n).

To drive the point home: the only thing you are spending time on here is generating the set, as that's the f factor in Veedrac's method.
So I hope my time here was not wasted and I managed to convince you that Veedrac's method is simply the way to go. I can kind of understand why that probabilistic part troubles you, but maybe think of the fact that hashmaps (= Python dicts) and tons of other algorithms work with similar methods, and they seem to be doing just fine.
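For reference, here is a minimal sketch of the rejection-based idea being endorsed (my paraphrase; Veedrac's actual code may differ):

import random

def sample_with_rejection(k, n, sampled_so_far):
    # Draw k new values from range(n), skipping anything seen before.
    seen = set(sampled_so_far)  # the O(f) set construction dominates
    out = []
    while len(out) < k:
        v = random.randrange(n)
        if v not in seen:       # on a collision, simply guess again
            seen.add(v)
            out.append(v)
    return out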
You might be afraid of the variance in the number of repetitions. As noted above, this follows a geometric distribution with p = (n-f)/n. So the standard deviation (= the amount you “should expect” the result to deviate from the expected average) is

σ = √(1-p)/p = √(f⋅n)/(n-f)

which is basically the same as the mean (√(f⋅n) < √(n²) = n).
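A quick numerical sanity check of that claim (my sketch; the numbers are arbitrary):

import numpy as np

n, f = 10_000, 9_000
p = (n - f) / n
guesses = np.random.geometric(p, size=100_000)  # trials until first success
print(guesses.mean())  # ≈ n/(n-f) = 10
print(guesses.std())   # ≈ √(f⋅n)/(n-f) ≈ 9.49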
**edit:** I just realized that s is actually also n/(n-(f+k)). So a more exact runtime for my algorithm is O(k⋅log²(n/(n-(f+k))) + f⋅log(f)). Which is nice, since given the graphs above, it proves my intuition that this is quite a bit faster than O(k⋅log(f+k) + f⋅log(f)). But rest assured that this does not change anything about the results above, as f⋅log(f) is the absolutely dominant part of the runtime.
This is a rewritten version of @necromancer's cool solution. It wraps the logic in a class to make it much easier to use correctly, and uses more dict methods to cut the lines of code.
from random import randrange

class Sampler:
    def __init__(self, n):
        self.n = n  # number remaining from original range(n)
        # i is a key iff i < n and i was already returned;
        # in that case, state[i] is a value to return
        # instead of i.
        self.state = dict()

    def get(self):
        n = self.n
        if n <= 0:
            raise ValueError("range exhausted")
        result = i = randrange(n)
        state = self.state
        # Most of the fiddling here is just to get
        # rid of state[n-1] (if it exists). It's a
        # space optimization.
        if i == n - 1:
            if i in state:
                result = state.pop(i)
        elif i in state:
            result = state[i]
            if n - 1 in state:
                state[i] = state.pop(n - 1)
            else:
                state[i] = n - 1
        elif n - 1 in state:
            state[i] = state.pop(n - 1)
        else:
            state[i] = n - 1
        self.n = n - 1
        return result
Here's a basic driver:
s = Sampler(100)
allx = [s.get() for _ in range(100)]
assert sorted(allx) == list(range(100))
from collections import Counter
c = Counter()
for i in range(6000):
    s = Sampler(3)
    one = tuple(s.get() for _ in range(3))
    c[one] += 1
for k, v in sorted(c.items()):
    print(k, v)
and sample output:
(0, 1, 2) 1001
(0, 2, 1) 991
(1, 0, 2) 995
(1, 2, 0) 1044
(2, 0, 1) 950
(2, 1, 0) 1019
By eyeball, that distribution is fine (run a chi-squared test if you're skeptical). Some of the solutions here don't give each permutation with equal probability (even though they return each k-subset of n with equal probability), so they are unlike random.sample() in that respect.
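If you would rather not eyeball it, a quick check along those lines (assuming SciPy is available; the counts are copied from the run above):

from scipy import stats

observed = [1001, 991, 995, 1044, 950, 1019]  # counts from the sample output
chi2, pvalue = stats.chisquare(observed)      # null hypothesis: all 6 orders equally likely
print(chi2, pvalue)  # a large p-value means no evidence of bias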
Reasonably fast one-liner (O(n + m), where n is the range size and m the old sample size):

next_sample = random.sample(list(set(range(100)).difference(my_sample)), 10)  # list() needed on Python 3.11+
Edit: see cleaner versions below by @TimPeters and @Chronial. A minor edit pushed this to the top.
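For repeated incremental use you can keep extending the exclusion list (a usage sketch with made-up sizes):

import random

sampled = []
for _ in range(5):
    batch = random.sample(list(set(range(100)).difference(sampled)), 10)
    sampled.extend(batch)
assert len(set(sampled)) == 50  # no duplicates across batches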
Here is what I believe is the most efficient solution for incremental sampling. Instead of a list of previously sampled numbers, the state to be maintained by the caller comprises a dictionary that is ready for use by the incremental sampler, plus a count of the numbers remaining in the range.

The following is a demonstrative implementation. Compared to other solutions:

- time per sample: O(log(number_previously_sampled))
- space: O(number_previously_sampled)
Code:

import random

def remove(i, n, state):
    # Treat `state` as a sparse overlay on the virtual array [0, 1, ..., N-1]:
    # state[j] is the value currently sitting at slot j, if it differs from j.
    # Return the value at slot i, then move the value at slot n-1 into slot i
    # (the classic Fisher-Yates step, done lazily through the dict).
    if i == n - 1:
        if i in state:
            t = state[i]
            del state[i]
            return t
        else:
            return i
    else:
        if i in state:
            t = state[i]
            if n - 1 in state:
                state[i] = state[n - 1]
                del state[n - 1]
            else:
                state[i] = n - 1
            return t
        else:
            if n - 1 in state:
                state[i] = state[n - 1]
                del state[n - 1]
            else:
                state[i] = n - 1
            return i
s = dict()
for n in range(100, 0, -1):
    print(remove(random.randrange(n), n, s))