Python has my_sample = random.sample(range(100), 10) to randomly sample without replacement from [0, 100).
Suppose I have sampled n such numbers and now I want to sample one more without replacement (i.e., without duplicating any previously sampled number). How do I do that efficiently?
It's surprising this is not already implemented in one of the core functions, but here is a clean version that returns the sampled values and the list with those values removed:
import random

def sample_n_points_without_replacement(n, set_of_points):
    sampled_point_indices = random.sample(range(len(set_of_points)), n)
    # delete from the highest index down, so earlier deletions
    # do not shift the positions of later ones
    sampled_point_indices.sort(reverse=True)
    sampled_points = [set_of_points[i] for i in sampled_point_indices]
    for i in sampled_point_indices:
        del set_of_points[i]
    return sampled_points, set_of_points
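For instance, a quick usage sketch (the point list here is just made up for illustration):

points = list(range(10))
sampled, remaining = sample_n_points_without_replacement(3, points)
print(sampled)    # e.g. [7, 4, 1]
print(remaining)  # the 7 values that were not drawn, original order preserved

Note that the input list is mutated in place and also returned for convenience.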
This is a side note: suppose you want to solve exactly the same problem of sampling without replacement on a list (which I'll call sample_space), but instead of sampling uniformly over the set of elements you have not sampled already, you are given an initial probability distribution p that tells you the probability of sampling the i-th element, were you to sample over the whole space.
Then the following implementation using numpy is numerically stable:
import numpy as np

def iterative_sampler(sample_space, p=None):
    """
    Samples elements from a sample space (a list)
    with a given probability distribution p (a NumPy array)
    without replacement. If called until StopIteration is raised,
    effectively produces a permutation of the sample space.
    """
    if p is None:
        p = np.array([1 / len(sample_space) for _ in sample_space])
    if not isinstance(sample_space, list) or not isinstance(p, np.ndarray):
        raise TypeError("Required types:\nsample_space: list\np: np.ndarray")
    # Main loop
    n = len(sample_space)
    idxs_left = list(range(n))
    for i in range(n):
        # renormalize the probabilities of the remaining elements at
        # every step; this is what keeps the draw numerically stable
        idx = np.random.choice(
            range(n - i),
            p=p[idxs_left] / p[idxs_left].sum()
        )
        yield sample_space[idxs_left[idx]]
        del idxs_left[idx]
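A quick usage sketch, continuing from the code above (the weights are made up for illustration):

sampler = iterative_sampler(["a", "b", "c"], p=np.array([0.7, 0.2, 0.1]))
print(next(sampler))  # "a" with probability 0.7 on the first draw
print(list(sampler))  # the remaining two elements, in random order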
It's short and concise; I like it. Let me know what you guys think!
Ok, here we go. This should be the fastest possible non-probabilistic algorithm. It has runtime O(k⋅log²(s) + f⋅log(f)) ⊂ O(k⋅log²(f+k) + f⋅log(f)) and space O(k+f). Here f is the number of forbidden numbers and s is the length of the longest streak of forbidden numbers. The expectation for s is more complicated, but obviously bounded by f. If you assume that s^log₂(s) is bigger than f, or are just unhappy about the fact that s is once again probabilistic, you can change the log part to a bisection search in forbidden[pos:] to get O(k⋅log(f+k) + f⋅log(f)).
The actual implementation here is O(k⋅(k+f) + f⋅log(f)), as insertion into the list forbidden is O(n). This is easy to fix by replacing that list with a blist sortedlist.
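For example, here is a sketch of that swap using sortedcontainers.SortedList (a maintained alternative to blist; my suggestion, not part of the original code):

from sortedcontainers import SortedList  # third-party: pip install sortedcontainers

forbidden = SortedList([3, 7, 7, 11])  # built in O(f⋅log(f))
pos = forbidden.bisect_right(7)        # O(log(f)), same semantics as bisect.bisect
forbidden.add(9)                       # O(log(f)) insertion, replacing bisect.insort
print(pos, list(forbidden))            # 3 [3, 7, 7, 9, 11]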
I also added some comments, because this algorithm is ridiculously complex. The linear part does the same as the log part, but needs s instead of log²(s) time.
import bisect
import random

def sample(k, end, forbid):
    forbidden = sorted(forbid)
    out = []
    # remove the last block from forbidden if it touches end
    for end in reversed(range(end + 1)):
        if len(forbidden) > 0 and forbidden[-1] == end:
            del forbidden[-1]
        else:
            break
    for i in range(k):
        v = random.randrange(end - len(forbidden) + 1)
        # increase v by the number of values < v
        pos = bisect.bisect(forbidden, v)
        v += pos
        # this number might also be already taken, find the
        # first free spot
        ##### linear
        #while pos < len(forbidden) and forbidden[pos] <= v:
        #    pos += 1
        #    v += 1
        ##### log
        while pos < len(forbidden) and forbidden[pos] <= v:
            step = 2
            # when this is finished, we know that:
            # • forbidden[pos + step/2] <= v + step/2
            # • forbidden[pos + step] > v + step
            # so repeat until (checked by outer loop):
            #   forbidden[pos + step/2] == v + step/2
            while (pos + step <= len(forbidden)) and \
                    (forbidden[pos + step - 1] <= v + step - 1):
                step = step << 1
            pos += step >> 1
            v += step >> 1
        if v == end:
            end -= 1
        else:
            bisect.insort(forbidden, v)
        out.append(v)
    return out
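A quick usage sketch (my numbers; note that end is inclusive here, as the randrange(end - len(forbidden) + 1) call implies):

already_drawn = [2, 3, 5]
new_draws = sample(4, 99, already_drawn)        # 4 fresh values from [0, 99]
assert not set(new_draws) & set(already_drawn)  # avoids the forbidden values
assert len(set(new_draws)) == 4                 # and never repeats itself
print(new_draws)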
Now to compare that to the “hack” (and the default implementation in Python) that Veedrac proposed, which has space O(f+k) and time O(n/(n-(f+k))) per draw, since n/(n-(f+k)) is the expected number of “guesses”:
I just plotted this for k=10 and a reasonably big n=10000 (it only gets more extreme for bigger n). And I have to say: I only implemented this because it seemed like a fun challenge, but even I am surprised by how extreme this is:
Let’s zoom in to see what’s going on:
Yes – the guesses are even faster for the 9998th number you generate. Note that, as you can see in the first plot, even my one-liner is probably faster for bigger f/n (but still has rather horrible space requirements for big n).

To drive the point home: the only thing you are spending time on here is generating the set, as that's the f factor in Veedrac's method.
So I hope my time here was not wasted and I managed to convince you that Veedrac's method is simply the way to go. I can kind of understand why that probabilistic part troubles you, but maybe think of the fact that hashmaps (= Python dicts) and tons of other algorithms work with similar methods, and they seem to be doing just fine.
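For reference, here is a minimal sketch of the rejection-based idea being endorsed (my paraphrase; Veedrac's actual code may differ):

import random

def sample_with_rejection(k, n, sampled_so_far):
    # Draw k new values from range(n), skipping anything seen before.
    seen = set(sampled_so_far)  # the O(f) set construction dominates
    out = []
    while len(out) < k:
        v = random.randrange(n)
        if v not in seen:       # on a collision, simply guess again
            seen.add(v)
            out.append(v)
    return out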
You might be afraid of the variance in the number of repetitions. As noted above, this follows a geometric distribution with p = (n-f)/n. So the standard deviation (= the amount you “should expect” the result to deviate from the expected average) is

σ = √(1-p)/p = √(f⋅n)/(n-f)

which is basically the same as the mean (√(f⋅n) < √(n²) = n).
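A quick numerical sanity check of that claim (my sketch; the numbers are arbitrary):

import numpy as np

n, f = 10_000, 9_000
p = (n - f) / n
guesses = np.random.geometric(p, size=100_000)  # trials until first success
print(guesses.mean())  # ≈ n/(n-f) = 10
print(guesses.std())   # ≈ √(f⋅n)/(n-f) ≈ 9.49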
**edit:** I just realized that s is actually also n/(n-(f+k)). So a more exact runtime for my algorithm is O(k⋅log²(n/(n-(f+k))) + f⋅log(f)). Which is nice, since given the graphs above, it proves my intuition that this is quite a bit faster than O(k⋅log(f+k) + f⋅log(f)). But rest assured that this does not change anything about the results above, as f⋅log(f) is the absolutely dominant part of the runtime.
This is a rewritten version of @necromancer's cool solution. It wraps the logic in a class to make it much easier to use correctly, and uses more dict methods to cut the lines of code.
from random import randrange

class Sampler:
    def __init__(self, n):
        self.n = n  # number remaining from original range(n)
        # i is a key iff i < n and i was already returned;
        # in that case, state[i] is a value to return
        # instead of i.
        self.state = dict()

    def get(self):
        n = self.n
        if n <= 0:
            raise ValueError("range exhausted")
        result = i = randrange(n)
        state = self.state
        # Most of the fiddling here is just to get
        # rid of state[n-1] (if it exists). It's a
        # space optimization.
        if i == n - 1:
            if i in state:
                result = state.pop(i)
        elif i in state:
            result = state[i]
            if n - 1 in state:
                state[i] = state.pop(n - 1)
            else:
                state[i] = n - 1
        elif n - 1 in state:
            state[i] = state.pop(n - 1)
        else:
            state[i] = n - 1
        self.n = n - 1
        return result
Here's a basic driver:
s = Sampler(100)
allx = [s.get() for _ in range(100)]
assert sorted(allx) == list(range(100))
from collections import Counter
c = Counter()
for i in range(6000):
    s = Sampler(3)
    one = tuple(s.get() for _ in range(3))
    c[one] += 1
for k, v in sorted(c.items()):
    print(k, v)
and sample output:
(0, 1, 2) 1001
(0, 2, 1) 991
(1, 0, 2) 995
(1, 2, 0) 1044
(2, 0, 1) 950
(2, 1, 0) 1019
By eyeball, that distribution is fine (run a chi-squared test if you're skeptical). Some of the solutions here don't give each permutation with equal probability (even though they return each k-subset of n with equal probability), so they are unlike random.sample() in that respect.
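If you would rather not eyeball it, a quick check along those lines (assuming SciPy is available; the counts are copied from the run above):

from scipy import stats

observed = [1001, 991, 995, 1044, 950, 1019]  # counts from the sample output
chi2, pvalue = stats.chisquare(observed)      # null hypothesis: all 6 orders equally likely
print(chi2, pvalue)  # a large p-value means no evidence of bias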
Reasonably fast one-liner (O(n + m), where n is the range size and m the old sample size):

next_sample = random.sample(list(set(range(100)).difference(my_sample)), 10)  # list() needed on Python 3.11+
Edit: see cleaner versions below by @TimPeters and @Chronial. A minor edit pushed this to the top.
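For repeated incremental use you can keep extending the exclusion list (a usage sketch with made-up sizes):

import random

sampled = []
for _ in range(5):
    batch = random.sample(list(set(range(100)).difference(sampled)), 10)
    sampled.extend(batch)
assert len(set(sampled)) == 50  # no duplicates across batches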
Here is what I believe is the most efficient solution for incremental sampling. Instead of a list of previously sampled numbers, the state to be maintained by the caller comprises a dictionary that is ready for use by the incremental sampler, plus a count of the numbers remaining in the range.

The following is a demonstrative implementation. Compared to other solutions:

- time per sample: O(log(number_previously_sampled))
- space: O(number_previously_sampled)
Code:

import random

def remove(i, n, state):
    # Treat `state` as a sparse overlay on the virtual array [0, 1, ..., N-1]:
    # state[j] is the value currently sitting at slot j, if it differs from j.
    # Return the value at slot i, then move the value at slot n-1 into slot i
    # (the classic Fisher-Yates step, done lazily through the dict).
    if i == n - 1:
        if i in state:
            t = state[i]
            del state[i]
            return t
        else:
            return i
    else:
        if i in state:
            t = state[i]
            if n - 1 in state:
                state[i] = state[n - 1]
                del state[n - 1]
            else:
                state[i] = n - 1
            return t
        else:
            if n - 1 in state:
                state[i] = state[n - 1]
                del state[n - 1]
            else:
                state[i] = n - 1
            return i
s = dict()
for n in range(100, 0, -1):
    print(remove(random.randrange(n), n, s))