Python random sample generator (comfortable with huge population sizes)

既然无缘 2021-01-14 06:57

As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know the size of the sample in advance? You end

5 Answers
  •  既然无缘
    2021-01-14 07:39

    You can get a sample of size K out of a population of size N by picking K non-repeating random numbers in the range [0, N) and treating them as indexes.

    Option a)

    You could generate such an index sample using the well-known sample method.

    random.sample(xrange(N), K)
    

    From the Python docs about random.sample:

    To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population
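A minimal sketch of Option a) in Python 3, where `range` plays the role that `xrange` had in Python 2. The `population` list and the value of `K` are made up for illustration; the point is only that the indexes are drawn without materializing the whole index range.

```python
import random

# Hypothetical population; any sequence that supports len() and indexing works.
population = [f"item-{i}" for i in range(1_000_000)]
K = 5  # sample size, which must be known up front for this option

# range() is lazy, so no list of one million indexes is built here.
indexes = random.sample(range(len(population)), K)

# Treat the non-repeating indexes as positions in the population.
sample = [population[i] for i in indexes]
```

Note that K still has to be fixed before the call, which is the limitation Option b) addresses.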

    Option b)

    If you don't like the fact that random.sample already returns a list instead of a lazy generator of non-repeating random numbers, you can go fancy with Format-Preserving Encryption to encrypt a counter.

    This way you get a real generator of random indexes, and you can pick as many as you want and stop at any time, without getting any duplicates, which gives you dynamically sized sample sets.

    The idea is to construct an encryption scheme that encrypts the numbers from 0 to N-1. Each time you want a sample from your population, you pick a random key for the encryption and start encrypting the numbers 0, 1, 2, ... onwards (this is the counter). Since every good encryption creates a random-looking 1:1 mapping, you end up with non-repeating random integers you can use as indexes. The storage requirements during this lazy generation are just the initial key plus the current value of the counter.
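The counter-encryption idea can be sketched without any external snippet. The code below is an illustrative stand-in, not a standardized format-preserving encryption scheme and not the linked snippet: it assumes a small balanced Feistel network with a BLAKE2b-based round function (both are my choices for the example), and the names `permute` and `iter_sample` are hypothetical. A Feistel network is a bijection on its domain, so encrypting 0, 1, 2, ... never repeats an output; values that fall outside the population are skipped.

```python
import hashlib
import itertools
import random


def _round_value(key: bytes, round_no: int, value: int, mask: int) -> int:
    """Keyed round function built from BLAKE2b (an illustrative choice)."""
    data = round_no.to_bytes(1, "big") + value.to_bytes(8, "big")
    digest = hashlib.blake2b(data, key=key, digest_size=8).digest()
    return int.from_bytes(digest, "big") & mask


def permute(i: int, key: bytes, half_bits: int, rounds: int = 4) -> int:
    """Balanced Feistel network: a bijection on [0, 2**(2 * half_bits))."""
    mask = (1 << half_bits) - 1
    left, right = i >> half_bits, i & mask
    for r in range(rounds):
        left, right = right, left ^ _round_value(key, r, right, mask)
    return (left << half_bits) | right


def iter_sample(population):
    """Lazily yield elements of `population` in a keyed pseudo-random order."""
    n = len(population)
    # Smallest even bit width whose domain covers all n indexes.
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    # Fresh random key per call; only the key and the counter are stored.
    key = random.getrandbits(128).to_bytes(16, "big")
    for counter in range(1 << (2 * half_bits)):
        index = permute(counter, key, half_bits)
        if index < n:  # skip outputs that are not valid indexes
            yield population[index]


# Usage: draw elements lazily and stop whenever you like.
first_five = list(itertools.islice(iter_sample(range(10**12)), 5))
```

Because the domain is rounded up to an even number of bits, fewer than three out of four counter values are skipped in the worst case. How random the resulting order looks depends on the round function and the number of rounds, so treat this as a demonstration of the mechanism rather than a vetted cipher.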

    The idea was already discussed in Generating non-repeating random numbers in Python. There is even a Python snippet linked: formatpreservingencryption.py

    Sample code using this snippet could look like this:

    import random
    import string

    # Assumption: the linked snippet is saved as formatpreservingencryption.py
    # and provides the FPEInteger class used below.
    from formatpreservingencryption import FPEInteger

    def itersample(population):
        # Get the size of the population
        N = len(population)
        # Get the number of bits needed to represent the largest index, N-1
        bits = (N-1).bit_length()
        # Generate a random 32-character key
        key = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(32))
        # Create a new crypto instance that encrypts binary blocks of width `bits`,
        # thus being able to encrypt all numbers up to the nearest power of two
        crypter = FPEInteger(key=key, radix=2, width=bits)
    
        # Count up 
        for i in xrange(1<
