Random sample from a very long iterable, in Python

南笙 2020-12-11 07:58

I have a long Python generator that I want to "thin out" by randomly selecting a subset of values. Unfortunately, random.sample() will not work with arbitrary iterables.
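
To illustrate the limitation, here is a minimal sketch of what happens when a generator is passed straight to random.sample(). The generator `numbers()` is a hypothetical stand-in for the long generator in question, and the behaviour shown assumes Python 3, where random.sample() expects a sequence.

```python
import random

def numbers():
    # Hypothetical stand-in for a long generator.
    for n in range(1000000):
        yield n

try:
    random.sample(numbers(), 5)
except TypeError as exc:
    # A generator is not a sequence and has no len(), so sampling is refused.
    error = exc
else:
    error = None

print(type(error).__name__)
```

Converting the generator to a list first would work, but it would hold every item in memory at once, which is what a very long iterable makes impractical.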

5 answers
  •  南方客 (OP)
    2020-12-11 08:23

    Use Algorithm R (reservoir sampling, https://en.wikipedia.org/wiki/Reservoir_sampling) to select k random elements from an iterable. It makes a single O(n) pass and keeps only k items in memory:

    import itertools
    import random
    
    def reservoir_sample(iterable, k):
        """Return up to k items chosen uniformly at random from iterable."""
        it = iter(iterable)
        if k <= 0:
            raise ValueError("sample size must be positive")
    
        # Fill the reservoir with the first k items. Because of the shuffle,
        # an input with fewer than k items is returned whole, in random order.
        sample = list(itertools.islice(it, k))
        random.shuffle(sample)
    
        # The i-th item (counting from 1) replaces a random slot in the
        # reservoir with probability k/i, which decreases as i grows.
        for i, item in enumerate(it, start=k + 1):
            j = random.randrange(i)  # uniform integer in [0, i)
            if j < k:
                sample[j] = item
        return sample
    

    Example (the result is random, so the output varies between runs):

    >>> reservoir_sample(iter('abcdefghijklmnopqrstuvwxyz'), 5)
    ['w', 'i', 't', 'b', 'e']
    

    reservoir_sample() code is from this answer.
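
As a rough sanity check, the sketch below runs the same Algorithm R over a generator and then counts how often each item is picked across many trials. The generator `squares()`, the sizes `n`, `k`, `trials`, and the seed are all arbitrary assumptions chosen for illustration. With uniform sampling, every item should show up in about k/n of the samples.

```python
import collections
import itertools
import random

def reservoir_sample(iterable, k):
    # Same Algorithm R as in the answer, repeated so this block runs alone.
    it = iter(iterable)
    sample = list(itertools.islice(it, k))
    random.shuffle(sample)
    for i, item in enumerate(it, start=k + 1):
        j = random.randrange(i)
        if j < k:
            sample[j] = item
    return sample

def squares(n):
    # Hypothetical generator standing in for a long data stream.
    for x in range(n):
        yield x * x

# One pass over the generator, holding only 10 items at a time.
picked = reservoir_sample(squares(100000), 10)
print(len(picked))

# Empirical uniformity check over many independent samples.
n, k, trials = 20, 5, 20000
random.seed(12345)  # arbitrary seed, only to make the check repeatable
counts = collections.Counter()
for _ in range(trials):
    counts.update(reservoir_sample(iter(range(n)), k))
freqs = {item: counts[item] / trials for item in range(n)}

# Both extremes should be close to k/n = 0.25.
print(min(freqs.values()), max(freqs.values()))
```

A check like this does not prove uniformity, but a frequency far from k/n for the earliest or latest items would point to an off-by-one error in the replacement step.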
