Random sample from a very long iterable, in Python

南笙 2020-12-11 07:58

I have a long Python generator that I want to "thin out" by randomly selecting a subset of values. Unfortunately, random.sample() will not work with arbitrary

5 answers
  •  春和景丽
    2020-12-11 08:26

    Since you know the length of the data returned by your iterable, you can use range() (xrange() in Python 2) to quickly generate indices into your iterable. Then you can just run the iterable until you've grabbed all of the data:

    import random

    def sample(it, length, k):
        # Choose k distinct positions up front and remember each one's output slot.
        chosen = random.sample(range(length), k)
        slots = {index: slot for slot, index in enumerate(chosen)}
        result = [None] * k
        for index, datum in enumerate(it):
            if index in slots:
                result[slots[index]] = datum
        return result

    print(sample(iter("abcd"), 4, 2))
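
    The function above keeps reading the iterable to the end even after every chosen index has been seen. As a rough sketch of stopping early, the variant below cuts the iteration off right after the largest chosen index. The name sample_early_stop is made up for illustration, and it assumes Python 3 and that length really matches what the iterable yields.

```python
import random
from itertools import islice

def sample_early_stop(it, length, k):
    # Hypothetical variant: same index-based idea, but stop reading
    # as soon as the largest chosen index has been consumed.
    chosen = random.sample(range(length), k)
    slots = {index: slot for slot, index in enumerate(chosen)}
    result = [None] * k
    last = max(slots, default=-1)  # -1 when k == 0, so nothing is read
    for index, datum in enumerate(islice(it, last + 1)):
        if index in slots:
            result[slots[index]] = datum
    return result

source = iter(range(10))
print(sample_early_stop(source, 10, 3))
print(list(source))  # items left unread after the last chosen index
```

    On average this only helps a little when k is large, because the largest chosen index tends to sit near the end of the data.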
    

    Alternatively, here is an implementation of reservoir sampling using "Algorithm R":

    import random

    def R(it, k):
        '''https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R'''
        it = iter(it)
        result = []
        for i, datum in enumerate(it):
            if i < k:
                # Fill the reservoir with the first k items.
                result.append(datum)
            else:
                # randint() includes both endpoints, so item i is kept
                # with probability k/(i+1).
                j = random.randint(0, i)
                if j < k:
                    result[j] = datum
        return result

    print(R(iter("abcd"), 2))
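
    One way to sanity-check a reservoir sampler is to count how often each item ends up in the result over many runs. The sketch below does that for a sampler that keeps item i with probability k/(i+1). The names reservoir_sample and inclusion_counts are illustrative, and the seed is fixed only so the run is repeatable.

```python
import random
from collections import Counter

def reservoir_sample(it, k):
    """Keep the first k items, then keep item i with probability k/(i+1)."""
    result = []
    for i, datum in enumerate(it):
        if i < k:
            result.append(datum)
        else:
            j = random.randint(0, i)  # inclusive on both ends
            if j < k:
                result[j] = datum
    return result

def inclusion_counts(population, k, trials):
    """Count how often each item lands in the sample across many trials."""
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(iter(population), k))
    return counts

random.seed(12345)  # assumption: fixed seed, only for repeatability
trials = 20000
counts = inclusion_counts("abcd", 2, trials)
for item in "abcd":
    # Each ratio should come out close to k/n = 2/4.
    print(item, counts[item] / trials)
```

    If one item shows up far more often than the others, the index range passed to randint() is the first thing to check.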
    

    Note that algorithm R doesn't provide a random order for the results. In the example given, 'b' will never precede 'a' in the results.
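
    If the order matters, one possible fix is to shuffle the reservoir before returning it. This is a minimal sketch rather than part of Algorithm R itself; the function name reservoir_sample_shuffled is made up for illustration.

```python
import random

def reservoir_sample_shuffled(it, k):
    """Reservoir-sample k items, then shuffle so the order is random too."""
    result = []
    for i, datum in enumerate(it):
        if i < k:
            result.append(datum)
        else:
            j = random.randint(0, i)  # inclusive on both ends
            if j < k:
                result[j] = datum
    random.shuffle(result)  # in place; removes the positional bias
    return result

print(reservoir_sample_shuffled(iter("abcd"), 2))
```

    The shuffle costs O(k) extra time and no extra passes over the iterable.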
