Python random sample generator (comfortable with huge population sizes)

既然无缘 · 2021-01-14 06:57

As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know in advance the size of the sample? You end

5 Answers
  •  自闭症患者
    2021-01-14 07:38

    I wrote (in Python 2.7.9) a random sampler generator (of indexes) whose speed depends only on sample size (it should be O(ns log(ns)), where ns is the sample size). So it is fast when the sample size is small compared to the population size, because it does NOT depend on population size at all. It doesn't build any population collection; it just picks random indexes and uses a kind of bisect method on the sampled indexes to avoid duplicates and keep them sorted. Given an iterable population, here's how to use the itersample generator:

    import random
    sampler=itersample(len(population))
    next_pick=sampler.next() # pick the next random (index of) element
    

    or

    import random
    sampler=itersample(len(population))
    sample=[]
    for index in sampler:
        # do something with (index of) picked element
        sample.append(index) # build a sample
        if some_condition: # stop sampling when needed
            break
    

    If you need the actual elements and not just the indexes, simply index the population with the sampled index when needed (population[sampler.next()] and population[index] in the first and second examples, respectively).
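    For instance, a minimal self-contained sketch of that index-to-element mapping (the small population is hypothetical, and random.sample stands in for the index generator here, since any generator yielding distinct random indexes works the same way):

    ```python
    import random

    population = ['a', 'b', 'c', 'd', 'e']   # hypothetical small population
    # stand-in for itersample: any generator of distinct random indexes
    indexes = iter(random.sample(range(len(population)), len(population)))

    first = population[next(indexes)]         # element for one picked index
    rest = [population[i] for i in indexes]   # elements for the remaining indexes
    ```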

    The results of some tests show that speed does NOT depend on population size, so if you need to randomly pick only 10 elements from a population of 100 billion, you pay only for 10 (remember, we don't know in advance how many elements we'll pick; otherwise you'd better use random.sample).

    Sampling 1000 from 1000000
    Using itersample 0.0324 s
    
    Sampling 1000 from 10000000
    Using itersample 0.0304 s
    
    Sampling 1000 from 100000000
    Using itersample 0.0311 s
    
    Sampling 1000 from 1000000000
    Using itersample 0.0329 s
    

    Other tests confirm that running time is slightly more than linear with sample size:

    Sampling 100 from 1000000000
    Using itersample 0.0018 s
    
    Sampling 1000 from 1000000000
    Using itersample 0.0294 s
    
    Sampling 10000 from 1000000000
    Using itersample 0.4438 s
    
    Sampling 100000 from 1000000000
    Using itersample 8.8739 s
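
    The slightly-more-than-linear growth can also be seen in a simplified sketch of the same idea (my own illustration, not the answer's code): draw a rank among the indexes still unused, then shift it past every previously sampled index at or below it, so duplicates are impossible by construction. The linear shift and list insert are what push the cost past linear in ns.

    ```python
    import bisect
    import random

    def simple_itersample(n):
        """Yield distinct random indexes from range(n), one at a time.

        Simplified sketch: draw a rank among the n-k unused indexes,
        then shift it upward past each already-sampled index <= it.
        """
        sampled = []                       # sorted indexes already yielded
        while len(sampled) < n:
            r = random.randrange(n - len(sampled))
            for s in sampled:              # linear shift (a bisect in the real version)
                if s <= r:
                    r += 1
                else:
                    break
            bisect.insort(sampled, r)      # keep sampled sorted
            yield r
    ```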
    

    Finally, here is the generator function itersample:

    import random
    def itersample(c): # c: population size
        sampled=[]     # sorted list of indexes already yielded
        def fsb(a,b):  # free spaces before middle of interval a,b
            fsb.idx=a+(b+1-a)//2   # // keeps integer division in Python 2 and 3
            fsb.last=sampled[fsb.idx]-fsb.idx if len(sampled)>0 else 0
            return fsb.last
        while len(sampled)<c:
            sample_index=random.randrange(c-len(sampled))
            a,b=0,len(sampled)-1
            if fsb(a,a)>sample_index:
                # picked index falls before every sampled index
                yielding=sample_index
                sampled.insert(0,yielding)
                yield yielding
            elif fsb(b,b)<=sample_index:
                # picked index falls after every sampled index
                yielding=sample_index+len(sampled)
                sampled.append(yielding)
                yield yielding
            else:
                # bisect on sampled indexes to locate the insertion point
                while b-a>1:
                    if fsb(a,b)<=sample_index:
                        a=fsb.idx
                    else:
                        b=fsb.idx
                yielding=sample_index+a+1
                sampled.insert(a+1,yielding)
                yield yielding
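
    As noted above, when the sample size IS known in advance, random.sample is the better tool; in Python 3 it also handles huge index ranges efficiently, since range is a lazy sequence and no population list is ever built:

    ```python
    import random

    random.seed(42)  # seeded only so this sketch is reproducible
    picks = random.sample(range(10**9), 10)  # 10 distinct indexes, no huge list built
    ```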
