Most efficient way of randomly choosing a set of distinct integers

Asked by 囚心锁ツ, 2020-12-01 07:04

I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].

Constraints:

8 Answers
  • 2020-12-01 07:29

    For small values of maxValue, such that it is reasonable to generate an array of all the integers in memory, you can use a variation of the Fisher-Yates shuffle, performing only the first n steps.
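
    A minimal sketch of this partial Fisher-Yates approach in Python (the function name `sample_shuffle` is mine, not from the answer):

```python
import random

def sample_shuffle(n, max_value):
    """Pick n distinct integers from [0..max_value] by running only
    the first n steps of a Fisher-Yates shuffle over the full pool."""
    pool = list(range(max_value + 1))
    for i in range(n):
        # swap a uniformly chosen remaining element into position i
        j = random.randrange(i, max_value + 1)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:n]
```

    After n swaps, the first n slots hold a uniform random n-subset (in uniform random order), so no further work is needed.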


    If n is much smaller than maxValue and you don't wish to generate the entire array then you can use this algorithm:

    1. Keep a sorted list l of the numbers picked so far, initially empty.
    2. Pick a random number x between 0 and maxValue - (number of elements in l).
    3. For each number in l, if it is smaller than or equal to x, add 1 to x.
    4. Insert the adjusted value of x into the sorted list and repeat.

    If n is very close to maxValue then you can randomly pick the elements that aren't in the result and then find the complement of that set.
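
    A sketch of that complement trick, assuming the small excluded set is itself drawn by simple rejection (the function name is mine):

```python
import random

def pick_by_complement(n, max_value):
    """Choose n distinct ints from [0..max_value] when n is close to
    max_value + 1: randomly pick the few values to *exclude* instead
    (here via simple rejection), then return everything else."""
    k = (max_value + 1) - n          # how many values to leave out
    excluded = set()
    while len(excluded) < k:
        excluded.add(random.randint(0, max_value))
    return [x for x in range(max_value + 1) if x not in excluded]
```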


    Here is another algorithm that is simpler but has potentially unbounded execution time:

    1. Keep a set s of the elements picked so far, initially empty.
    2. Pick a number at random between 0 and maxValue.
    3. If the number is not in s, add it to s.
    4. Go back to step 2 until s has n elements.

    In practice, if n is small and maxValue is large, this will be good enough for most purposes.
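
    The rejection scheme above is a few lines of Python (sketch; the function name is mine):

```python
import random

def pick_rejection(n, max_value):
    """Simplest approach: keep drawing until n distinct values are
    collected. Expected time is fine when n is much smaller than
    max_value, but the worst case is unbounded."""
    s = set()
    while len(s) < n:
        s.add(random.randint(0, max_value))
    return s
```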

  • 2020-12-01 07:30

    UPDATE: I am wrong. The output of this is not uniformly distributed. Details on why are here.


    I think the algorithm below is optimal, i.e. you cannot get better performance than this.

    For choosing n numbers out of m numbers, the best algorithm offered so far is presented below. Its worst-case time complexity is O(n), and it needs only a single array to store the original numbers. It partially shuffles the first n elements of the original array, and then you take those first n shuffled numbers as your solution.

    This is also a fully working C program. What you find is:

    • Function getrand: a PRNG wrapper that returns a number from 0 up to (but not including) upto, by rejecting out-of-range values of rand().
    • Function randselect: the function that randomly chooses n unique numbers out of m many numbers. This is what the question is about.
    • Function main: only demonstrates the other functions, so that you can compile it into a program and have fun.
    #include <stdio.h>
    #include <stdlib.h>
    
    int getrand(int upto) {
        long int r;
        do {
            r = rand();
        } while (r >= upto); /* reject values outside [0, upto) */
        return r;
    }
    
    void randselect(int *all, int end, int select) {
        int upto = RAND_MAX - (RAND_MAX % end);
        int binwidth = upto / end;
    
        int c;
        for (c = 0; c < select; c++) {
            /* randomly choose some bin */
            int bin = getrand(upto)/binwidth;
    
            /* swap c with bin */
            int tmp = all[c];
            all[c] = all[bin];
            all[bin] = tmp;
        }
    }
    
    int main() {
        int end = 1000;
        int select = 5;
    
        /* initialize all numbers up to end */
        int *all = malloc(end * sizeof(int));
        int c;
        for (c = 0; c < end; c++) {
            all[c] = c;
        }
    
        /* select select unique numbers randomly */
        srand(0);
        randselect(all, end, select);
        for (c = 0; c < select; c++) printf("%d ", all[c]);
        putchar('\n');
    
        free(all);
        return 0;
    }
    

    Here is the output of an example run where I draw 4-element permutations from a pool of 8 numbers 100,000,000 times, use those draws to estimate the probability of each unique permutation occurring, and then sort the permutations by that probability. The estimates are fairly close to each other, which suggests a uniform distribution. The theoretical probability is 1/1680 = 0.000595238095238095, and the empirical results are close to it.

  • 2020-12-01 07:32

    My solution is the same as Mark Byers'. It takes O(n^2) time, hence it's useful when n is much smaller than maxValue. Here's the implementation in Python:

    import bisect
    import random

    def pick(n, maxValue):
        chosen = []
        for i in range(n):
            r = random.randint(0, maxValue - i)
            # shift r past every already-chosen value at or below it
            for e in chosen:
                if e <= r:
                    r += 1
                else:
                    break
            bisect.insort(chosen, r)
        return chosen
    
  • 2020-12-01 07:36

    If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).

    If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've got so far (let's call that m items picked out of the n you've gone through), and then you take the next number with probability (M-m)/(N-n) and discard it otherwise. You then update m and n appropriately and continue. This is an O(N) algorithm with a low constant cost.
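
    A sketch of that single-pass strategy (Knuth's selection sampling, Algorithm S) in Python; the function name is mine:

```python
import random

def selection_sample(m_target, n_total):
    """Scan items 0..n_total-1 once, taking each with probability
    (M-m)/(N-n), where m items are taken and n examined so far.
    Always returns exactly m_target items, in sorted order."""
    picked = []
    m = 0
    for n in range(n_total):
        # take item n with probability (M-m)/(N-n)
        if random.random() * (n_total - n) < (m_target - m):
            picked.append(n)
            m += 1
            if m == m_target:
                break
    return picked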

    If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to sort M so you can find them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take M/(1-M/N) selections to complete the process. Thus, your cost for this algorithm is approximately O(M*(N/(N-M))*log(M)).

    These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.

    (Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, then you can use the resampling strategy, but pick those numbers to not include; this can be a win, even if you have to push all almost-N numbers around, if your random number generation is expensive.)

  • 2020-12-01 07:39

    Here is an optimal algorithm, assuming that we are allowed to use hashmaps. It runs in O(n) time and space (and not O(maxValue) time, which is too expensive).

    It is based on Floyd's random sample algorithm. See my blog post about it for details. The code is in Java:

    private static Random rnd = new Random();
    
    public static Set<Integer> randomSample(int max, int n) {
        HashSet<Integer> res = new HashSet<Integer>(n);
        int count = max + 1;
        for (int i = count - n; i < count; i++) {
            // pick uniformly from [0, i]
            Integer item = rnd.nextInt(i + 1);
            if (res.contains(item))
                res.add(i);      // already taken: take i instead (Floyd's trick)
            else
                res.add(item);
        }
        return res;
    }
    
  • 2020-12-01 07:44

    The trick is to use a variation of shuffle, or in other words a partial shuffle.

    function random_pick( a, n ) 
    {
      N = len(a);
      n = min(n, N);
      picked = array_fill(0, n, 0); backup = array_fill(0, n, 0);
      // partially shuffle the array, and generate unbiased selection simultaneously
      // this is a variation on fisher-yates-knuth shuffle
      for (i=0; i<n; i++) // O(n) times
      { 
        selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
        value = a[ selected ];
        a[ selected ] = a[ N ];
        a[ N ] = value;
        backup[ i ] = selected;
        picked[ i ] = value;
      }
      // restore partially shuffled input array from backup
      // optional step, if needed it can be ignored
      for (i=n-1; i>=0; i--) // O(n) times
      { 
        selected = backup[ i ];
        value = a[ N ];
        a[ N ] = a[ selected ];
        a[ selected ] = value;
        N++;
      }
      return picked;
    }
    

    NOTE: the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffle), and does not need hashmaps (which may not be available and/or usually hide complexity behind their implementation; e.g. fetch time is not O(1), it might even be O(n) in the worst case).

    adapted from here
