why does this simple shuffle algorithm produce biased results? what is a simple reason?

旧时难觅i 2020-11-27 03:17

it seems that this simple shuffle algorithm will produce biased results:

# suppose $arr is filled with 1 to 52

for ($i = 0; $i < 52; $i++) { 
  $j = r         


        
12 answers
  •  旧巷少年郎
    2020-11-27 03:24

    an illustrative approach might be this:

    1) consider only 3 cards.

    2) for the algorithm to give evenly distributed results, the chance of "1" ending up as a[0] must be 1/3, and the chance of "2" ending up in a[1] must be 1/3 too, and so forth.

    3) so if we look at the second algorithm:

    probability that "1" ends up at a[0]: when 0 is the random number generated, so 1 case out of (0,1,2), therefore, is 1 out of 3 = 1/3

    probability that "2" ends up at a[1]: when it didn't get swapped to a[0] the first time, and it didn't get swapped to a[2] the second time: 2/3 * 1/2 = 1/3

    probability that "3" ends up at a[2]: when it didn't get swapped to a[0] the first time, and it didn't get swapped to a[1] the second time: 2/3 * 1/2 = 1/3

    they are all perfectly 1/3, and we don't see any error here.

    4) if we try to calculate the probability of "1" ending up as a[0] in the first algorithm, the calculation is a bit long, but as the illustration in lassevk's answer shows, it is 9/27 = 1/3. "2" ending up as a[1] has a chance of 8/27, and "3" ending up as a[2] has a chance of 9/27 = 1/3.

    as a result, the chance of "2" ending up as a[1] is not 1/3, so the algorithm produces a noticeably skewed result (an error of 1/27, about 3.7%, as opposed to a negligible one such as 3/10000000000000 = 0.00000000003%)

    5) the proof that Joel Coehoorn has can actually prove that some cases will be over-represented. I think the explanation of why it is n^n is this: at each iteration, there are n possible values for the random number, so after n iterations there are n^n equally likely cases, which is 27 for n = 3. This number is not evenly divisible by the number of permutations (n! = 3! = 6), so some results must be over-represented. for n = 3 they are over-represented in such a way that instead of showing up 4 times out of 27, they show up 5 times, so if you shuffle 3 cards 27 million times from the same initial order, an over-represented case will show up about 5 million times as opposed to about 4 million times, which is quite a big difference.

    6) i think the over-representation is shown, but "why" will the over-representation happen?

    7) a key test for the algorithm to be correct is that every number has a 1/n probability of ending up in any given slot.
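The 3-card numbers above can be checked by brute force. The following is only a sketch in Python, not the code from the question: it assumes the first algorithm picks `j` from all n indices at every step and the second picks `j` only from `i` to `n-1`, which is what the probabilities in steps 3 and 4 imply. The function names `naive_outcomes`, `fisher_yates_outcomes` and `stays_put` are made up for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def naive_outcomes(n):
    """All equally likely results when every step may pick any index (assumed first algorithm)."""
    results = []
    # One index from range(n) is drawn per iteration, giving n**n pick sequences.
    for picks in product(range(n), repeat=n):
        a = list(range(1, n + 1))
        for i, j in enumerate(picks):
            a[i], a[j] = a[j], a[i]
        results.append(tuple(a))
    return results


def fisher_yates_outcomes(n):
    """All equally likely results when step i only picks j from i..n-1 (assumed second algorithm)."""
    results = []
    # Step i has n - i choices, giving n! pick sequences in total.
    for picks in product(*(range(i, n) for i in range(n))):
        a = list(range(1, n + 1))
        for i, j in enumerate(picks):
            a[i], a[j] = a[j], a[i]
        results.append(tuple(a))
    return results


def stays_put(results, k):
    """Exact probability that card k+1 ends up in slot k."""
    hits = sum(1 for r in results if r[k] == k + 1)
    return Fraction(hits, len(results))


naive = naive_outcomes(3)
fair = fisher_yates_outcomes(3)

print(len(naive), len(fair))                    # number of equally likely runs
print([stays_put(naive, k) for k in range(3)])  # slot by slot, unrestricted picks
print([stays_put(fair, k) for k in range(3)])   # slot by slot, restricted picks
print(sorted(Counter(naive).values()))          # how often each permutation occurs
print(sorted(Counter(fair).values()))
```

Under these assumptions the unrestricted version has 27 runs that cannot be split evenly over 6 permutations, while the restricted version has exactly 6 runs, one per permutation.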
