What is the probability of collision with a 6 digit random alphanumeric code?

*爱你&永不变心* 提交于 2019-12-03 01:54:07

Because of the Birthday Paradox it's more likely than you might think.

There are 2,176,782,336 possible codes, but even inserting just 50,000 rows there is already a quite high chance of a collision. For 1,000,000 rows it is almost inevitable that there will be many collisions (I think about 250 on average).

I ran a few tests and this is the number of codes I could generate before the first collision occurred:

  • 73366
  • 59307
  • 79297
  • 36909

Collisions will become more frequent as the number of codes increases.

Here was my test code (written in Python):

>>> import random
>>> codes = set()
>>> while 1:
        code=''.join(random.choice('1234567890qwertyuiopasdfghjklzxcvbnm')for x in range(6))
        if code in codes: break
        codes.add(code)

>>> len(codes)
36909
Tom Zych

Well, you have 36**6 possible codes, which is about 2 billion. Call this d. Using a formula found here, we find that the probability of a collision, for n codes, is approximately

1 - ((d-1)/d)**(n*(n-1)/2)

For any n over 50,000 or so, that's pretty high.

Looks like a 10-character code has a collision probability of only about 1/800. So go with 10 or more.

csgillespie

As mentioned previously, the birthday paradox makes this event quite likely. In particular, a accurate approximation can be determined when the problem is cast as a collision problem. Let p(n; d) be the probability that at least two numbers are the same, d be the number of combinations and n the number of trails. Then, we can show that p(n; d) is approximately equal to:

1 - ((d-1)/d)^(n*(n-1)/2)

We can easily plot this in R:

> d = 2176782336
> n = 1:100000
> plot(n,1 - ((d-1)/d)^(n*(n-1)/2), type='l')

which gives

As you can see the collision probability increases very quickly with the number of trials/rows

Based on the equations given at http://en.wikipedia.org/wiki/Birthday_paradox#Approximation_of_number_of_people, there is a 50% chance of encountering at least one collision after inserting only 55,000 records or so into a universe of this size:

http://wolfr.am/niaHIF

Trying to insert two to six times as many records will almost certainly lead to a collision. You'll need to assign codes nonrandomly, or use a larger code.

Alexey Frunze

While I don't know the specifics of exactly how you want to use these pseudo-random IDs, you may want to consider generating an array of 3000000 integers (from 1 to 3000000) and randomly shuffling it. That would guarantee that the numbers are unique. See Fisher-Yates shuffle on Wikipedia.

A caution: Beware of relying on the built-in rand where the quality of the pseudo random number generator matters. I recently found out about Math::Random::MT::Auto:

The Mersenne Twister is a fast pseudorandom number generator (PRNG) that is capable of providing large volumes (> 10^6004) of "high quality" pseudorandom data to applications that may exhaust available "truly" random data sources or system-provided PRNGs such as rand.

The module provides a drop in replacement for rand which is handy.

You can generate the sequence of keys with the following code:

#!/usr/bin/env perl

use warnings; use strict;
use Math::Random::MT::Auto qw( rand );

my $SEQUENCE_LENGTH = 1_000_000;
my %dict;
my $picks;

for my $i (1 .. $SEQUENCE_LENGTH) {
    my $pick = pick_one();
    $picks += 1;

    redo if exists $dict{ $pick };

    $dict{ $pick } = undef;
}

printf "Generated %d keys with %d picks\n", scalar keys %dict, $picks;

sub pick_one {
    join '', map { ("A".."Z", 0..9)[rand 36] } 1..6;
}

Some time ago, I wrote about the limited range of built-in rand on Windows. You may not be on Windows, but there might be other limitations or pitfalls on your system.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!