CRC32 Collision Probability

Question

I've done quite a bit of checking up on other questions and I'm still uncertain on the issue.

Here's my use case:

I have an online shopping cart. Occasionally, certain clients find the ordering process too tedious, or an online order will not cut it for them, and they need an actual PDF estimate (quote) in order to purchase a product.

So I coded a module that takes the shopping cart contents and lays them out neatly as a PDF estimate.

Now because this process only uses the cart contents, and nothing else is used, not even the database, I have to create a unique Estimate document number, so that should the client pay for the quote, they have a reference to use in their payment instruction.

The shopping cart currently generates a 5 digit cart ID, unique to each customer based on their session. I've taken this 5 digit cart ID, and I've then added UNIX time to it, which gives me a nice long number to use as the Estimate document number.

So I end up with something like this: 363821482812537 [36382 is the cart ID and 1482812537 is unix time at the time the PDF estimate was generated]

The problem with this is that it is too long, and WILL be an issue as bank payment references are limited. Ideally, I'd like to keep it to 10 characters or less.

I've decided to look at CRC32 to shorten the generated estimate numbers, and it seems capable of shortening the estimate number to an acceptable amount of characters.

But, can anyone shed some light on what kind of collision I might be up against?

A few things to consider:

Cart ID will always be 5 digits.
Unix time will always be 10 digits up until the year 2286.

[So we will always end up with 15 digits that need to be encoded, and no more]

There is a safeguard in place: if by some chance a duplicate occurs, an error is thrown and the option is provided to retry and generate the estimate again. This works by saving the estimate to a filename matching the estimate number (or in this case, the CRC32 hash of the estimate number), after first checking whether a file with that name already exists.
Customers will for the moment not be allowed to generate estimates themselves, for reasons not important to my question. So it will only be admins who can generate estimates.

My concern is simple: will I find myself running into collisions very often with my 15-digit to CRC32 hash encoding, or is it going to be pretty rare to run into collisions?

Answer 1:

Why not just maintain an estimate number that you simply increment each time you need a new one? You are already effectively maintaining a list of used numbers to check against for collisions, so just put your counter there. Then you only need to look at one thing instead of n things. By taking the CRC, you are discarding information you might try to extract from the estimate number, so there was no point in making the ID out of that information in the first place. Your approach seems way more complicated than it needs to be.
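As a rough illustration of the counter idea, here is a minimal Python sketch. The file name and the function `next_estimate_number` are hypothetical, the counter is assumed to live in a small text file next to the saved estimates, and no locking is done, on the assumption that only one admin generates an estimate at a time.

```python
import os


def next_estimate_number(counter_path):
    """Return the next estimate number and persist it in counter_path.

    counter_path is a hypothetical text file holding the last number issued.
    """
    try:
        with open(counter_path, "r", encoding="ascii") as f:
            last = int(f.read().strip())
    except FileNotFoundError:
        last = 0  # no estimates issued yet

    current = last + 1

    # Write to a temporary file and rename it over the old one, so a crash
    # mid-write cannot leave a half-written counter behind.
    tmp_path = counter_path + ".tmp"
    with open(tmp_path, "w", encoding="ascii") as f:
        f.write(str(current))
    os.replace(tmp_path, counter_path)
    return current
```

Calling `next_estimate_number("estimate_counter.txt")` repeatedly would yield 1, 2, 3, and so on, and the result stays well under 10 characters for any realistic number of estimates.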

The probability that any two given estimates collide is 2^-32. Since a CRC does a very good job of mixing the bits, the data content doesn't matter, so long as it's more than 32 bits, which it is in this case. However, a new estimate has n chances at a collision if you have previously generated n estimates, so as n grows, the chance of a collision grows accordingly. (See the Birthday Problem.) As a result, after only 77,164 estimates there is a 50% probability that at least two of their hashes have collided.
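The 50% figure can be checked with the usual birthday-problem approximation, P(n) ≈ 1 - exp(-n(n-1) / (2 * 2^32)). The sketch below assumes CRC32 outputs behave like uniformly random 32-bit values; the function names are made up for illustration.

```python
import math

CRC32_SPACE = 2 ** 32  # number of distinct 32-bit CRC values


def collision_probability(n, space=CRC32_SPACE):
    """Approximate probability that at least two of n random values collide."""
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * space)).
    # -expm1(-x) computes 1 - exp(-x) without losing precision for small x.
    return -math.expm1(-n * (n - 1) / (2 * space))


def smallest_n_for(p, space=CRC32_SPACE):
    """Smallest n whose approximate collision probability reaches p."""
    n = 1
    while collision_probability(n, space) < p:
        n += 1
    return n


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 77_164):
        print(n, collision_probability(n))
    print(smallest_n_for(0.5))  # 77164 under this approximation
```

For small volumes the risk is tiny: at 1,000 estimates the approximation gives roughly a 0.01% chance that any collision has occurred.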

Source: https://stackoverflow.com/questions/41339204/crc32-collision-probability

Tags

crc32

hash-collision