I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary
I came up with a Monte Carlo approach so I can sleep safely while using UUIDs for distributed systems that have to serialize without collisions.
from random import randint
from math import log
from collections import Counter

def colltest(exp):
    # Draw random values from a keyspace of 2**exp until the first repeat,
    # then report how many orders of magnitude (base 2) of draws that took.
    uniques = set()
    while True:
        r = randint(0, 2**exp - 1)
        if r in uniques:
            return round(log(len(uniques) + 1, 2))
        uniques.add(r)

for k, v in sorted(Counter(colltest(20) for _ in range(1000)).items()):
    print(k, "orders of magnitude of events before collision:", v)
would print something like:
5 orders of magnitude of events before collision: 1
6 orders of magnitude of events before collision: 5
7 orders of magnitude of events before collision: 21
8 orders of magnitude of events before collision: 91
9 orders of magnitude of events before collision: 274
10 orders of magnitude of events before collision: 469
11 orders of magnitude of events before collision: 138
12 orders of magnitude of events before collision: 1
I had heard the rule of thumb before: a keyspace of 2**x values lets you store about 2**(x/2) keys before you should expect the first collision (the birthday bound).
Repeated experiments show that over 1000 trials against a 2**20 keyspace, you occasionally get a collision as early as about 2**(x/4) events (order 5 in the run above).
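As a sanity check on that rule of thumb (separate from the simulation above), the standard birthday approximation p ~ 1 - exp(-k*(k-1)/(2*N)) tells the same story; the helper name below is just for illustration:

from math import exp

def birthday_collision_prob(k, n):
    # Approximate probability of at least one collision after k random
    # draws from a keyspace of n equally likely values.
    return 1 - exp(-k * (k - 1) / (2 * n))

keyspace = 2 ** 20
for order in range(5, 13):
    print(order, round(birthday_collision_prob(2 ** order, keyspace), 4))

Around order 10 (the birthday bound, 2**(20/2)) the probability is about 0.39, and even at order 5 (2**(20/4)) it is about 0.0005, which over 1000 trials predicts roughly half a hit, consistent with the single order-5 collision in the output above.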
For uuid4, which has 122 random bits, that means I can sleep safely while several computers pick random UUIDs until I hold about 2**31 items (122/4 is roughly 30.5). Peak load in the system I have in mind is roughly 10-20 events per second; assuming an average of 7, that gives me an operating window of roughly 10 years, even with that extreme paranoia.
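As a back-of-the-envelope check of that window (a sketch using the assumptions above, not a measurement):

import uuid

items = 2 ** 31                  # paranoid collision bound for uuid4's 122 random bits (122/4 ~ 30.5)
rate = 7                         # assumed average events per second
years = items / rate / (3600 * 24 * 365)
print(round(years, 1), "years")  # ~9.7
print(uuid.uuid4())              # each node just keeps drawing random UUIDs like this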