How do I assess the hash collision probability?

情歌与酒 2020-11-27 03:02

I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary

5 answers
  •  余生分开走
    2020-11-27 03:45

    I came up with a Monte Carlo approach so that I can sleep safely while using UUIDs in distributed systems that have to serialize without collisions.

    from collections import Counter
    from math import log2
    from random import randrange
    
    def colltest(exp):
        """Draw random values from a 2**exp keyspace until one repeats.
    
        Returns floor(log2(n)), where n is the draw that caused the collision.
        """
        seen = set()  # a set gives O(1) membership tests
        while True:
            r = randrange(2**exp)  # uniform over exactly 2**exp values
            if r in seen:
                return int(log2(len(seen) + 1))
            seen.add(r)
    
    # Histogram of floor(log2(draws until first collision)) over 1000 trials
    for k, v in sorted(Counter(colltest(20) for _ in range(1000)).items()):
        print(k, "hash orders of magnitude events before collision:", v)
    

    This prints something like:

    5 hash orders of magnitude events before collision: 1
    6 hash orders of magnitude events before collision: 5
    7 hash orders of magnitude events before collision: 21
    8 hash orders of magnitude events before collision: 91
    9 hash orders of magnitude events before collision: 274
    10 hash orders of magnitude events before collision: 469
    11 hash orders of magnitude events before collision: 138
    12 hash orders of magnitude events before collision: 1
    

    I had heard the rule of thumb before: if you need to store about 2**(x/2) keys, use a hash function whose keyspace is at least 2**x (the square-root, or birthday, bound).

    Repeated experiments show that across 1000 trials in a 20-bit keyspace (x = 20), a collision sometimes occurs after only about 2**(x/4) draws, here 2**5 = 32.
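
    As a cross-check on the simulation, the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) gives the collision probability directly for n uniform draws from a keyspace of size N. The sketch below is not part of the original experiment; the function names are made up for illustration, and it assumes the keys are uniformly random.

```python
from math import expm1, log1p, sqrt

def collision_probability(n_keys, bits):
    """Approximate P(at least one collision) among n_keys uniform draws
    from a keyspace of 2**bits values (birthday approximation)."""
    space = 2 ** bits
    # expm1 keeps precision when the probability is extremely small
    return -expm1(-n_keys * (n_keys - 1) / (2 * space))

def keys_for_probability(p, bits):
    """Approximate number of draws at which the collision probability reaches p."""
    return sqrt(2 * (2 ** bits) * -log1p(-p))

# 20-bit keyspace, as in the simulation above
print(collision_probability(32, 20))    # chance of a collision within 2**5 draws
print(keys_for_probability(0.5, 20))    # draws for a 50% chance of collision
# 122 random bits (uuid4) and 2**31 items
print(collision_probability(2**31, 122))
```

    For the 20-bit case the 50% point lands near 1200 draws, which agrees with the histogram peaking at the 2**10 bucket.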

    For uuid4, which has 122 random bits, that means I can sleep safely while several computers pick random UUIDs until I have about 2**31 items. Peak load in the system I have in mind is roughly 10-20 events per second; I'm assuming an average of 7. That gives me an operating window of roughly 10 years, even with that extreme paranoia.
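
    The arithmetic behind that window can be checked with a few lines. This is a sketch under the figures stated above (a budget of 2**31 items, presumably 122/4 = 30.5 bits rounded up, and an average of 7 events per second); the helper name and the 365.25-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length

def operating_window_years(max_items, events_per_second):
    """Years until max_items IDs have been issued at a constant average rate."""
    return max_items / events_per_second / SECONDS_PER_YEAR

print(round(operating_window_years(2**31, 7), 1))   # average rate -> 9.7
print(round(operating_window_years(2**31, 20), 1))  # sustained peak rate
```

    At the assumed average the budget lasts just under 10 years; a sustained peak rate of 20 events per second would shorten it to a few years.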
