I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary
I came up with a Monte Carlo approach so I can sleep safely while using UUIDs for distributed systems that have to serialize without collisions.
from random import randint
from math import log
from collections import Counter

def colltest(exp):
    # Draw random values from a keyspace of 2**exp until the first repeat,
    # then report how many orders of magnitude (base 2) of draws that took.
    uniques = set()
    while True:
        r = randint(0, 2**exp - 1)
        if r in uniques:
            return round(log(len(uniques) + 1, 2))
        uniques.add(r)

for k, v in sorted(Counter(colltest(20) for _ in range(1000)).items()):
    print(k, "orders of magnitude of events before collision:", v)
would print something like:
5 orders of magnitude of events before collision: 1
6 orders of magnitude of events before collision: 5
7 orders of magnitude of events before collision: 21
8 orders of magnitude of events before collision: 91
9 orders of magnitude of events before collision: 274
10 orders of magnitude of events before collision: 469
11 orders of magnitude of events before collision: 138
12 orders of magnitude of events before collision: 1
I had heard the rule of thumb before: a keyspace of 2**x values lets you store about 2**(x/2) keys before you should expect the first collision (the birthday bound).
Repeated experiments show that over 1000 trials against a 2**20 keyspace, you occasionally get a collision as early as about 2**(x/4) events (order 5 in the run above).
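As a sanity check on that rule of thumb (separate from the simulation above), the standard birthday approximation p ~ 1 - exp(-k*(k-1)/(2*N)) tells the same story; the helper name below is just for illustration:

from math import exp

def birthday_collision_prob(k, n):
    # Approximate probability of at least one collision after k random
    # draws from a keyspace of n equally likely values.
    return 1 - exp(-k * (k - 1) / (2 * n))

keyspace = 2 ** 20
for order in range(5, 13):
    print(order, round(birthday_collision_prob(2 ** order, keyspace), 4))

Around order 10 (the birthday bound, 2**(20/2)) the probability is about 0.39, and even at order 5 (2**(20/4)) it is about 0.0005, which over 1000 trials predicts roughly half a hit, consistent with the single order-5 collision in the output above.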
For uuid4, which has 122 random bits, that means I can sleep safely while several computers pick random UUIDs until I hold about 2**31 items (122/4 is roughly 30.5). Peak load in the system I have in mind is roughly 10-20 events per second; assuming an average of 7, that gives me an operating window of roughly 10 years, even with that extreme paranoia.
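As a back-of-the-envelope check of that window (a sketch using the assumptions above, not a measurement):

import uuid

items = 2 ** 31                  # paranoid collision bound for uuid4's 122 random bits (122/4 ~ 30.5)
rate = 7                         # assumed average events per second
years = items / rate / (3600 * 24 * 365)
print(round(years, 1), "years")  # ~9.7
print(uuid.uuid4())              # each node just keeps drawing random UUIDs like this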