Fastest way to generate a random-like unique string with random length in Python 3


我寻月下人不归 2020-12-13 14:45

I know how to create a random string, like:

''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(N))

However, ther

5 answers

一生所求 (original poster)

2020-12-13 15:00

A simple and fast one:

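import string
import secrets
import numpy as np

def b36(n, N, chars=string.ascii_uppercase + string.digits):
    # Write n as N base-36 digits (least significant digit first).
    s = ''
    for _ in range(N):
        s += chars[n % 36]
        n //= 36
    return s

def produce_amount_keys(amount_of_keys):
    keys = set()  # the set discards duplicates, so every key is unique
    while len(keys) < amount_of_keys:
        N = np.random.randint(12, 20)  # random length from 12 to 19
        keys.add(b36(secrets.randbelow(36**N), N))
    return keys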
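
To see why this approach should have no character bias, it may help to note that `secrets.randbelow(36**N)` draws uniformly from exactly `36**N` integers, and `b36` maps each of them to a different N-character string. The following is a minimal sketch that checks this exhaustively for a small N. The helper name `all_keys` is hypothetical and not part of the answer.

```python
import string
from collections import Counter

CHARS = string.ascii_uppercase + string.digits

def b36(n, N, chars=CHARS):
    # Same digit-extraction loop as in the answer: emit N base-36 digits of n.
    s = ''
    for _ in range(N):
        s += chars[n % 36]
        n //= 36
    return s

def all_keys(N):
    # Hypothetical helper: encode every integer that randbelow(36**N) could return.
    return [b36(n, N) for n in range(36 ** N)]

keys = all_keys(2)
distinct = len(set(keys))
first_counts = Counter(k[0] for k in keys)
last_counts = Counter(k[-1] for k in keys)
print(distinct, set(first_counts.values()), set(last_counts.values()))
```

For N = 2, all 1296 inputs yield distinct keys, and each of the 36 characters appears exactly 36 times in the first position and 36 times in the last. The same argument carries over to larger N, where exhaustive enumeration is no longer practical.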

-- Edit: The below refers to a previous revision of Martijn's answer. After our discussion he added another solution to it, which is essentially the same as mine but with some optimizations. They don't help much, though: it's only about 3.4% faster than mine in my testing, so in my opinion they mostly just complicate things. --

Compared with Martijn's final solution in his accepted answer, mine is a lot simpler, about 1.7 times faster, and not biased:

Stefan
8.246490597876106 seconds.
8 different lengths from 12 to 19
  Least common length 19 appeared 124357 times.
  Most common length 16 appeared 125424 times.
36 different characters from 0 to Z
  Least common character Q appeared 429324 times.
  Most common character Y appeared 431433 times.
36 different first characters from 0 to Z
  Least common first character C appeared 27381 times.
  Most common first character Q appeared 28139 times.
36 different last characters from 0 to Z
  Least common last character Q appeared 27301 times.
  Most common last character E appeared 28109 times.

Martijn
14.253227412021943 seconds.
8 different lengths from 12 to 19
  Least common length 13 appeared 124753 times.
  Most common length 15 appeared 125339 times.
36 different characters from 0 to Z
  Least common character 9 appeared 428176 times.
  Most common character C appeared 434029 times.
36 different first characters from 0 to Z
  Least common first character 8 appeared 25774 times.
  Most common first character A appeared 31620 times.
36 different last characters from 0 to Z
  Least common last character Y appeared 27440 times.
  Most common last character X appeared 28168 times.

Martijn's has a bias in the first character: A appears far too often and 8 far too seldom. I ran my test ten times; his most common first character was always A or B (five times each), and his least common first character was always 7, 8 or 9 (two, three and five times, respectively). I also checked the lengths separately. Length 17 was particularly bad: his most common first character always appeared about 51500 times, while his least common first character appeared about 25400 times.
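
One plausible explanation for this first-character bias, offered here as an assumption rather than something established above, is modulo bias. Keeping only the last `count` base-36 digits of a number drawn uniformly from `256**nbytes` values is the same as reducing it modulo `36**count`, and `256**nbytes` is generally not a multiple of `36**count`, so the lowest residues are hit once more than the rest. The sketch below computes the exact weight of each possible first character under that model. The function name `first_char_weights` is hypothetical, and the byte-count formula is copied from `produce_amount_keys_martijn` in the script that follows.

```python
import math
import string

CHARS = string.ascii_uppercase + string.digits

def first_char_weights(count):
    # Number of random bytes the slicing approach reads for a key of this length.
    nbytes = math.ceil(count / math.log(256, 36))
    total = 256 ** nbytes          # number of equally likely byte strings
    modulus = 36 ** count          # keeping the last `count` digits == value % modulus
    block = 36 ** (count - 1)      # number of residues sharing one leading digit
    full_wraps, rest = divmod(total, modulus)
    weights = {}
    for d, ch in enumerate(CHARS):
        # Residues with leading digit d occupy [d*block, (d+1)*block).
        partial = min(max(rest - d * block, 0), block)
        weights[ch] = full_wraps * block + partial
    return weights, total

weights, total = first_char_weights(17)
for ch in ('A', 'B', 'C', '9'):
    print(ch, round(weights[ch] / total, 4))
```

Under this model, for length 17 the characters A and B get exactly twice the weight of 9, which would be consistent with the roughly 2:1 ratio reported above for that length. Drawing the number with `secrets.randbelow(36**N)` avoids the issue because the range is then an exact multiple of the modulus.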

Fun side note: I'm using the secrets module that Martijn dismissed :-)

My whole script:

import string
import secrets
import numpy as np
import os
from itertools import islice, filterfalse
import math

#------------------------------------------------------------------------------------
#   Stefan
#------------------------------------------------------------------------------------

def b36(n, N, chars=string.ascii_uppercase + string.digits):
    s = ''
    for _ in range(N):
        s += chars[n % 36]
        n //= 36
    return s

def produce_amount_keys_stefan(amount_of_keys):
    keys = set()
    while len(keys) < amount_of_keys:
        N = np.random.randint(12, 20)
        keys.add(b36(secrets.randbelow(36**N), N))
    return keys

#------------------------------------------------------------------------------------
#   Martijn
#------------------------------------------------------------------------------------

def b36encode(b, 
        _range=range, _ceil=math.ceil, _log=math.log, _fb=int.from_bytes, _len=len, _b=bytes,
        _c=(string.ascii_uppercase + string.digits).encode()):
    b_int = _fb(b, 'big')
    length = _len(b) and _ceil(_log((256 ** _len(b)) - 1, 36))
    return _b(_c[(b_int // 36 ** i) % 36] for i in _range(length - 1, -1, -1))

def produce_amount_keys_martijn(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=b36encode, _randint=np.random.randint, _factor=math.log(256, 36)):
        while True:
            count = _randint(12, 20)
            yield _encode(_urandom(math.ceil(count / _factor)))[-count:].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

#------------------------------------------------------------------------------------
#   Needed for Martijn
#------------------------------------------------------------------------------------

def unique_everseen(iterable, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

#------------------------------------------------------------------------------------
#   Benchmark and quality check
#------------------------------------------------------------------------------------

from timeit import timeit
from collections import Counter

def check(name, func):
    print()
    print(name)

    # Get 999999 keys and report the time.
    keys = None
    def getkeys():
        nonlocal keys
        keys = func(999999)
    t = timeit(getkeys, number=1)
    print(t, 'seconds.')

    # Report statistics about lengths and characters
    def statistics(label, values):
        ctr = Counter(values)
        least = min(ctr, key=ctr.get)
        most = max(ctr, key=ctr.get)
        print(len(ctr), f'different {label}s from', min(ctr), 'to', max(ctr))
        print(f'  Least common {label}', least, 'appeared', ctr[least], 'times.')
        print(f'  Most common {label}', most, 'appeared', ctr[most], 'times.')
    statistics('length', map(len, keys))
    statistics('character', ''.join(keys))
    statistics('first character', (k[0] for k in keys))
    statistics('last character', (k[-1] for k in keys))

for _ in range(2):
    check('Stefan', produce_amount_keys_stefan)
    check('Martijn', produce_amount_keys_martijn)
