Fastest way to generate a random-like unique string with random length in Python 3

我寻月下人不归 2020-12-13 14:45

I know how to create a random string, like:

''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(N))

However, there is no guarantee that the strings will be unique, and I also need each string to have a random length, between 12 and 20 characters. What is the fastest way to generate a large number (tens of thousands) of such unique, random-length strings?
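A naive extension of the snippet above, using a set for uniqueness and a random length per key, might look like the sketch below. (This is a reconstruction of the produce_amount_keys baseline timed in the answers, not code from the original post.)

import string
from random import randint
from secrets import choice


def produce_amount_keys(amount_of_keys):
    # Hedged reconstruction: draw a random length for each key, build it one
    # secrets.choice() call at a time, and rely on a set to enforce uniqueness.
    keys = set()
    chars = string.ascii_uppercase + string.digits
    while len(keys) < amount_of_keys:
        keys.add(''.join(choice(chars) for _ in range(randint(12, 20))))
    return keys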

5 Answers
  •  醉酒成梦
    2020-12-13 15:07

    So it's a speed race, is it?

    Building on the work of Martijn Pieters, I've got a solution which cleverly leverages another library for generating random strings: uuid.

    My solution is to generate a uuid4, base64-encode it and uppercase it to get only the characters we're after, then slice it to a random length.

    This works for this case because the lengths we're after (12-20 characters) are all shorter than the shortest base64 encoding of a uuid4. It's also really fast, because uuid is very fast.
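    To see why the slice never runs short, here is a quick check of my own (not part of the benchmarked code): uuid4().hex is 32 hexadecimal characters, and base64-encoding 32 bytes always produces 44 characters, comfortably more than the 20 we might need.

    from base64 import b64encode
    from uuid import uuid4

    # b64encode(32 bytes) -> 4 * ceil(32 / 3) = 44 characters, so slicing off
    # 12-20 characters always has enough material to work with.
    print(len(b64encode(uuid4().hex.encode())))   # 44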

    I also made it a generator instead of a regular function, because they can be more efficient.

    Interestingly, using the standard library's randint function was faster than numpy's.
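    If you want to reproduce that comparison on your own machine, a pair of timeit calls along these lines should do it. (Note that numpy.random.randint excludes its upper bound, unlike random.randint, so the two calls below don't draw from exactly the same range; for timing purposes that doesn't matter.)

    from timeit import timeit

    # Single-value draws are where NumPy's per-call overhead dominates;
    # NumPy only pays off when generating many values in one vectorised call.
    print(timeit("randint(12, 20)", "from random import randint", number=100_000))
    print(timeit("randint(12, 20)", "from numpy.random import randint", number=100_000))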

    Here is the test output:

    Timing 40k keys 10 times with produce_amount_keys
    20.899942063027993
    Timing 40k keys 10 times with produce_amount_keys, stdlib randint
    20.85920040300698
    Timing 40k keys 10 times with uuidgen
    3.852462349983398
    Timing 40k keys 10 times with uuidgen, stdlib randint
    3.136272903997451
    

    Here is the code for uuidgen():

    from base64 import b64encode
    from uuid import uuid4

    import numpy as np


    def uuidgen(count, _randint=np.random.randint):
        generated = set()

        while True:
            if len(generated) == count:
                return

            # Base64-encode the 32-character hex form of a uuid4, uppercase it,
            # and slice it down to a random length.
            candidate = b64encode(uuid4().hex.encode()).upper()[:_randint(12, 20)]
            if candidate not in generated:
                generated.add(candidate)
                yield candidate
    

    And here is the entire project. (At commit d9925d at the time of writing).
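    For reference, consuming the generator is just iteration; something like this usage sketch (not part of the benchmark):

    # uuidgen() yields bytes candidates; decode them if you need str.
    keys = [key.decode() for key in uuidgen(5)]
    print(keys)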


    Thanks to feedback from Martijn Pieters, I've improved the method somewhat, increasing the entropy and making it noticeably faster; in the timings below the new urandomgen() takes roughly 40% less time than uuidgen().

    There is still a lot of entropy lost in casting all lowercase letters to uppercase. If that matters, it may be advisable to use b32encode() instead, whose alphabet has the characters we want minus 0, 1, 8, and 9.
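    A sketch of that b32encode() variant (my own illustration, not code from the repository) might look like this; the Base32 alphabet is A-Z plus 2-7, so there is no case-folding and no '+' or '/' to clean up:

    from base64 import b32encode
    from os import urandom
    from random import randint

    desired_length = randint(12, 20)
    # Base32 packs 5 bits per character -> ceil(desired_length * 5 / 8) bytes.
    candidate = b32encode(urandom((desired_length * 5 + 7) // 8))[:desired_length]
    print(candidate.decode())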

    The new solution reads as follows:

    from base64 import b64encode
    from os import urandom
    from random import choice, randint
    from string import ascii_uppercase, digits

    # ALLOWED_CHARS isn't shown in the original snippet; single-byte uppercase
    # letters and digits are a reasonable definition for replacing any '/'
    # characters left over from the base64 alphabet.
    ALLOWED_CHARS = [c.encode() for c in ascii_uppercase + digits]


    def urandomgen(count):
        generated = set()

        while True:
            if len(generated) == count:
                return

            desired_length = randint(12, 20)

            # # Faster than math.ceil
            # urandom_bytes = urandom(((desired_length + 1) * 3) // 4)
            #
            # candidate = b64encode(urandom_bytes, b'//').upper()
            #
            # The above is rolled into one line to cut down on execution
            # time stemming from the extra local variable assignments.

            candidate = b64encode(
                urandom(((desired_length + 1) * 3) // 4),
                b'//',
            ).upper()[:desired_length]

            while b'/' in candidate:
                candidate = candidate.replace(b'/', choice(ALLOWED_CHARS), 1)

            if candidate not in generated:
                generated.add(candidate)
                yield candidate.decode()
    

    And the test output:

    Timing 40k keys 10 times with produce_amount_keys, stdlib randint
    19.64966493297834
    Timing 40k keys 10 times with uuidgen, stdlib randint
    4.063803717988776
    Timing 40k keys 10 times with urandomgen, stdlib randint
    2.4056471119984053
    

    The new commit in my repository is 5625fd.


    Martijn's comments on entropy got me thinking. The method I used with base64 and .upper() makes letters SO much more common than numbers. I revisited the problem with a more binary mindset.

    The idea was to take the output of os.urandom(), interpret it as a long string of 6-bit unsigned numbers, and use those numbers as indices into a rolling array of the allowed characters. The first 6-bit number would select a character from the range A..Z0..9A..Z01, the second 6-bit number would select a character from the range 2..9A..Z0..9A..T, and so on.

    This has a slight crushing of entropy in that the first character will be slightly less likely to contain 2..9, the second character less likely to contain U..Z0, and so on, but it's so much better than before.
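    As a small worked example of the rolling offset (illustrative numbers only, using the same 576-character table as the code below): the same 6-bit value lands on a different character depending on its position in the key.

    from string import ascii_uppercase, digits

    allowed_chars = (ascii_uppercase + digits) * 16  # 576 characters

    # Suppose two consecutive 6-bit values both happen to be 5.
    print(allowed_chars[(5 + 0 * 0b111111) % 576])  # position 0 -> 'F'
    print(allowed_chars[(5 + 1 * 0b111111) % 576])  # position 1 -> '6'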

    It's slightly faster than uuidgen(), and slightly slower than urandomgen(), as shown below:

    Timing 40k keys 10 times with produce_amount_keys, stdlib randint
    20.440480664998177
    Timing 40k keys 10 times with uuidgen, stdlib randint
    3.430628580001212
    Timing 40k keys 10 times with urandomgen, stdlib randint
    2.0875444510020316
    Timing 40k keys 10 times with bytegen, stdlib randint
    2.8740892770001665
    

    I'm not entirely sure how to eliminate the last bit of entropy crushing; offsetting the start point for the characters would just move the pattern along a little, randomising the offset would be slow, and shuffling the map would still have a period... I'm open to ideas.

    The new code is as follows:

    from os import urandom
    from random import randint
    from string import ascii_uppercase, digits
    
    # Masks (and shift indices) for extracting up to twenty 6-bit values from
    # the largest `urandom_bytes` integer we might generate.
    bitmasks = [(0b111111 << (i * 6), i) for i in range(20)]
    allowed_chars = (ascii_uppercase + digits) * 16  # 576 chars long
    
    
    def bytegen(count):
        generated = set()
    
        while True:
            if len(generated) == count:
                return
    
            # Pick a random key length, then pull enough urandom bytes to
            # cover desired_length 6-bit values (rounded up to whole bytes).
            desired_length = randint(12, 20)
            bytes_needed = (((desired_length * 6) - 1) // 8) + 1
    
            # Endianness doesn't matter.
            urandom_bytes = int.from_bytes(urandom(bytes_needed), 'big')
    
            chars = [
                allowed_chars[
                    (((urandom_bytes & bitmask) >> (i * 6)) + (0b111111 * i)) % 576
                ]
                for bitmask, i in bitmasks
            ][:desired_length]
    
            candidate = ''.join(chars)
    
            if candidate not in generated:
                generated.add(candidate)
                yield candidate
    

    And the full code, along with a more in-depth README on the implementation, is over at de0db8.
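    As with the earlier generators, pulling keys out is just a matter of iterating over it (usage sketch only):

    # bytegen() yields str candidates, each 12-20 characters long.
    for key in bytegen(3):
        print(len(key), key)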

    I tried several things to speed the implementation up, as visible in the repo. Something that would definitely help is a character encoding where the numbers and ASCII uppercase letters are sequential.
