Fastest method to generate big random string with lower Latin letters

后端 未结 8 1906
难免孤独
难免孤独 2020-12-06 05:38

I\'m trying to solve this problem from Timus Online Judge. To solve this problem you need generate a sequence of 1 000 000 lowercase Latin letters and write it to stdin in

8条回答
  •  北荒
    北荒 (楼主)
    2020-12-06 05:52

    Here's Python 3 code that generates 1000000 "random" lowercase letters in 0.28 seconds (see also 0.11-seconds solution at the end; @Ashwini Chaudhary's code from the question takes 0.55 seconds on my machine, @Markku K.'s code -- 0.53):

    #!/usr/bin/env python3
    import os
    import sys
    
    def write_random_lowercase(n):
        min_lc = ord(b'a')
        len_lc = 26
        ba = bytearray(os.urandom(n))
        for i, b in enumerate(ba):
            ba[i] = min_lc + b % len_lc # convert 0..255 to 97..122
        sys.stdout.buffer.write(ba)
    
    write_random_lowercase(1000000)
    

    % len_lc skews the distribution (see at the end on how to fix it) though It still satisfies the conditions (ascii, lowercase, frequencies of 1, 2, 3 letter sequences):

    $ python3 generate-random.py | python3 check-seq.py
    

    where check-seq.py:

    #!/usr/bin/env python3
    import sys
    from collections import Counter
    from string import ascii_lowercase
    
    def main():
        limits = [40000, 2000, 100]
    
        s = sys.stdin.buffer.readline() # a single line
        assert 1000000 <= len(s) <= 1000002 # check length +/- newline
        s.decode('ascii','strict') # check ascii
        assert set(s) == set(ascii_lowercase.encode('ascii')) # check lowercase
    
        for n, lim in enumerate(limits, start=1):
            freq = Counter(tuple(s[i:i+n]) for i in range(len(s)))
            assert max(freq.values()) <= lim, freq
    
    main()
    

    Note: on acm.timus.ru generate-random.py gives "Output limit exceeded".

    To improve performance, you could use bytes.translate() method (0.11 seconds):

    #!/usr/bin/env python3
    import os
    import sys
    
    # make translation table from 0..255 to 97..122
    tbl = bytes.maketrans(bytearray(range(256)),
                          bytearray([ord(b'a') + b % 26 for b in range(256)]))
    # generate random bytes and translate them to lowercase ascii
    sys.stdout.buffer.write(os.urandom(1000000).translate(tbl))
    

    How to fix % len_lc skew

    256 (number of bytes) is not evenly divisible by 26 (number of lower Latin letters) therefore the formula min_lc + b % len_lc makes some values appear less often than others e.g.:

    #!/usr/bin/env python3
    """Find out skew: x = 97 + y % 26 where y is uniform from [0, 256) range."""
    from collections import Counter, defaultdict
    
    def find_skew(random_bytes):
        char2freq = Counter(chr(ord(b'a') + b % 26) for b in random_bytes)
        freq2char = defaultdict(set)
        for char, freq in char2freq.items():
            freq2char[freq].add(char)
        return {f: ''.join(sorted(c)) for f, c in freq2char.items()}
    
    print(find_skew(range(256)))
    # -> {9: 'wxyz', 10: 'abcdefghijklmnopqrstuv'}
    

    Here, the input range(256) is uniformly distributed (each byte occurs exactly once) but 'wxyz' letters in the output are less often then the rest 9 vs. 10 occurrences. To fix it, unaligned bytes could be dropped:

    print(find_skew(range(256 - (256 % 26))))
    # -> {9: 'abcdefghijklmnopqrstuvwxyz'}
    

    Here, the input is uniformly distributed bytes in the range [0, 234) the output is uniformly distributed ascii lowercase letters.

    bytes.translate() accepts the second argument to specify bytes to delete:

    #!/usr/bin/env python3
    import os
    import sys
    
    nbytes = 256
    nletters = 26
    naligned = nbytes - (nbytes % nletters)
    tbl = bytes.maketrans(bytearray(range(naligned)),
                          bytearray([ord(b'a') + b % nletters
                                     for b in range(naligned)]))
    bytes2delete = bytearray(range(naligned, nbytes))
    R = lambda n: os.urandom(n).translate(tbl, bytes2delete)
    
    def write_random_ascii_lowercase_letters(write, n):
        """*write* *n* random ascii lowercase letters."""    
        while n > 0:
            # R(n) expected to drop `(nbytes - nletters) / nbytes` bytes
            # to compensate, increase the initial size        
            n -= write(memoryview(R(n * nbytes // naligned + 1))[:n])
    
    write = sys.stdout.buffer.write
    write_random_ascii_lowercase_letters(write, 1000000)
    

    If the random generator (os.urandom here) produces long sequences of the bytes that are outside of the aligned range (>=234) then the while loop may execute many times.

    The time performance can be improved by another order of magnitude if random.getrandbits(8*n).to_bytes(n, 'big') is used instead of os.urandom(n). The former uses Mersenne Twister as the core generator that may be faster than os.urandom() that uses sources provided by the operating system. The latter is more secure if you use the random string for secrets.

提交回复
热议问题