Fastest method to generate big random string with lower Latin letters

有些话、适合烂在心里 提交于 2019-11-28 09:57:45

Here's Python 3 code that generates 1000000 "random" lowercase letters in 0.28 seconds (see also 0.11-seconds solution at the end; @Ashwini Chaudhary's code from the question takes 0.55 seconds on my machine, @Markku K.'s code -- 0.53):

#!/usr/bin/env python3
import os
import sys

def write_random_lowercase(n):
    min_lc = ord(b'a')
    len_lc = 26
    ba = bytearray(os.urandom(n))
    for i, b in enumerate(ba):
        ba[i] = min_lc + b % len_lc # convert 0..255 to 97..122
    sys.stdout.buffer.write(ba)

write_random_lowercase(1000000)

% len_lc skews the distribution (see at the end on how to fix it) though It still satisfies the conditions (ascii, lowercase, frequencies of 1, 2, 3 letter sequences):

$ python3 generate-random.py | python3 check-seq.py

where check-seq.py:

#!/usr/bin/env python3
import sys
from collections import Counter
from string import ascii_lowercase

def main():
    limits = [40000, 2000, 100]

    s = sys.stdin.buffer.readline() # a single line
    assert 1000000 <= len(s) <= 1000002 # check length +/- newline
    s.decode('ascii','strict') # check ascii
    assert set(s) == set(ascii_lowercase.encode('ascii')) # check lowercase

    for n, lim in enumerate(limits, start=1):
        freq = Counter(tuple(s[i:i+n]) for i in range(len(s)))
        assert max(freq.values()) <= lim, freq

main()

Note: on acm.timus.ru generate-random.py gives "Output limit exceeded".

To improve performance, you could use bytes.translate() method (0.11 seconds):

#!/usr/bin/env python3
import os
import sys

# make translation table from 0..255 to 97..122
tbl = bytes.maketrans(bytearray(range(256)),
                      bytearray([ord(b'a') + b % 26 for b in range(256)]))
# generate random bytes and translate them to lowercase ascii
sys.stdout.buffer.write(os.urandom(1000000).translate(tbl))

How to fix % len_lc skew

256 (number of bytes) is not evenly divisible by 26 (number of lower Latin letters) therefore the formula min_lc + b % len_lc makes some values appear less often than others e.g.:

#!/usr/bin/env python3
"""Find out skew: x = 97 + y % 26 where y is uniform from [0, 256) range."""
from collections import Counter, defaultdict

def find_skew(random_bytes):
    char2freq = Counter(chr(ord(b'a') + b % 26) for b in random_bytes)
    freq2char = defaultdict(set)
    for char, freq in char2freq.items():
        freq2char[freq].add(char)
    return {f: ''.join(sorted(c)) for f, c in freq2char.items()}

print(find_skew(range(256)))
# -> {9: 'wxyz', 10: 'abcdefghijklmnopqrstuv'}

Here, the input range(256) is uniformly distributed (each byte occurs exactly once) but 'wxyz' letters in the output are less often then the rest 9 vs. 10 occurrences. To fix it, unaligned bytes could be dropped:

print(find_skew(range(256 - (256 % 26))))
# -> {9: 'abcdefghijklmnopqrstuvwxyz'}

Here, the input is uniformly distributed bytes in the range [0, 234) the output is uniformly distributed ascii lowercase letters.

bytes.translate() accepts the second argument to specify bytes to delete:

#!/usr/bin/env python3
import os
import sys

nbytes = 256
nletters = 26
naligned = nbytes - (nbytes % nletters)
tbl = bytes.maketrans(bytearray(range(naligned)),
                      bytearray([ord(b'a') + b % nletters
                                 for b in range(naligned)]))
bytes2delete = bytearray(range(naligned, nbytes))
R = lambda n: os.urandom(n).translate(tbl, bytes2delete)

def write_random_ascii_lowercase_letters(write, n):
    """*write* *n* random ascii lowercase letters."""    
    while n > 0:
        # R(n) expected to drop `(nbytes - nletters) / nbytes` bytes
        # to compensate, increase the initial size        
        n -= write(memoryview(R(n * nbytes // naligned + 1))[:n])

write = sys.stdout.buffer.write
write_random_ascii_lowercase_letters(write, 1000000)

If the random generator (os.urandom here) produces long sequences of the bytes that are outside of the aligned range (>=234) then the while loop may execute many times.

The time performance can be improved by another order of magnitude if random.getrandbits(8*n).to_bytes(n, 'big') is used instead of os.urandom(n). The former uses Mersenne Twister as the core generator that may be faster than os.urandom() that uses sources provided by the operating system. The latter is more secure if you use the random string for secrets.

Use string.ascii_lowercase instead of chr to generate lowercase charaters:

from sys import stdin
from random import choice
from string import ascii_lowercase

s = ''.join([choice(ascii_lowercase) for _ in range(1000000)])
stdout.write(s)

Also writing to stdout directly appears to be faster, encoding yourself in python is not faster than having it all handled in the C code.

I also use a list comprehension; str.join() needs to scan through the input sequence twice, once to determine the length of the output, once to actually copy the input elements to output string. A list comprehension then beats out the slower generator-to-list code.

Just using choice(ascii_lowercase) over your method of generating each character from an integer is over twice as fast:

>>> timeit.timeit('f()', 'from __main__ import yours as f', number=3)
11.299837955011753
>>> timeit.timeit('f()', 'from __main__ import mine as f', number=3)
5.330044150992762

You could try and avoid the ''.join() overhead by writing individual characters directly to stdout:

from sys import stdout
from random import choice
from string import ascii_lowercase

for _ in range(1000000):
    stdout.write(choice(ascii_lowercase))

Next to try is to write raw bytes:

from sys import stdout
from random import choice
from string import ascii_lowercase
bal = [c.encode('ascii') for c in ascii_lowercase]
out = stdout.buffer

for _ in range(1000000):
    out.write(choice(bal))

but these are no improvements over ''.join() in my tests.

Next we move to encoding the ASCII characters to bytes once, then using bytes.join():

from sys import stdout
from random import choice
from string import ascii_lowercase

bal = [c.encode('ascii') for c in ascii_lowercase]
stdout.buffer.write(b''.join([choice(bal) for _ in range(1000000)]))

bal is a list of lowercase ASCII characters encoded to bytes, from which we random pick 1 million items, join them to into a large byte string then write that in one go to the binary stdout buffer.

The bytes join is just as 'slow' as the string version:

>>> timeit.timeit('f()', 'from __main__ import bytes as f', number=3)
5.41390264898655

but we encode 26 characters, not 1 million so the write stage is faster.

Markku K.

I get a huge speed improvement by changing from randint(0,25) to int(random()*25) in your original solution. On my machine, the time went from about 2 seconds, to about 0.6 seconds. If you take a look at the random.py code, you will see that randint is full of checks that you don't want or need.

update: Oops, off by one. You need int(random()*26). Thanks Ashwini

Ashwini Chaudhary

My solution which just got accepted (python 2.7, Execution time: 0.984):

from random import choice
from string import ascii_lowercase

lis = list(ascii_lowercase)
print ''.join(choice(lis) for _ in xrange(1000000)) 

Accessing elements of a list is faster is than for strings.

In [13]: from random import choice

In [14]: from string import ascii_lowercase

In [15]: lis = list(ascii_lowercase)

In [16]: %timeit ''.join(choice(lis) for _ in xrange(10**5))
1 loops, best of 3: 128 ms per loop

In [17]: %timeit ''.join(choice(ascii_lowercase) for _ in xrange(10**5))
1 loops, best of 3: 134 ms per loop

And you don't need stdout or stdin here as most online judges us something like this to test your script:

$python script.py <in.txt >out.txt

So you can use print instead of stdout and raw_input() instead of stdin, though for huge inputs stdin.readline is faster than raw_input().

Update 1:

Using @Markku's tip execution time was reduced to .64 in py2.7:

from random import random
from string import ascii_lowercase

lis = list(ascii_lowercase)
print "".join( [lis[int(random() * 26)] for _ in xrange(1000000)] )

Try turning some part of it into C++ or another compiled language. That will almost guaranteed make it faster. Python, unfortunately, isn't too fast, especially when it comes to things like this. Try C++, C, or Pascal.

EDIT: Also see the Python Performance Tips

Generate and write in chunks that are a larger power of 2 in size.

Maybe use a string or array of the 26 lower-case letters and randomly pick then instead of generating chars.

Use random.choices?

On Python 3.6:

import random
import string

%timeit ''.join(random.choices(string.ascii_lowercase, k=10**6))
1 loop, best of 3: 235 ms per loop
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!