Generate random UTF-8 string in Python

清酒与你 2020-12-09 08:38

I'd like to test the Unicode handling of my code. Is there anything I can pass to random.choice() to select from the entire Unicode range, preferably without using an external module?

8 Answers
  •  情话喂你
    2020-12-09 08:56

    Here is an example function that should create a random well-formed UTF-8 byte sequence, following Table 3–7 (Well-Formed UTF-8 Byte Sequences) of the Unicode Standard 5.0.0:

    #!/usr/bin/env python3
    
    # From Table 3–7 of the Unicode Standard 5.0.0
    
    import random
    
    def byte_range(first, last):
        return list(range(first, last+1))
    
    first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
    trailing_values = byte_range(0x80, 0xBF)
    
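    def random_utf8_seq():
        # The lead byte determines the sequence length and the allowed
        # range of the second byte.
        first = random.choice(first_values)
        if first <= 0x7F:
            # U+0000..U+007F: single byte
            return bytes([first])
        elif first <= 0xDF:
            # U+0080..U+07FF: two bytes
            return bytes([first, random.choice(trailing_values)])
        elif first == 0xE0:
            # Second byte limited to A0..BF to exclude overlong forms
            return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
        elif first == 0xED:
            # Second byte limited to 80..9F to exclude surrogates (U+D800..U+DFFF)
            return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
        elif first <= 0xEF:
            # Remaining three-byte sequences
            return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
        elif first == 0xF0:
            # Second byte limited to 90..BF to exclude overlong forms
            return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
        elif first <= 0xF3:
            # U+40000..U+FFFFF: four bytes
            return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
        elif first == 0xF4:
            # Second byte limited to 80..8F to stay at or below U+10FFFF
            return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])
    
    # Decode ten random sequences and print them as one string
    print("".join(random_utf8_seq().decode("utf-8") for _ in range(10)))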
    

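    Because the Unicode range is so large, I have not been able to test this thoroughly. Also note that the resulting characters are not uniformly distributed: each byte is chosen uniformly from its allowed range, so code points with short encodings (ASCII in particular) come up far more often than their share of the code space.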
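
    If a uniform distribution over characters matters for the test, one alternative is to pick the code point first and let Python do the encoding. The following is only a minimal sketch of that idea, not part of the answer above: the names random_scalar_value and random_unicode_string are made up for illustration, and it assumes that "the entire Unicode range" means every Unicode scalar value (U+0000..U+10FFFF minus the surrogates), including unassigned code points, control characters, and noncharacters.

```python
import random

# Surrogate code points U+D800..U+DFFF are not Unicode scalar values and
# cannot appear in well-formed UTF-8, so they are skipped.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
SURROGATE_COUNT = SURROGATE_LAST - SURROGATE_FIRST + 1
MAX_CODE_POINT = 0x10FFFF


def random_scalar_value(rng=random):
    """Return a uniformly chosen Unicode scalar value as an int.

    Hypothetical helper: rng is anything with a randint(a, b) method,
    such as the random module or a random.Random instance.
    """
    # Draw from a range shortened by the size of the surrogate gap, then
    # shift every value at or above the gap past it.
    cp = rng.randint(0, MAX_CODE_POINT - SURROGATE_COUNT)
    if cp >= SURROGATE_FIRST:
        cp += SURROGATE_COUNT
    return cp


def random_unicode_string(length, rng=random):
    """Return a str of `length` uniformly chosen scalar values."""
    return "".join(chr(random_scalar_value(rng)) for _ in range(length))


if __name__ == "__main__":
    s = random_unicode_string(10)
    encoded = s.encode("utf-8")
    # Round-trip check: the encoded bytes decode back to the same string.
    assert encoded.decode("utf-8") == s
    print(repr(s), encoded)
```

    Passing a seeded generator, for example random_unicode_string(10, random.Random(42)), makes a failing test case reproducible. Since most of the code space is unassigned or outside the Basic Multilingual Plane, the output consists mostly of characters that fonts cannot display, which is usually fine for exercising encoding and decoding paths.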
