Generate random UTF-8 string in Python

后端 未结 8 1442
清酒与你
清酒与你 2020-12-09 08:38

I\'d like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module?

8条回答
  •  臣服心动
    2020-12-09 08:58

    It depends how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for custom characters. Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?

    The basic approach I'd adopt is randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string) and encode valid code points in UTF-8.

    If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.

    Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.

提交回复
热议问题