Truncating unicode so it fits a maximum size when encoded for wire transfer

借酒劲吻你 2020-12-29 21:57

Given a Unicode string and these requirements:

  • The string must be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
  • The encoded st
5 answers
  •  情话喂你
    2020-12-29 22:25

    This works for UTF-8, if you would like to do it with a regex.

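    import re

    # Byte string that ends with a lone lead byte (an incomplete 2-byte sequence)
    partial = b"\xc2\x80\xc2\x80\xc2"

    # Strip an incomplete multi-byte sequence from the end of the data
    re.sub(rb"([\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", b"", partial)

    # Result: b"\xc2\x80\xc2\x80"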
    

    It covers UTF-8 sequences for code points U+0080 (2 bytes) through U+10FFFF (4 bytes).
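
    To connect the regex back to the original question, here is a minimal sketch of how it could be used to truncate a string to a byte budget. The helper name `truncate_utf8` and the `max_bytes` parameter are hypothetical, and the sketch assumes Python 3 (bytes patterns) with the 4-byte lead class written as \xf0-\xf7:

```python
import re

# Tail pattern for an incomplete UTF-8 sequence, compiled as a bytes regex.
# Assumption: the 4-byte lead class spans \xf0-\xf7.
_PARTIAL_TAIL = re.compile(
    rb"([\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$"
)


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Hypothetical helper: longest prefix of `text` whose UTF-8 form fits in `max_bytes`."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text  # already fits, nothing to cut
    # Cut at the byte limit, then drop a character that was split by the cut.
    cut = _PARTIAL_TAIL.sub(b"", encoded[:max_bytes])
    return cut.decode("utf-8")
```

    For example, `truncate_utf8("héllo", 2)` returns "h", because "é" takes two bytes and only one of them fits.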

    It is really straightforward and follows directly from the UTF-8 encoding scheme.

    Code points U+0080 to U+07FF need 2 bytes: 110yyyxx 10xxxxxx. So if the data ends with a single byte of the form 110yyyxx (0b11000000 to 0b11011111, i.e. [\xc0-\xdf]), the last character is incomplete.

    Code points U+0800 to U+FFFF need 3 bytes: 1110yyyy 10yyyyxx 10xxxxxx. If only the first 1 or 2 of those bytes appear at the end, the character is incomplete. That case matches the pattern [\xe0-\xef][\x80-\xbf]{0,1}

    Code points U+10000 to U+10FFFF need 4 bytes: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx. If only the first 1 to 3 of those bytes appear at the end, the character is incomplete. The lead byte 11110zzz spans 0xF0 to 0xF7, so that case matches the pattern [\xf0-\xf7][\x80-\xbf]{0,2}
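
    The three cases above can also be checked without a regex, by reading the expected sequence length off the lead byte. This is only a sketch of the same idea; the function names `expected_length` and `strip_partial_tail` are made up for illustration, and like the regex it does not validate the rest of the data:

```python
def expected_length(lead: int) -> int:
    """Bytes a UTF-8 sequence starting with `lead` should have (0 if not a lead byte)."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2  # 110yyyxx
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110yyyy
    if 0xF0 <= lead <= 0xF7:
        return 4  # 11110zzz
    return 0      # 10xxxxxx continuation byte, or not a lead byte at all


def strip_partial_tail(data: bytes) -> bytes:
    """Drop an incomplete multi-byte sequence from the end of `data`, if there is one."""
    # Walk back over at most 3 trailing continuation bytes to reach the last lead byte.
    i = len(data) - 1
    while i >= 0 and len(data) - i <= 3 and 0x80 <= data[i] <= 0xBF:
        i -= 1
    if i < 0:
        return data  # empty input, or nothing but continuation bytes
    have = len(data) - i            # bytes actually present from the lead byte on
    need = expected_length(data[i])  # bytes the lead byte promises
    return data[:i] if need > have else data
```

    With the byte string from the example, `strip_partial_tail(b"\xc2\x80\xc2\x80\xc2")` gives `b"\xc2\x80\xc2\x80"`, the same result as the regex.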

    Update:

    If you only need the Basic Multilingual Plane, you can drop the 4-byte pattern. This will do:

    re.sub(rb"([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", b"", partial)
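
    A small sketch of what dropping the 4-byte pattern means in practice, assuming Python 3; the variable names are only for illustration. A truncated 3-byte sequence is still removed, but a truncated 4-byte sequence (a code point outside the BMP) is left in place:

```python
import re

# BMP-only tail pattern: no alternative for 4-byte sequences.
BMP_TAIL = re.compile(rb"([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$")

euro = "\u20ac".encode("utf-8")       # 3 bytes: e2 82 ac (inside the BMP)
emoji = "\U0001F600".encode("utf-8")  # 4 bytes: f0 9f 98 80 (outside the BMP)

# Truncated 3-byte sequence at the end: removed.
bmp_result = BMP_TAIL.sub(b"", b"ab" + euro[:2])

# Truncated 4-byte sequence at the end: not matched, so it stays.
astral_result = BMP_TAIL.sub(b"", emoji[:2])
```

    So the shorter pattern is only safe when the input is known to contain no code points above U+FFFF.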
    

    Let me know if there is any problem with that regex.
