Truncating unicode so it fits a maximum size when encoded for wire transfer

借酒劲吻你 2020-12-29 21:57

Given a Unicode string and these requirements:

  • The string must be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
  • The encoded string must not exceed a given maximum byte count
5 Answers
  •  时光取名叫无心
    2020-12-29 22:27

    For JSON formatting (unicode escape, e.g. \uabcd), I am using the following algorithm to achieve this:

    • Encode the Unicode string into the backslash-escape format which it would eventually be in the JSON version
    • Truncate to 3 bytes more than my final limit
    • Use a regular expression to detect and chop off a partial encoding of a Unicode value

    So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:

    # Given some_string is a long string with arbitrary Unicode data.
    import re

    encoded_string = some_string.encode('unicode_escape')
    # Cut at 103 bytes, then strip any partial \uXXXX escape left at the end.
    partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
    final_string   = partial_string.decode('unicode_escape')
    

    Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded.
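    On Python 3 the same idea needs small changes, since `unicode_escape` now round-trips through bytes. A rough port (my own sketch, not the author's code; `truncate_escaped` and its `limit` parameter are hypothetical names):

    ```python
    import re

    def truncate_escaped(some_string, limit):
        # Sketch of a Python 3 port of the recipe above. unicode_escape now
        # produces bytes, so the regex and replacement must be bytes too.
        encoded = some_string.encode('unicode_escape')
        # Cut 3 bytes past the limit, then strip a trailing partial \uXXXX escape.
        partial = re.sub(rb'([^\\])\\(u|$)[0-9a-f]{0,3}$', rb'\1', encoded[:limit + 3])
        return partial.decode('unicode_escape')
    ```

    This inherits the original's limitations: characters escaped as \xHH and non-BMP code points are not handled, and an all-ASCII tail can still overshoot the limit by up to 3 bytes.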

    Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.
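    For the plain UTF-8 case the question also mentions, Python 3 makes this simpler (again my own sketch, not part of this answer; `truncate_utf8` is a hypothetical helper): cut the encoded bytes at the limit and let `errors='ignore'` drop any partial multi-byte sequence left at the cut point.

    ```python
    def truncate_utf8(text, max_bytes):
        """Truncate text so that its UTF-8 encoding fits in max_bytes."""
        encoded = text.encode('utf-8')[:max_bytes]
        # Since the input was just encoded, the only invalid bytes possible
        # are a partial sequence at the very end; errors='ignore' drops it.
        return encoded.decode('utf-8', errors='ignore')
    ```

    Unlike the escape-based recipe, this never overshoots: the result is guaranteed to encode to at most max_bytes bytes.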
