Truncating unicode so it fits a maximum size when encoded for wire transfer

借酒劲吻你 2020-12-29 21:57

Given a Unicode string and these requirements:

  • The string must be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
  • The encoded string must not exceed a given maximum byte count
5 Answers
  •  时光取名叫无心
    2020-12-29 22:27

    For JSON formatting (unicode escape, e.g. \uabcd), I am using the following algorithm to achieve this:

    • Encode the Unicode string into the backslash-escape format which it would eventually be in the JSON version
    • Truncate to 3 bytes more than my final limit
    • Use a regular expression to detect and chop off a partial encoding of a Unicode value

    So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:

    # Given some_string is a long string with arbitrary Unicode data.
    import re

    encoded_string = some_string.encode('unicode_escape')
    # Cut at 103 bytes, then strip any partial \uXXXX escape left at the end.
    partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
    final_string   = partial_string.decode('unicode_escape')
    

    Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded.
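    On Python 3 the same idea needs small changes, since `unicode_escape` now round-trips through bytes. A rough port (my own sketch, not the author's code; `truncate_escaped` and its `limit` parameter are hypothetical names):

    ```python
    import re

    def truncate_escaped(some_string, limit):
        # Sketch of a Python 3 port of the recipe above. unicode_escape now
        # produces bytes, so the regex and replacement must be bytes too.
        encoded = some_string.encode('unicode_escape')
        # Cut 3 bytes past the limit, then strip a trailing partial \uXXXX escape.
        partial = re.sub(rb'([^\\])\\(u|$)[0-9a-f]{0,3}$', rb'\1', encoded[:limit + 3])
        return partial.decode('unicode_escape')
    ```

    This inherits the original's limitations: characters escaped as \xHH and non-BMP code points are not handled, and an all-ASCII tail can still overshoot the limit by up to 3 bytes.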

    Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.
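    For the plain UTF-8 case the question also mentions, Python 3 makes this simpler (again my own sketch, not part of this answer; `truncate_utf8` is a hypothetical helper): cut the encoded bytes at the limit and let `errors='ignore'` drop any partial multi-byte sequence left at the cut point.

    ```python
    def truncate_utf8(text, max_bytes):
        """Truncate text so that its UTF-8 encoding fits in max_bytes."""
        encoded = text.encode('utf-8')[:max_bytes]
        # Since the input was just encoded, the only invalid bytes possible
        # are a partial sequence at the very end; errors='ignore' drops it.
        return encoded.decode('utf-8', errors='ignore')
    ```

    Unlike the escape-based recipe, this never overshoots: the result is guaranteed to encode to at most max_bytes bytes.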
