Truncating unicode so it fits a maximum size when encoded for wire transfer

借酒劲吻你 2020-12-29 21:57

Given a Unicode string and these requirements:

The string be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
The encoded st

5 answers

情话喂你 (original poster)

2020-12-29 22:25
This will do for UTF-8, if you'd like to do it with a regex.
```
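import re

# Written for Python 2, where a plain string literal holds raw bytes;
# on Python 3, use bytes literals (b"...") for the pattern, replacement and input.
# The input ends with a lone lead byte \xc2, i.e. a cut-off 2-byte sequence.
partial = "\xc2\x80\xc2\x80\xc2"

# Strip a trailing incomplete 4-, 3- or 2-byte sequence, if there is one.
re.sub("([\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", "", partial)
# returns "\xc2\x80\xc2\x80"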
```
It covers UTF-8 sequences from U+0080 (2 bytes) up to U+10FFFF (4 bytes).
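As a quick check of those byte counts, here is a small Python 3 sketch that encodes the boundary code point of each range and prints the encoded length. The variable names are just illustrative.

```python
# Boundary code points of each UTF-8 length class.
boundaries = ["\u007f", "\u0080", "\u07ff", "\u0800", "\uffff", "\U00010000", "\U0010ffff"]

# Number of bytes each one occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in boundaries]
print(lengths)  # [1, 2, 2, 3, 3, 4, 4]
```

Anything below U+0080 is a single byte and can never be cut in half, which is why the regex only has to deal with the 2-, 3- and 4-byte cases.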

It's really straightforward and follows the UTF-8 encoding scheme directly.

Code points U+0080 to U+07FF need 2 bytes: 110yyyxx 10xxxxxx. So if the data ends with a single byte of the form 110yyyxx (0b11000000 to 0b11011111, i.e. [\xc0-\xdf]), that sequence is partial.

Code points U+0800 to U+FFFF need 3 bytes: 1110yyyy 10yyyyxx 10xxxxxx. If only the first 1 or 2 of those bytes appear at the end, the sequence is partial. This matches the pattern [\xe0-\xef][\x80-\xbf]{0,1}.

Code points U+10000 to U+10FFFF need 4 bytes: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx. If only the first 1 to 3 of those bytes appear at the end, the sequence is partial. This matches the pattern [\xf0-\xf7][\x80-\xbf]{0,2} (the lead byte 11110zzz spans 0xF0 to 0xF7).
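The three cases above can be put together into a small Python 3 helper. This is only a sketch under a few assumptions: `truncate_utf8` and `_PARTIAL_TAIL` are hypothetical names, the 4-byte lead range is taken as 0xF0 to 0xF7 (the 11110zzz form), and `\Z` is used instead of `$` so that a trailing newline byte is not treated specially.

```python
import re

# Matches a multi-byte sequence that is cut off at the very end of the data:
# a lead byte followed by fewer continuation bytes than it announces.
_PARTIAL_TAIL = re.compile(
    rb"(?:[\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]?|[\xc0-\xdf])\Z"
)

def truncate_utf8(text, max_bytes):
    """Hypothetical helper: encode text as UTF-8 and cut it to at most max_bytes."""
    data = text.encode("utf-8")[:max_bytes]
    # Drop a dangling partial sequence left behind by the byte slice, if any.
    return _PARTIAL_TAIL.sub(b"", data)

sample = "a\u20ac\U0001f600"        # 1-byte, 3-byte and 4-byte characters
print(truncate_utf8(sample, 3))     # b'a'
print(truncate_utf8(sample, 7))     # b'a\xe2\x82\xac'
```

With a limit of 3 the slice ends inside the 3-byte euro sign, and with a limit of 7 it ends inside the 4-byte emoji. In both cases the partial tail is removed, so the result still decodes cleanly as UTF-8.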

Update:

If you only need the Basic Multilingual Plane, you can drop the 4-byte pattern. This will do:
```
re.sub("([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)
```
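To illustrate what dropping the 4-byte pattern means in practice, here is a Python 3 sketch (names such as `bmp_tail` are illustrative, and `\Z` stands in for `$`). It shows that the BMP-only pattern still cleans up a cut-off 3-byte sequence but leaves a cut-off 4-byte sequence untouched.

```python
import re

# BMP-only pattern: handles cut-off 2- and 3-byte sequences, not 4-byte ones.
bmp_tail = re.compile(rb"(?:[\xe0-\xef][\x80-\xbf]?|[\xc0-\xdf])\Z")

cut_euro = "\u20ac".encode("utf-8")[:2]       # b"\xe2\x82", a cut-off 3-byte sequence
cut_emoji = "\U0001f600".encode("utf-8")[:3]  # b"\xf0\x9f\x98", a cut-off 4-byte sequence

print(bmp_tail.sub(b"", cut_euro))   # b''
print(bmp_tail.sub(b"", cut_emoji))  # b'\xf0\x9f\x98' (dangling bytes remain)
```

So the shorter regex is only safe when the input is known to contain no characters outside the BMP.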
Let me know if there is any problem with that regex.