How to get a reliable unicode character count in Python?


一个人的身影 2021-01-05 01:48

Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datasto

2 answers

耶瑟儿～ (original poster)

2021-01-05 02:27
> I know I can just encode it to UTF-8 and then decode again

Yes, that's the usual idiom for fixing up “UTF-16 surrogates in a UCS-4 string” input. But as Mechanical snail said, this input is malformed, and you should preferably fix whatever produced it.
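For illustration, here is a rough sketch of that round-trip idiom adapted to Python 3, not the Python 2.5 code the question is about. Python 3's UTF-8 codec rejects surrogates by default, so this version goes through UTF-16 with the `surrogatepass` error handler instead. The helper name `merge_surrogates` is hypothetical.

```python
def merge_surrogates(s: str) -> str:
    """Combine UTF-16 surrogate pairs in s into single code points.

    Assumption: Python 3. Encoding to UTF-16 with 'surrogatepass' writes each
    surrogate as a raw 16-bit unit; decoding then reads an adjacent
    high/low pair as one supplementary character.
    """
    raw = s.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


broken = "\ud834\udd0c"           # two surrogate code points
fixed = merge_surrogates(broken)  # one code point, U+1D10C
print(len(broken), len(fixed), hex(ord(fixed)))  # 2 1 0x1d10c
```

Strings without surrogates come back unchanged, so the helper can be applied to all input before counting characters with `len()`.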

> is there a more straightforward/efficient way?

Well... you could do it manually with a regex, like:
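```
import re

# Python 2, wide (UCS-4) build: merge each high/low surrogate pair into one code point.
re.sub(
    u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
    lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10) + (ord(m.group(2)) - 0xDC00) + 0x10000),
    s
)
```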
Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
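The same substitution can be sketched for Python 3, where `chr` replaces `unichr` and every `str` is a sequence of code points. This is an adaptation under that assumption, not the code from the answer; the names `_SURROGATE_PAIR`, `_combine` and `merge_surrogates_regex` are made up for the example.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile("([\ud800-\udbff])([\udc00-\udfff])")


def _combine(match):
    """Apply the UTF-16 decoding formula to one matched surrogate pair."""
    high = ord(match.group(1)) - 0xD800   # top 10 bits of the offset
    low = ord(match.group(2)) - 0xDC00    # bottom 10 bits of the offset
    return chr((high << 10) + low + 0x10000)


def merge_surrogates_regex(s):
    """Replace every surrogate pair in s with the code point it encodes."""
    return _SURROGATE_PAIR.sub(_combine, s)


print(len(merge_surrogates_regex("\ud834\udd0c")))  # 1
```

Because the pattern only matches a high surrogate directly followed by a low one, unpaired surrogates are left in place rather than raising an error.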