How to get a reliable Unicode character count in Python?

一个人的身影 2021-01-05 01:48

Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datasto

2 answers
  •  耶瑟儿~
    2021-01-05 02:27

    I know I can just encode it to UTF-8 and then decode again

    Yes, that's the usual idiom for fixing up “UTF-16 surrogates in UCS-4 string” input. But as Mechanical snail said, this input is malformed, and the better fix is to correct whatever produced it.
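
    As a rough sketch of that round trip: the snippet below assumes Python 3, where the UTF-8 codec refuses to encode surrogates, so it goes through UTF-16 with the `surrogatepass` error handler instead of the UTF-8 round trip from the question. The helper name `merge_surrogates` is made up for illustration.

```python
def merge_surrogates(text):
    """Combine UTF-16 surrogate pairs in a str into single code points."""
    # 'surrogatepass' lets the UTF-16 codec write surrogate code points as raw
    # 16-bit units; decoding then reads each high/low pair as one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


broken = "\ud834\udd0c"          # two surrogate code points, len() == 2
fixed = merge_surrogates(broken)
print(len(broken), len(fixed))   # 2 1
```

    Note that a lone surrogate with no partner makes the decode step raise UnicodeDecodeError, which is arguably what you want for malformed input.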

    is there a more straightforward/efficient way?

    Well... you could do it manually with a regex, like:

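    import re

    # Python 2, wide (UCS-4) build: merge each surrogate pair into one code point.
    s = re.sub(
        u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
        lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10) + (ord(m.group(2)) - 0xDC00) + 0x10000),
        s
    )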
    

    Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
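
    For readers on Python 3, the same regex idea might look like the sketch below. It assumes `chr` in place of `unichr` and plain `str` literals; the names `combine_pair` and `merge_surrogates_re` are hypothetical, not from the question.

```python
import re

# Matches a high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile("([\ud800-\udbff])([\udc00-\udfff])")


def combine_pair(match):
    """Map one surrogate pair to the supplementary-plane character it encodes."""
    high = ord(match.group(1)) - 0xD800
    low = ord(match.group(2)) - 0xDC00
    return chr((high << 10) + low + 0x10000)


def merge_surrogates_re(text):
    """Replace every surrogate pair in text; lone surrogates are left as-is."""
    return _SURROGATE_PAIR.sub(combine_pair, text)


sample = "a\ud834\udd0cb"
print(len(sample), len(merge_surrogates_re(sample)))  # 4 3
```

    Unlike an encode/decode round trip, this leaves unpaired surrogates in place rather than raising, so it only counts reliably if the pairs are well formed.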
