Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datasto
I know I can just encode it to UTF-8 and then decode again
Yes, that's the usual idiom for fixing up “UTF-16 surrogates in a UCS-4 string” input. But as Mechanical snail said, this input is malformed, and you should preferably fix whatever produced it.
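As a minimal sketch of that round-trip idiom: the helper name `roundtrip_fix` is hypothetical, and the fallback branch is an assumption added for Python 3, whose UTF-8 codec refuses to encode surrogates, so the same trick has to go through UTF-16 with the `surrogatepass` error handler there. On the Python 2 wide build discussed here, only the first branch should be needed.

```python
def roundtrip_fix(s):
    """Merge UTF-16 surrogate pairs in s by encoding and decoding again."""
    try:
        # Python 2: the UTF-8 encoder writes a surrogate pair as one 4-byte
        # sequence, which decodes to a single code point on a wide build.
        return s.encode('utf-8').decode('utf-8')
    except UnicodeEncodeError:
        # Assumption: Python 3.4+, where UTF-8 rejects surrogates outright.
        # UTF-16 with 'surrogatepass' lets them through, and decoding then
        # reads a valid high/low pair as one code point.
        raw = s.encode('utf-16-le', 'surrogatepass')
        return raw.decode('utf-16-le', 'surrogatepass')
```

Calling `roundtrip_fix(u'\ud834\udd0c')` should give back a one-character string holding U+1D10C, while strings without surrogates pass through unchanged.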
is there a more straightforward/efficient way?
Well... you could do it manually with a regex, like:
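import re

# Replace each high/low surrogate pair with the single code point it encodes.
re.sub(
    u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
    lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10) + (ord(m.group(2)) - 0xDC00) + 0x10000),
    s
)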
Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
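For comparison, here is the regex approach wrapped in a reusable function. This is only a sketch: `combine_surrogates`, `_merge` and `_to_char` are hypothetical names, and the `chr` fallback is an assumption added so the same code also runs on Python 3, where `unichr` no longer exists. On Python 2 it needs a wide (UCS-4) build, since `unichr` rejects values above 0xFFFF on narrow builds.

```python
import re
import sys

# unichr on Python 2, chr on Python 3 (assumption: not in the original answer).
_to_char = unichr if sys.version_info[0] < 3 else chr

# A high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile(u'([\ud800-\udbff])([\udc00-\udfff])')

def _merge(match):
    # Each surrogate carries 10 bits of the code point's offset from 0x10000.
    high = ord(match.group(1)) - 0xD800
    low = ord(match.group(2)) - 0xDC00
    return _to_char((high << 10) + low + 0x10000)

def combine_surrogates(s):
    """Replace each surrogate pair in s with the code point it encodes."""
    return _SURROGATE_PAIR.sub(_merge, s)
```

Lone surrogates that are not part of a pair are left untouched, because the pattern only matches a high surrogate directly followed by a low one.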