How do I treat an ASCII string as Unicode and unescape the escaped characters in it in Python?

生来不讨喜 2020-11-30 03:13

For example, if I have a unicode string, I can encode it as an ASCII string like so:

>>> u'\u003cfoo/\u003e'.encode('ascii')


        
5 answers
  • 2020-11-30 03:47

    On Python 2.5 the correct encoding name is "unicode_escape", not "unicode-escape" (note the underscore).

    I'm not sure whether newer versions of Python changed the codec name, but for me only the underscore spelling worked.

    Anyway, that is the fix.
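
    For comparison, here is a minimal sketch of how the two spellings behave on Python 3. That is an assumption on my part, since the answer above is about Python 2.5. On Python 3 the codec name is normalized before lookup, so both spellings resolve to the same codec. The sample bytes are illustrative.

```python
import codecs

# Look up the codec under both spellings and compare the resolved names.
hyphen = codecs.lookup("unicode-escape")
underscore = codecs.lookup("unicode_escape")
same_codec = hyphen.name == underscore.name

# Illustrative input: ASCII bytes containing literal backslash-u escapes.
raw = b"\\u003cfoo/\\u003e"
decoded_hyphen = raw.decode("unicode-escape")
decoded_underscore = raw.decode("unicode_escape")
```

    If only one spelling works in your environment, keeping the underscore form is the conservative choice.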

  • 2020-11-30 03:47

    At some point you will run into issues when a string you want to decode contains special characters such as Chinese characters or emoticons, i.e. errors that look like this:

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
    

    In my case (Twitter data processing), I decoded as follows, which let me see all characters without errors:

    >>> s = '\u003cfoo\u003e'
    >>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
    '<foo>'
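
    The session above is Python 2. A rough Python 3 sketch of the same idea follows, assuming the input arrives as ASCII bytes. The variable `raw` is hypothetical sample data, not taken from the post. It also shows why encoding to ASCII fails once the decoded text contains a non-ASCII character.

```python
# Hypothetical sample: ASCII bytes with escapes, including a Chinese character.
raw = b"\\u003cfoo\\u003e \\u4e2d"

# Turn the backslash-u escapes into real characters.
text = raw.decode("unicode_escape")

# UTF-8 can represent every character, so this encode succeeds.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent U+4E2D, so this encode raises UnicodeEncodeError.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```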
    
  • 2020-11-30 03:57

    Ned Batchelder said:

    It's a little dangerous depending on where the string is coming from, but how about:

    >>> s = '\u003cfoo\u003e'
    >>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
    '<foo>'
    

    Actually this method can be made safe like so:

    >>> s = '\u003cfoo\u003e'
    >>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
    

    Note the triple-quoted string and the dash right before the closing triple quote.

    1. Using a triple-quoted string ensures that if the user enters ' \\" ' (spaces added for visual clarity) in the string, it does not disrupt the evaluator.
    2. The dash at the end is a failsafe in case the user's string ends with ' \" '. Before assigning the result, we slice off the inserted dash with [:-1].

    So there is no need to worry about what users enter, as long as it is captured in raw format.
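
    As a more cautious variant, the same triple-quote-and-dash construction can be passed to `ast.literal_eval` instead of `eval`. This is a swapped-in technique, not what the answer above uses, and it is written for Python 3 (an assumption). `literal_eval` only accepts literals, so input that somehow escaped the string would raise an exception rather than run as code. The helper name is hypothetical.

```python
import ast

def unescape_with_literal_eval(s):
    """Hypothetical helper: unescape backslash escapes in s without eval()."""
    # Escape double quotes, wrap in triple quotes, and append a dash so that
    # a trailing backslash or quote cannot interfere with the closing quotes.
    source = '"""' + s.replace('"', r'\"') + '-"""'
    # literal_eval parses the literal only; drop the dash afterwards.
    return ast.literal_eval(source)[:-1]

result = unescape_with_literal_eval(r'\u003cfoo\u003e')
quoted = unescape_with_literal_eval(r'say "hi"\u0021')
```

    Malformed escapes in the input (for example a truncated `\u00`) still raise an exception, so callers may want to catch that.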

  • 2020-11-30 04:02

    It's a little dangerous depending on where the string is coming from, but how about:

    >>> s = '\u003cfoo\u003e'
    >>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
    '<foo>'
    
  • 2020-11-30 04:05

    It took me a while to figure this one out, but this page had the best answer:

    >>> s = '\u003cfoo/\u003e'
    >>> s.decode( 'unicode-escape' )
    u'<foo/>'
    >>> s.decode( 'unicode-escape' ).encode( 'ascii' )
    '<foo/>'
    

    There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).
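
    To illustrate the difference between the two codecs, here is a small sketch for Python 3 (an assumption; the session above is Python 2). The sample bytes are illustrative. As far as I know, 'unicode_escape' interprets all Python string escapes, while 'raw_unicode_escape' only interprets the \uXXXX and \UXXXXXXXX forms.

```python
# Illustrative input: escaped angle brackets, then a backslash followed by 'n'.
raw = b"\\u003cfoo/\\u003e\\n"

# unicode_escape handles \uXXXX and also turns \n into a newline.
full = raw.decode("unicode_escape")

# raw_unicode_escape handles \uXXXX but leaves the backslash-n untouched.
raw_only = raw.decode("raw_unicode_escape")
```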

    EDIT: See also Python Standard Encodings.
