Python's handling of shell strings

Submitted by 南笙酒味 on 2019-12-04 09:31:05

In Python 2, str objects contain bytes. Python doesn't dictate what those bytes represent. If you produced ASCII-compatible bytes, you can decode them as ASCII. If they represent UTF-8 data, they can be decoded as such. If they represent an image, you can decode that information and display the image somewhere. When you use repr() on a str object, Python leaves any printable ASCII bytes as they are and converts the rest to escape sequences; this keeps debugging such data practical even in ASCII-only environments.
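
To make the point about bytes and repr() concrete, here is a minimal sketch. It is written in Python 3 syntax (where the byte string type is called bytes rather than str), and the sample word is only an assumed illustration, not data taken from your session:

```python
# Python 3 sketch: bytes carry no meaning until a codec is chosen.
data = "kožušček".encode("utf-8")  # assumed sample bytes, for illustration only

# repr() leaves printable ASCII bytes alone and escapes everything else.
shown = repr(data)
print(shown)

# The very same bytes decode to different text under different codecs.
as_utf8 = data.decode("utf-8")
as_latin1 = data.decode("latin-1")
print(len(data), len(as_utf8), len(as_latin1))
print(as_utf8 == as_latin1)
```

Both decode calls succeed, which is exactly why the bytes alone cannot tell you which codec is the right one.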

Your terminal or console in which you are running the interactive interpreter writes bytes to the stdin stream that Python reads from when you type. Those bytes are encoded according to the configuration of that terminal or console.

In your case, your console encoded the input you typed to a Windows codepage, most likely. You'll need to figure out the exact codepage and use that codec to decode the bytes. Codepage 1252 seems to fit:

>>> print 'ko\x9eu\x9a\xe8ek'.decode('cp1252')
kožušèek
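
If you are unsure which codepage applies, one way to narrow it down is to decode the same bytes with a few candidates and compare the results by eye. The sketch below uses Python 3 syntax; the helper name try_codecs and the shortlist of codecs are assumptions made for illustration, not a detection method:

```python
# Python 3 sketch: compare how a few candidate codecs decode the same bytes.
raw = b'ko\x9eu\x9a\xe8ek'

# Hypothetical shortlist; extend it with whatever codepages your system may use.
candidate_codecs = ["cp1250", "cp1252", "utf-8"]

def try_codecs(data, codecs):
    """Return {codec: decoded text} for every codec that decodes without error."""
    results = {}
    for name in codecs:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this codec, so rule it out.
            continue
    return results

decoded = try_codecs(raw, candidate_codecs)
for name, text in decoded.items():
    print(name, repr(text))
```

With these bytes, cp1252 maps 0xe8 to è while cp1250 maps it to č, so if the word you typed was kožušček (the spelling that appears further down), the Central European codepage 1250 may be the better fit. UTF-8 is ruled out because the bytes are not valid UTF-8.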

When you print those same bytes, your console is reading those bytes and interpreting them in the same codec it is already configured with.

Python can tell you what codec it thinks your console is set to. It has to detect this for Unicode literals, where the input must be decoded for you. It uses the locale.getpreferredencoding() function to determine this, and the sys.stdin and sys.stdout objects each have an encoding attribute; mine is set to UTF-8:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> 'kožušèek'
'ko\xc5\xbeu\xc5\xa1\xc3\xa8ek'
>>> u'kožušèek'
u'ko\u017eu\u0161\xe8ek'
>>> print u'kožušèek'
kožušèek

Because my terminal has been configured for UTF-8 and Python has detected this, using a Unicode literal u'...' works. The data is automatically decoded by Python.
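
To check what Python has detected on your own machine, you can gather the same values in one place. This is a small sketch in Python 3 syntax; the function name report_encodings is hypothetical, and the values it returns depend entirely on your environment:

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python reports for the current process."""
    return {
        # Either stream can be missing or replaced, hence the getattr fallback.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        # False means: report the current setting without calling setlocale.
        "preferred": locale.getpreferredencoding(False),
    }

for name, value in report_encodings().items():
    print(name, value)
```

If the stdin and stdout values differ from what your console really uses, that mismatch is a likely source of garbled text.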

I don't know exactly why your console lost a whole letter. I'd need access to your console to experiment: look at the output of print repr(s2), and test all bytes between 0x00 and 0xFF to see whether the loss happens on the input or the output side of the console.

I recommend you read up on Python and Unicode.

Your system does not necessarily use the sys.getdefaultencoding() encoding; it is merely the default used when you convert without telling it the encoding, as in:

>>> sys.getdefaultencoding()
'ascii'
>>> unicode(s1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128)
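
The same failure can be reproduced explicitly. Python 3 no longer converts implicitly, so this sketch calls decode("ascii") by hand; the value of s1 is an assumption (the UTF-8 bytes of the sample word), chosen to match the byte and position in the traceback above:

```python
# Python 3 sketch: the ASCII codec rejects any byte above 0x7F.
s1 = b'ko\xc5\xbeu\xc5\xa1\xc4\x8dek'  # assumed UTF-8 bytes of the sample word

try:
    s1.decode("ascii")
except UnicodeDecodeError as exc:
    # Record which codec failed, where, and on which byte.
    failure = (exc.encoding, exc.start, hex(s1[exc.start]))
    print(failure)

# Naming the right codec makes the conversion succeed.
text = s1.decode("utf-8")
print(text)
```

The error goes away as soon as the codec is named explicitly, which is why relying on the default is fragile.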

Python's idea of your system locale is in the locale module:

>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'

And using this we can decode the string:

>>> u1=s1.decode(locale.getdefaultlocale()[1])
>>> u1
u'ko\u017eu\u0161\u010dek'
>>> print u1
kožušček

There's a chance the locale has not been set up, as is the case for the 'C' locale. The reported encoding may then be None even though the default is 'ascii'. Normally figuring this out is the job of setlocale, which getpreferredencoding will call automatically. I would suggest calling getpreferredencoding once at program startup and saving the returned value for all further use. The encoding used for filenames may be yet another case; it is reported by sys.getfilesystemencoding().
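
A startup routine along those lines might look like the following sketch. It uses Python 3 syntax, and the names detect_encoding, CONSOLE_ENCODING, FILENAME_ENCODING and decode_input are hypothetical, as is the choice of 'ascii' as the fallback:

```python
import codecs
import locale
import sys

def detect_encoding(fallback="ascii"):
    """Initialise the locale once and return a codec name that Python knows."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # unsupported locale settings: keep whatever is already active
    # Guard against an empty or missing answer from the locale machinery.
    name = locale.getpreferredencoding(False) or fallback
    try:
        codecs.lookup(name)
    except LookupError:
        name = fallback
    return name

# Computed once at startup and reused everywhere else.
CONSOLE_ENCODING = detect_encoding()
FILENAME_ENCODING = sys.getfilesystemencoding()

def decode_input(data, encoding=CONSOLE_ENCODING):
    """Decode console bytes; undecodable bytes become U+FFFD instead of raising."""
    return data.decode(encoding, errors="replace")

print(CONSOLE_ENCODING, FILENAME_ENCODING)
```

Using errors="replace" is a design choice for this sketch: it trades silent data loss for not crashing, so stricter programs may prefer the default strict mode.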

The Python-internal default encoding is set up by the site module, which contains:

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

So if you want it set by default in every run of Python, you can change that first if 0 to if 1.
