Best way to decode command line inputs to Unicode Python 2.7 scripts

野趣味 2021-01-07 13:16

All my scripts use Unicode literals throughout, with

from __future__ import unicode_literals

but this creates a problem when there is the p

2 answers
  •  无人及你
    2021-01-07 14:13

    I don't think sys.getfilesystemencoding() will necessarily return the right encoding for the shell. That encoding depends on the shell, which can set it independently of the file system. The file system encoding only describes how non-ASCII filenames are stored.

    Instead, you should probably look at sys.stdin.encoding, which gives you the encoding for standard input.
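
    One caveat: on Python 2, sys.stdin.encoding is None when standard input is not a terminal (for example, when the script is fed from a pipe), and decoding with None fails. The following is a minimal sketch of a fallback, not part of the original answer. The helper name pick_encoding and the choice of fallbacks (the locale's preferred encoding, then UTF-8) are assumptions made for illustration.

```python
import locale
import sys


def pick_encoding(stream=None):
    """Return an encoding name to use for decoding command line arguments.

    Hypothetical helper: prefers the stream's own encoding, then the
    locale's preferred encoding, then UTF-8 as a last resort.
    """
    if stream is None:
        stream = sys.stdin
    # The attribute may be missing or None (e.g. stdin is a pipe on Python 2).
    encoding = getattr(stream, 'encoding', None)
    return encoding or locale.getpreferredencoding() or 'utf-8'


print(pick_encoding())
```

    The result could then be passed as the encoding when decoding each argument. Whether the locale is the right fallback depends on how the shell was configured, so treat it as a guess rather than a guarantee.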

    Additionally, you might consider using the type keyword argument when you add an argument:

    import sys
    import argparse as ap
    
    def foo(str_, encoding=sys.stdin.encoding):
        # On Python 2, argparse passes each argument as a byte string;
        # decode it to a unicode object.
        return str_.decode(encoding)
    
    parser = ap.ArgumentParser()
    parser.add_argument('my_int', type=int)
    parser.add_argument('my_arg', type=foo)
    args = parser.parse_args()
    
    print repr(args)
    

    Demo:

    $ python spam.py abc hello
    usage: spam.py [-h] my_int my_arg
    spam.py: error: argument my_int: invalid int value: 'abc'
    $ python spam.py 123 hello
    Namespace(my_arg=u'hello', my_int=123)
    $ python spam.py 123 ollǝɥ
    Namespace(my_arg=u'oll\u01dd\u0265', my_int=123)
    

    If you work with non-ASCII data a lot, I would highly recommend upgrading to Python 3. Everything is a lot easier there. For example, parsed arguments are already Unicode strings on Python 3.
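
    If a script has to run on both versions in the meantime, the type function can decode only when it actually receives bytes. This is a rough sketch under that assumption. The name to_text and the UTF-8 fallback are illustrative choices, not something from the original answer.

```python
import argparse
import sys


def to_text(value, encoding=None):
    """Decode byte strings to text; pass text through unchanged."""
    if isinstance(value, bytes):  # on Python 2, str is bytes
        # Assumption: fall back to UTF-8 when stdin reports no encoding.
        stdin_encoding = getattr(sys.stdin, 'encoding', None)
        return value.decode(encoding or stdin_encoding or 'utf-8')
    return value  # Python 3 arguments arrive already decoded


parser = argparse.ArgumentParser()
parser.add_argument('my_arg', type=to_text)

# An explicit list stands in for sys.argv[1:] so the sketch runs anywhere.
args = parser.parse_args(['hello'])
print(repr(args.my_arg))
```

    On Python 3 the function is effectively a no-op, so it can be dropped once Python 2 support is no longer needed.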


    Since there is conflicting information around about the encoding of command line arguments, I decided to test it by changing my shell encoding to Latin-1 while leaving the file system encoding as UTF-8. For my tests I use the c-cedilla character (Ç), which is encoded differently in the two:

    >>> u'Ç'.encode('ISO8859-1')
    '\xc7'
    >>> u'Ç'.encode('utf-8')
    '\xc3\x87'
    

    Now I create an example script:

    #!/usr/bin/python2.7
    import argparse as ap
    import sys
    
    print 'sys.stdin.encoding is ', sys.stdin.encoding
    print 'sys.getfilesystemencoding() is', sys.getfilesystemencoding()
    
    def encoded(s):
        print 'encoded', repr(s)
        return s
    
    def decoded_filesystemencoding(s):
        try:
            s = s.decode(sys.getfilesystemencoding())
        except UnicodeDecodeError:
            s = 'failed!'
        return s
    
    def decoded_stdinputencoding(s):
        try:
            s = s.decode(sys.stdin.encoding)
        except UnicodeDecodeError:
            s = 'failed!'
        return s
    
    parser = ap.ArgumentParser()
    parser.add_argument('first', type=encoded)
    parser.add_argument('second', type=decoded_filesystemencoding)
    parser.add_argument('third', type=decoded_stdinputencoding)
    args = parser.parse_args()
    
    print repr(args)
    

    Then I change my shell encoding to ISO/IEC 8859-1.

    And I call the script:

    wim-macbook:tmp wim$ ./spam.py Ç Ç Ç
    sys.stdin.encoding is  ISO8859-1
    sys.getfilesystemencoding() is utf-8
    encoded '\xc7'
    Namespace(first='\xc7', second='failed!', third=u'\xc7')
    

    As you can see, the command line arguments were encoded in Latin-1, so the second command line argument (using sys.getfilesystemencoding()) fails to decode. The third command line argument (using sys.stdin.encoding) decodes correctly.
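
    The failure can be reproduced without a shell. This is a small sketch, not part of the original test, and the helper try_decode is a hypothetical name. It decodes the two byte sequences for Ç with the "wrong" encoding in each direction.

```python
C_CEDILLA = u'\u00c7'  # the character used in the test above

latin1_bytes = C_CEDILLA.encode('iso8859-1')  # one byte: 0xC7
utf8_bytes = C_CEDILLA.encode('utf-8')        # two bytes: 0xC3 0x87


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Latin-1 bytes are not valid UTF-8 here: 0xC7 announces a two-byte
# sequence but no continuation byte follows, so decoding fails.
print(repr(try_decode(latin1_bytes, 'utf-8')))

# The reverse direction never raises, because every byte value maps to a
# Latin-1 character; the result is two wrong characters instead of one Ç.
print(repr(try_decode(utf8_bytes, 'iso8859-1')))
```

    The second case is worth keeping in mind: a decode that succeeds does not prove the encoding was the right one, so checking the decoded value against the expected character is the more reliable test.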
