Read Unicode characters from command-line arguments in Python 2.x on Windows

萌比男神i 2020-11-27 04:30

I want my Python script to be able to read Unicode command line arguments in Windows. But it appears that sys.argv is a string encoded in some local encoding, rather than Un

4 answers
  • 2020-11-27 05:01

    The command line is probably encoded in the system's Windows code page rather than in Unicode. Try decoding the arguments into unicode objects:

    import sys
    args = [unicode(x, "iso-8859-9") for x in sys.argv]
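
    A hard-coded codec such as "iso-8859-9" only fits machines configured for that code page. The sketch below is one possible generalization, assuming the locale's preferred encoding matches the encoding the arguments arrived in. The helper name decode_args and its parameters are hypothetical and do not come from the answer above.

```python
import locale


def decode_args(args, encoding=None):
    """Return args with every byte string decoded to text.

    decode_args is an illustrative helper, not an existing API.
    """
    if encoding is None:
        # Assumption: the locale's preferred encoding is the one the
        # command-line bytes were written in.
        encoding = locale.getpreferredencoding(False)
    decoded = []
    for arg in args:
        if isinstance(arg, bytes):
            # errors="replace" avoids UnicodeDecodeError on a wrong guess.
            arg = arg.decode(encoding, "replace")
        decoded.append(arg)
    return decoded


# In ISO-8859-9, byte 0xFD is U+0131 and byte 0xFE is U+015F.
turkish = decode_args([b"k\xfd\xfe"], "iso-8859-9")
```

    In a Python 2 script this would be called as decode_args(sys.argv). Arguments that are already text are passed through untouched.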
    
  • 2020-11-27 05:05

    Try this:

    import sys
    print repr(sys.argv[1].decode('UTF-8'))
    

    You may have to substitute CP437 or CP1252 for UTF-8. You should be able to infer the proper encoding name from the registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP.
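
    The registry stores the code page as a number, while decode() wants a codec name. A minimal sketch of that mapping follows, assuming the number has already been read from the registry. The helper codec_for_code_page is a hypothetical name. Whether OEMCP or the neighbouring ACP value applies to command-line arguments is worth checking on your own system.

```python
import codecs


def codec_for_code_page(code_page):
    """Map a Windows code page number (int or string) to a Python codec name.

    Hypothetical helper; raises LookupError if Python has no such codec.
    """
    number = int(code_page)
    # Code page 65001 is the Windows identifier for UTF-8.
    name = "utf-8" if number == 65001 else "cp%d" % number
    return codecs.lookup(name).name


# Byte 0x82 is U+00E9 (e with acute accent) in code page 437.
raw = b"caf\x82"
decoded = raw.decode(codec_for_code_page("437"))
```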

  • 2020-11-27 05:13

    Here is a solution that is just what I'm looking for, making a call to the Windows CommandLineToArgvW function:
    Get sys.argv with Unicode characters under Windows (from ActiveState)

    But I've made several changes, to simplify its usage and better handle certain uses. Here is what I use:

    win32_unicode_argv.py

    """
    win32_unicode_argv.py
    
    Importing this will replace sys.argv with a full Unicode form.
    Windows only.
    
    From this site, with adaptations:
          http://code.activestate.com/recipes/572200/
    
    Usage: simply import this module into a script. sys.argv is changed to
    be a list of Unicode strings.
    """
    
    
    import sys
    
    def win32_unicode_argv():
        """Uses shell32.CommandLineToArgvW to get sys.argv as a list of Unicode
        strings.
    
        Versions 2.x of Python don't support Unicode in sys.argv on
        Windows, with the underlying Windows API instead replacing characters
        that the ANSI code page cannot represent with '?'.
        """
    
        from ctypes import POINTER, byref, cdll, c_int, windll
        from ctypes.wintypes import LPCWSTR, LPWSTR
    
        GetCommandLineW = cdll.kernel32.GetCommandLineW
        GetCommandLineW.argtypes = []
        GetCommandLineW.restype = LPCWSTR
    
        CommandLineToArgvW = windll.shell32.CommandLineToArgvW
        CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
        CommandLineToArgvW.restype = POINTER(LPWSTR)
    
        cmd = GetCommandLineW()
        argc = c_int(0)
        argv = CommandLineToArgvW(cmd, byref(argc))
        if argc.value > 0:
            # Remove Python executable and commands if present
            start = argc.value - len(sys.argv)
            return [argv[i] for i in
                    xrange(start, argc.value)]
    
    # Keep the original sys.argv if the Windows call returned nothing.
    sys.argv = win32_unicode_argv() or sys.argv
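
    The start offset in the module above works because Windows reports the whole command line (interpreter path, interpreter options, script, arguments), while sys.argv only holds the script and what follows it. The platform-independent sketch below isolates that slicing step. The function name strip_interpreter_args and the sample paths are made up for illustration.

```python
def strip_interpreter_args(full_argv, script_argc):
    """Keep only the last script_argc entries of the full command line.

    full_argv: every token Windows saw, e.g. from CommandLineToArgvW.
    script_argc: len(sys.argv) as computed by Python.
    Both names are illustrative.
    """
    start = len(full_argv) - script_argc
    return full_argv[start:]


# A command line as Windows would split it (hypothetical paths).
full = [u"C:\\Python27\\python.exe", u"-u", u"myscript.py",
        u"\u65e5\u672c\u8a9e.txt"]
# Python itself would report two entries: the script and one argument.
script_args = strip_interpreter_args(full, 2)
```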
    

    Now, the way I use it is simply to do:

    import sys
    import win32_unicode_argv
    

    and from then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
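
    As a rough check of the optparse remark, the sketch below parses a list of Unicode strings standing in for sys.argv[1:]. The --title option and the sample values are invented for the example.

```python
from optparse import OptionParser

parser = OptionParser()
# --title is a made-up option for illustration.
parser.add_option("--title", dest="title")

# Stand-in for sys.argv[1:] after win32_unicode_argv has been imported.
unicode_args = [u"--title", u"\u65e5\u672c\u8a9e", u"notes.txt"]
options, positional = parser.parse_args(unicode_args)
```

    Both the option value and the positional argument come back as the same Unicode strings that went in.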

  • 2020-11-27 05:18

    Dealing with encodings is very confusing.

    I believe that data entered on the command line is encoded in whatever your system encoding is, not in Unicode. (Even copy/paste should behave this way.)

    So it should be correct to decode into unicode using the system encoding:

    import codecs
    import sys
    
    first_arg = sys.argv[1]
    print first_arg
    print type(first_arg)
    
    first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
    print first_arg_unicode
    print type(first_arg_unicode)
    
    f = codecs.open(first_arg_unicode, 'r', 'utf-8')
    unicode_text = f.read()
    print type(unicode_text)
    print unicode_text.encode(sys.getfilesystemencoding())
    

    Running the script as follows:

    Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

    produces this output:

    PC・ソフト申請書08.09.24.txt
    <type 'str'>
    PC・ソフト申請書08.09.24.txt
    <type 'unicode'>
    <type 'unicode'>
    ?日本語
    

    Here the file "PC・ソフト申請書08.09.24.txt" contained the text "日本語". (I encoded the file as UTF-8 using Windows Notepad. I'm a little stumped as to why there's a '?' at the beginning when printing. Something to do with how Notepad saves UTF-8?)
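
    The stray '?' is most likely the UTF-8 byte order mark (BOM) that Notepad puts at the start of files saved as UTF-8. The plain "utf-8" codec keeps it as the character U+FEFF, which then cannot be encoded for the console. The sketch below reproduces this with in-memory bytes, assuming the file held the BOM followed by the text.

```python
import codecs

text = u"\u65e5\u672c\u8a9e"  # the file's content, written as escapes
# Assumption: Notepad saved the file as the UTF-8 BOM followed by the text.
raw = codecs.BOM_UTF8 + text.encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped
```

    If that is the cause, passing "utf-8-sig" instead of "utf-8" to codecs.open() should make the extra character disappear.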

    A byte string's decode() method or the unicode() built-in can be used to convert encoded bytes into a unicode object.

    unicode_str = utf8_str.decode('utf8')
    unicode_str = unicode(utf8_str, 'utf8')
    

    Also, if you're dealing with encoded files, you may want to use the codecs.open() function in place of the built-in open(). It lets you specify the file's encoding and then transparently decodes the content to unicode.

    So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be a unicode object.
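
    A small round trip can confirm that behaviour. The sketch below writes a throwaway temporary file (the path is generated, not taken from the answer) and reads it back through codecs.open().

```python
import codecs
import os
import tempfile

# Create a throwaway file to write to and read back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with codecs.open(path, "w", "utf8") as f:
        f.write(u"\u65e5\u672c\u8a9e")
    with codecs.open(path, "r", "utf8") as f:
        content = f.read()  # already decoded text, not raw bytes
finally:
    os.remove(path)
```

    io.open(path, encoding="utf8") behaves similarly and may be the better choice on newer Python versions.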

    codecs.open: http://docs.python.org/library/codecs.html?#codecs.open

    If I'm misunderstanding something, please let me know.

    If you haven't already, I recommend reading Joel's article on Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html
