Unicode encoding for filesystem in Mac OS X not correct in Python?

暖寄归人 2020-12-06 01:01

Having a bit of a struggle with Unicode filenames in OS X and Python. I am trying to use filenames as input for a regular expression later in the code, but the encoding used

2 Answers
  • 2020-12-06 01:31

    Mac OS X stores filenames in a decomposed form of UTF-8. If you need to, for example, read in filenames and write them to a "normal" UTF-8 file, you must normalize them first (Python 2 syntax):

    import unicodedata
    filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
    

    from here: https://web.archive.org/web/20120423075412/http://boodebr.org/main/python/all-about-python-and-unicode
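
For readers on Python 3, where the `unicode` builtin no longer exists, the same normalization might look roughly like the sketch below. The helper names `to_nfc` and `list_nfc` are made up for illustration, and byte names are assumed to be UTF-8 encoded.

```python
import os
import unicodedata


def to_nfc(name):
    """Return a filename as an NFC-normalized str (hypothetical helper)."""
    if isinstance(name, bytes):
        # Assumption: byte names (e.g. from os.listdir(b".")) are UTF-8.
        name = name.decode("utf-8")
    return unicodedata.normalize("NFC", name)


def list_nfc(directory="."):
    """List directory entries with every name normalized to NFC."""
    return [to_nfc(entry) for entry in os.listdir(directory)]


# 'o' followed by COMBINING DIAERESIS (U+0308): a decomposed sample name.
decomposed = "o\u0308.txt"
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))  # 6 5
```

The composed name is one code point shorter because `o` and the combining mark collapse into a single `ö`.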

  • 2020-12-06 01:48

    sys.getfilesystemencoding() is giving you the correct response (the encoding), but it does not tell you the Unicode normalisation form.

    In particular, the HFS+ filesystem uses UTF-8 encoding and a normalisation form close to "D", which requires composed characters like ö (U+00F6) to be decomposed into o followed by a combining diaeresis (U+0308). HFS+ is also tied to the normalisation form as it existed in Unicode version 3.2, as detailed in Apple's documentation for the HFS+ format.

    Python's unicodedata.normalize function converts between forms, and if you call normalize on the unicodedata.ucd_3_2_0 object instead, the conversion is constrained to Unicode version 3.2 (Python 2 syntax):

    import unicodedata
    filename = unicodedata.ucd_3_2_0.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')
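
To tie this back to the regular-expression use case in the question, here is a rough Python 3 sketch. The helper `normalize_3_2` and the sample filename are hypothetical, and the code uses the standard NFD/NFC forms from the Unicode 3.2 database, which is only close to (not necessarily identical to) what HFS+ does.

```python
import re
import unicodedata


def normalize_3_2(name, form="NFC"):
    """Normalize a str with the Unicode 3.2 database (hypothetical helper)."""
    return unicodedata.ucd_3_2_0.normalize(form, name)


# Decomposing ö yields 'o' followed by COMBINING DIAERESIS (U+0308).
decomposed = normalize_3_2("\u00f6", "NFD")
print([unicodedata.name(ch) for ch in decomposed])

# A pattern typed with composed ö does not match a decomposed filename
# until the filename is normalized to the same form.
pattern = re.compile("h\u00f6ren")
on_disk = "ho\u0308ren.txt"  # sample name in decomposed form

matches_raw = pattern.match(on_disk) is not None
matches_nfc = pattern.match(normalize_3_2(on_disk)) is not None
print(matches_raw, matches_nfc)
```

Normalizing both the pattern and the filenames to the same form before matching avoids depending on how either one happened to be typed or stored.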
    