Unicode encoding for filesystem in Mac OS X not correct in Python?

暖寄归人 2020-12-06 01:01

I'm having a bit of a struggle with Unicode file names in OS X and Python. I am trying to use filenames as input for a regular expression later in the code, but the encoding used

2 answers
  •  谎友^ (original poster)
    2020-12-06 01:48

    sys.getfilesystemencoding() is giving you the correct answer (the encoding), but it does not tell you the Unicode normalisation form.

    In particular, the HFS+ filesystem uses UTF-8 encoding and a normalisation form close to "D", which requires composed characters like ö to be decomposed into a plain o followed by a combining diaeresis (U+0308). HFS+ is also tied to the normalisation form as it existed in Unicode version 3.2, as detailed in Apple's documentation for the HFS+ format.
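
    As a small illustration of the difference between the composed and decomposed forms, here is a sketch in Python 3 (not part of the original answer; the variable names are chosen only for this example):

```python
import unicodedata

composed = "\u00f6"                                   # ö as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # "o" + U+0308 COMBINING DIAERESIS

# The two strings render identically but are different code point sequences.
print([hex(ord(c)) for c in composed])     # ['0xf6']
print([hex(ord(c)) for c in decomposed])   # ['0x6f', '0x308']
print(composed == decomposed)              # False

# Both are valid UTF-8, so knowing the encoding alone cannot tell them apart.
print(composed.encode("utf-8"))            # b'\xc3\xb6'
print(decomposed.encode("utf-8"))          # b'o\xcc\x88'
```

    This is why a filename can decode cleanly as UTF-8 and still fail to compare equal to the same name typed in an editor.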

    Python's unicodedata.normalize function converts between forms, and if you call it on the ucd_3_2_0 object instead, you can constrain it to Unicode version 3.2:

    import unicodedata
    filename = unicodedata.ucd_3_2_0.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')  # Python 2: UTF-8 bytes in, NFC UTF-8 bytes out
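
    The line above is Python 2. In Python 3, where file names are already str, the same idea might look like the sketch below. The helper name normalize_filename, the sample file name, and the pattern are all hypothetical and chosen only for illustration; whether you need the database pinned to 3.2 or the current one depends on your data.

```python
import re
import unicodedata

def normalize_filename(name: str, form: str = "NFC") -> str:
    """Return name in the given normalisation form, using the Unicode 3.2 database."""
    return unicodedata.ucd_3_2_0.normalize(form, name)

# A name in decomposed form, as an HFS+ directory listing might return it:
# "o" followed by U+0308 COMBINING DIAERESIS.
from_disk = "Bjo\u0308rk.txt"

# A pattern written with the precomposed character U+00F6.
pattern = re.compile("Bj\u00f6rk")

print(pattern.search(from_disk) is not None)                       # False
print(pattern.search(normalize_filename(from_disk)) is not None)   # True
```

    Normalising both the file names and the pattern text to the same form before matching avoids this kind of silent mismatch.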
    
