UTF8 Filenames in PHP and Different Unicode Encodings


I have a file containing Unicode characters on a server running Linux. If I SSH into the server and use tab-completion to navigate to the file/folder containing unicode char

3 Answers

夕颜

2020-12-04 02:39

Firstly: You should try to avoid imposing semantics on the names of files. I can't really tell why PHP is generating filenames in your scenario, so I can't suggest how you should apply this rule.

The different (two-byte and three-byte) representations of é are the UTF-8 encodings of the precomposed and decomposed forms of this character in Unicode. Unicode treats these as distinct code point sequences that represent the same character. It also has the concept of "canonicalisation" (the standard calls it normalization), in which all equivalent representations of a character are converted to a single representation, much like lowercasing two strings before a case-insensitive comparison.
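To make the lowercase analogy concrete, here is a minimal Python sketch of a normalization-based comparison. The helper name `same_text` and the choice of NFC as the default form are assumptions for illustration only; any of the normalization forms would work as long as both sides use the same one.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(precomposed == decomposed)           # False: different code points
print(same_text(precomposed, decomposed))  # True: equal once normalized
# UTF-8 lengths: 2 bytes for the precomposed form, 3 for the decomposed one
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
```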

Linux does not perform canonicalisation or any other processing of file names automatically, so a file may be named with precomposed characters (like the two-byte sequence), decomposed characters (like the three-byte sequence), or any mix of the two; it is up to whoever named the file. If you are creating the files, you could set a policy (e.g. always use precomposed characters) and write some code to enforce it. Otherwise, you can't rely on any particular rule here.
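When you do not control how the files were named, one possible workaround is to compare names in normalized form instead of relying on an exact byte match. The sketch below is written in Python and is only an illustration: `find_entry` is a hypothetical helper, and normalizing to NFC is an arbitrary choice.

```python
import os
import unicodedata
from typing import Optional

def find_entry(directory: str, wanted: str) -> Optional[str]:
    """Return the on-disk name in `directory` that equals `wanted`
    after NFC normalization, or None if no entry matches."""
    target = unicodedata.normalize("NFC", wanted)
    for name in os.listdir(directory):
        if unicodedata.normalize("NFC", name) == target:
            return name  # the name exactly as it is stored on disk
    return None
```

The returned name can then be joined with the directory and opened directly, whichever form the file was created with.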

上瘾入骨i

2020-12-04 02:45
Thanks to the tips given in the two answers I was able to poke around and find some methods for normalizing the different Unicode decompositions of a given character. In the situation I was faced with, I was accessing files created by an OS X Carbon application. It is a fairly popular application, and its file names seemed to adhere to a specific Unicode decomposition.

In PHP 5.3 a new set of functions was introduced (the Normalizer class in the intl extension) that allows you to normalize a Unicode string to a particular form. There are four normalization forms (NFC, NFD, NFKC and NFKD) that you can convert your Unicode string into. Python has had Unicode normalization capabilities since version 2.3 via unicodedata.normalize. This article on Python's handling of Unicode strings was helpful in understanding encoding / string handling a bit better.

Here is a quick example in Python of normalizing a Unicode file path:
```
import unicodedata
filePath = unicodedata.normalize('NFD', filePath)
```
I found that the NFD form worked for all my purposes. I wonder if this is the standard decomposition for Unicode filenames.
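One way to check which form a set of existing file names already follows is to test each name against all four forms. This is a rough Python sketch; `matching_forms` is a hypothetical helper, and it only reports the forms that leave a given string unchanged, which says nothing about what a filesystem or application guarantees.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def matching_forms(name: str) -> list:
    """Return the normalization forms that leave `name` unchanged."""
    return [f for f in FORMS if unicodedata.normalize(f, name) == name]

print(matching_forms("re\u0301sume\u0301.txt"))  # ['NFD', 'NFKD']
print(matching_forms("r\u00e9sum\u00e9.txt"))    # ['NFC', 'NFKC']
```

Plain ASCII names are unchanged by every form, so only names with accented or otherwise composable characters tell you anything.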

野的像风

2020-12-04 02:58

The three-byte sequence is actually the UTF-8 representation of an e (0x65) followed by a combining acute accent ´ (0xcc 0x81), while 0xc3 0xa9 stands "directly" for é.
A UTF-8-aware collation should be aware of the possible decompositions, but I don't know how you can enable that on a Mac (it would probably mean recompiling the PHP source).
The best I can offer is the "Using UTF-8 with Gentoo" description.
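The byte sequences mentioned in this answer can be inspected with a few lines of Python. This is a small sketch, not part of the original answer; `describe` is a hypothetical helper that decodes UTF-8 bytes and looks up each code point's Unicode name.

```python
import unicodedata

def describe(raw: bytes) -> list:
    """Decode UTF-8 bytes and return (code point, Unicode name) pairs."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in raw.decode("utf-8")
    ]

print(describe(b"\xc3\xa9"))
# [('U+00E9', 'LATIN SMALL LETTER E WITH ACUTE')]
print(describe(b"\x65\xcc\x81"))
# [('U+0065', 'LATIN SMALL LETTER E'), ('U+0301', 'COMBINING ACUTE ACCENT')]
```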
