Question
I'm trying to verify that a file exists in Bash. I know the file name (in a variable) but not the extension (it can be .pmdl or .umdl).
On OSX, this works:
$> ls
ecole.pmdl
$> filename="ecole"
$> ls "$filename."[pu]mdl
ecole.pmdl
But it doesn't when the file name contains an accent:
$> ls
école.pmdl
$> filename="école"
$> ls "$filename."[pu]mdl
ls: école.[pu]mdl: No such file or directory
However it works if I don't use globbing:
$> ls "$filename."pmdl
école.pmdl
I'm looking for a simple solution that works in both Linux & OSX. This is the closest question I found on that topic.
Edit:
$> bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.
Edit 2:
Short version showing that the scenario fails systematically with the same é character on OSX Bash v3.2.57. The same scenario works systematically on Linux Bash 4.3.30 (the file is found).
$> touch é.txt
$> ls é*
ls: é*: No such file or directory
Answer 1:
tl;dr
Either: Use one of the following workarounds:
- ls "$(iconv -t UTF-8-MAC <<<'école')."[pu]mdl - most generic, but cumbersome.
- ls $'e\x{cc}\x{81}cole'.[pu]mdl - hard to remember, and specific to the diacritic at hand (acute accent, ´).
- ls e?cole.[pu]mdl - simple to type and remember, but limited to 1 combining diacritic and can yield false positives.
Or: Install Bash 4.3.30 or higher via Homebrew and use it instead of the Bash 3.x that macOS still comes with: brew install bash.
Gory details below.
With respect to non-ASCII characters, the macOS filesystem, HFS+, speaks only NFD (decomposed Unicode normalization form), where accented letters are represented by 2 or more Unicode codepoints: the base letter, followed by the combining diacritic(s) (accent mark(s)):
- In the case of é:
  - The ASCII base letter - e (U+0065, UTF-8 encoding 0x65) - followed by the combining acute accent (the ´ above the preceding base letter, U+0301, UTF-8 encoding 0xcc 0x81).
- Some accented characters decompose to a base letter followed by multiple combining diacritics, such as in the case of Ṹ.
- Note that the filesystem accepts NFC strings (see next point) when creating files and matching filenames literally, and automatically translates them to their NFD equivalent (decomposes them).
- As an aside: a notable critic of HFS+ in general and its use of NFD in particular is Linus Torvalds, as expressed in this article.
Typically, however - such as when you type characters in a terminal or in most editors - NFC (composed Unicode normalization form) is used, where (customary) accented letters are represented by 1 Unicode codepoint:
- In the case of é: single Unicode character U+00E9, UTF-8 encoding 0xc3 0xa9.
- NFD and NFC should be treated as equivalent, but as of Bash 3.x - as found on macOS - they aren't: when globbing, Bash takes NFC (and also NFD) input as-is (either as typed in the terminal or as saved by most editors in UTF-8-encoded scripts) and matches it codepoint by codepoint against the filesystem's NFD representation, without recognizing equivalent NFC and NFD representations.
  In effect, that means that accented NFC characters typed in the terminal or produced by most editors do NOT match their NFD equivalents in the HFS+ filesystem.
- By contrast, specifying literal filenames - without globbing - is not affected: ls école, expressed as NFC, does find the file named école, which is stored in NFD - presumably because Bash just passes the NFC representation to a system function that does recognize the equivalence.
Read about these Unicode normal (normalization) forms here.
In short: Bash should recognize NFD and NFC representations as equivalent, but doesn't, as of the obsolete version that macOS 10.12.1 comes with - Bash 3.2.57.
While the problem has been fixed as of at least Bash 4.3.30 when run on macOS, Apple isn't updating to Bash 4.x versions for licensing reasons (see below for a solution).
See the bottom of this post for a look at the Linux world.
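To make the byte-level difference concrete, here is a minimal sketch that builds both forms of école from explicit byte escapes and compares them. The variable names (nfc, nfd, nfc_bytes, nfd_bytes) are illustrative and not part of the original answer; printf octal escapes are used so that the result does not depend on how a terminal or editor normalizes a typed é.

```shell
# NFC: é as the single code point U+00E9 (UTF-8 bytes c3 a9 = octal 303 251)
nfc=$(printf '\303\251cole')
# NFD: e followed by combining acute U+0301 (UTF-8 bytes cc 81 = octal 314 201)
nfd=$(printf 'e\314\201cole')

# Count bytes, not characters; tr strips the padding some wc versions add.
nfc_bytes=$(printf '%s' "$nfc" | wc -c | tr -d ' ')
nfd_bytes=$(printf '%s' "$nfd" | wc -c | tr -d ' ')

printf 'NFC: %s bytes, NFD: %s bytes\n' "$nfc_bytes" "$nfd_bytes"

# A plain string comparison is byte-wise, so the two forms are unequal.
if [ "$nfc" = "$nfd" ]; then
  echo "identical byte sequences"
else
  echo "different byte sequences"
fi
```

The two strings render identically but are 6 and 7 bytes long respectively, which is why a codepoint-by-codepoint glob match between them fails.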
There are workarounds for globbing filenames with accented characters on macOS:
- [if feasible] Using Homebrew, install the latest 4.x Bash version and use it instead of the one that comes with macOS: brew install bash.
  - Note that if you use such a Bash version (>= 4.3.30), not only are the other workarounds described below no longer necessary, they actually stop working, because Bash then only supports NFC input as part of globbing patterns (but maps it correctly onto NFD equivalents in the filesystem).
- [robust, but more elaborate] Use iconv -t UTF-8-MAC to convert your Bash string literal from NFC to NFD so that it matches the filesystem representation: ls "$(iconv -t UTF-8-MAC <<<'école')."[pu]mdl
  - Alternatively, it is possible, but obscure and cumbersome, to use an ANSI C-quoted string to represent the exact NFD UTF-8 byte sequence: ls $'e\x{cc}\x{81}cole'.[pu]mdl
- [simpler, but suboptimal] Represent each accented character as <base-char>?, because, from Bash's perspective, the accented character, as reported by the filesystem, amounts to the base character e followed by another character (the combining diacritic; adjust accordingly for multiple combining diacritics). (This approach is obviously suboptimal, because it won't match just é, but any two-character sequence starting with e): ls e?cole.[pu]mdl
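For the original goal (checking from a script whether the file exists on both Linux and macOS), the workarounds above could be combined roughly as follows. This is only a sketch under stated assumptions: find_model is a hypothetical helper name, and the UTF-8-MAC encoding is assumed to be available only in the iconv that ships with macOS, so a failing iconv is treated as "no NFD fallback available".

```shell
# find_model NAME: print the first existing NAME.pmdl or NAME.umdl.
# Hypothetical helper, not part of the original answer.
find_model() {
  local name="$1" nfd f

  # First try the name exactly as given (sufficient on Linux and on
  # macOS with a Bash version that handles NFC/NFD equivalence).
  for f in "$name".[pu]mdl; do
    # An unmatched glob stays literal, so test for existence explicitly.
    [ -e "$f" ] && { printf '%s\n' "$f"; return 0; }
  done

  # Fallback: convert the name to NFD. iconv is assumed to fail where
  # the UTF-8-MAC encoding is unknown, in which case this is skipped.
  if nfd=$(printf '%s' "$name" | iconv -f UTF-8 -t UTF-8-MAC 2>/dev/null) &&
     [ -n "$nfd" ]; then
    for f in "$nfd".[pu]mdl; do
      [ -e "$f" ] && { printf '%s\n' "$f"; return 0; }
    done
  fi

  return 1
}
```

A call such as find_model "$filename" >/dev/null && echo "exists" then avoids parsing the output of ls altogether.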
The ext filesystem used by many Linux distros stores filenames exactly as specified:
In other words: a file created with an NFC name is stored as such, as is a file with an NFD name.
Therefore, ext considers NFC and NFD distinct forms, because their byte-level representations differ, so it even allows files of the (conceptually) same name that differ only in Unicode normal form - for instance, files named $'e\xcc\x81cole' and $'\xc3\xa9cole' are indistinguishable when printed by ls (école), but are distinct files(!).
Consequently - and appropriately - Bash versions on Linux do not recognize NFC / NFD equivalence, even in versions >= 4.3.30 (unlike on macOS).
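One way to check how a given filesystem treats the two forms is to create both names in a scratch directory and count the resulting entries. This is a sketch rather than a definitive test; the variable names are illustrative. Based on the description above, an ext filesystem is expected to report 2 entries and HFS+ 1.

```shell
# Work in a throwaway directory.
dir=$(mktemp -d)

nfc=$(printf '\303\251cole')    # é as U+00E9
nfd=$(printf 'e\314\201cole')   # e + combining acute U+0301

# Create one empty file per normal form.
: > "$dir/$nfc"
: > "$dir/$nfd"

# Count the directory entries ending in "cole" via the shell's own
# globbing (note: this overwrites the positional parameters).
set -- "$dir"/*cole
count=$#
echo "entries: $count"

rm -rf "$dir"
```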
Caveat: dash, which acts as /bin/sh on Ubuntu, for instance, as of Ubuntu 16.04 is not locale-aware (multi-byte character-encoding aware), at least when globbing: globbing symbol ? matches a single byte rather than a single character (as defined by the active locale's character encoding, as reflected in locale category LC_CTYPE, which is typically UTF-8). Thus, in order to match a single non-ASCII character, you need to know how many bytes the UTF-8 encoding of that character is composed of, and use a ? for each byte; for instance, NFC é (2 bytes) would have to be matched with ??.[1]
This may matter when you use globbing inside scripts whose shebang line is #!/bin/sh.
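If a script has to run under such a byte-oriented shell, the number of ? symbols needed can be derived from the byte count instead of being worked out by hand. The following is a minimal POSIX sh sketch; byte_wildcards is a hypothetical helper name, and its variables n and out are global because POSIX sh has no local.

```shell
# byte_wildcards STRING: print one "?" per byte of STRING.
# Hypothetical helper, not part of the original answer.
byte_wildcards() {
  # wc -c counts bytes regardless of locale; tr strips wc's padding.
  n=$(printf '%s' "$1" | wc -c | tr -d ' ')
  out=
  while [ "$n" -gt 0 ]; do
    out="$out?"
    n=$((n - 1))
  done
  printf '%s\n' "$out"
}

# NFC é is 2 bytes in UTF-8 (c3 a9 = octal 303 251).
e_acute_nfc=$(printf '\303\251')
pattern="$(byte_wildcards "$e_acute_nfc")cole.[pu]mdl"
printf '%s\n' "$pattern"    # prints ??cole.[pu]mdl
```

The pattern must then be expanded unquoted (for example ls $pattern) for the globbing to take place, and like the e?cole approach it can produce false positives.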
In practice, NFD strings are rarely encountered, so with NFC strings used both to create files and match them later by globs, the problem with differing Unicode normal forms that macOS experiences rarely surfaces on Linux.
[1] dash aims to be a fast, POSIX-compliant shell implementation (that is largely confined to POSIX features), but in this case it appears to fall short: the part of the POSIX spec describing the pattern-matching notation clearly talks about characters, not bytes: "A <question-mark> is a pattern that shall match any character." Support for multi-byte character encodings is described in the section on Character Sets.
Answer 2:
HFS+ (the Apple filesystem) is required to store Unicode strings in decomposed form, as opposed to precomposed characters (see here and here).
As a result, a character like é (Unicode code point U+00E9) is decomposed into the two characters e and ´ (Unicode code points U+0065 and U+0301, respectively).
You can see this difference by creating a clean empty directory and doing:
$ a='é'
$ echo "$a" >.text
$ touch "$a"
$ ls > .list
And then comparing the output of these two commands:
$ od -vAn -tx1c .text
c3 a9 0a
303 251 \n
$ od -vAn -tx1c .list
65 cc 81 0a
e 314 201 \n
Which are not equal.
You may try using this pattern in your system:
ls "e$(echo -e '\xcc\x81')cole".[pu]mdl
This simply expresses the fact that é is represented by two characters in the filesystem.
Note that this problem has been resolved in newer Bash versions.
Reference:
How to enter special characters so that bash terminal understands them
Answer 3:
é != é
$ echo "école." | xxd
00000000: c3a9 636f 6c65 0a ..cole.
$ echo "école." | xxd
00000000: 65cc 8163 6f6c 650a e..cole.
So by this we can see they are different characters:
$ echo -e "\x65\xCC\x81"
é
$ echo -e "\xC3\xA9"
é
You are not using the same character in your filename as set in your variable.
for i in {1..3}; do f="école"; ls "$f."[pu]mdl; echo "$i: $f."[pu]mdl; done
for i in {1..3}; do f="école"; ls "$f."[pu]mdl; echo "$i: $f."[pu]mdl; done
ls: école.[pu]mdl: No such file or directory
1: école.[pu]mdl
ls: école.[pu]mdl: No such file or directory
2: école.[pu]mdl
ls: école.[pu]mdl: No such file or directory
3: école.[pu]mdl
école.pmdl
1: école.[pu]mdl
école.pmdl
2: école.[pu]mdl
école.pmdl
3: école.[pu]mdl
This error can be difficult to reproduce, simply because copying and pasting the character from one place to another can cause it to be translated by the editor, shell, etc., changing it completely. It may look like the same character, but the underlying byte sequence is different.
Source: https://stackoverflow.com/questions/40062427/globbing-accented-files-in-bash