Question

I'm trying to verify a file exists in Bash. I know the file name (in a variable) but not the extension (can be .pmdl or .umdl).

On OSX, this works:

$> ls
ecole.pmdl
$> filename="ecole"
$> ls "$filename."[pu]mdl
ecole.pmdl

But it doesn't when the file name contains an accent:

$> ls
école.pmdl
$> filename="école"
$> ls "$filename."[pu]mdl
ls: école.[pu]mdl: No such file or directory

However it works if I don't use globbing:

$> ls "$filename."pmdl
école.pmdl

I'm looking for a simple solution that works in both Linux & OSX. This is the closest question I found on that topic.

Edit:

$> bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.

Edit 2:

Short version showing that the scenario fails consistently with the same é character on OSX Bash v3.2.57. The same scenario works consistently on Linux Bash 4.3.30 (the file is found).

$> touch é.txt
$> ls é*
ls: é*: No such file or directory

Answer 1:

tl;dr

Either: Use one of the following workarounds:
- ls "$(iconv -t UTF-8-MAC <<<'école')."[pu]mdl - most generic, but cumbersome.
- ls $'e\xcc\x81cole'.[pu]mdl - hard to remember, and specific to the diacritic at hand (acute accent, ´).
- ls e?cole.[pu]mdl - simple to type and remember, but limited to 1 combining diacritic and can yield false positives.
Or: install Bash 4.3.30 or higher via Homebrew and use it instead of the Bash 3.x that macOS still comes with: brew install bash.

Gory details below.

With respect to non-ASCII characters,

the macOS filesystem, HFS+, speaks only NFD (decomposed Unicode normalization form), where accented letters are represented by 2 or more Unicode codepoints: the base letter, followed by the combining diacritic(s) (accent mark(s)):
- In the case of é:
  - The ASCII base letter - e (U+0065, UTF-8 encoding 0x65)
  - followed by the combining acute accent (the ´ above the preceding base letter, U+0301, UTF-8 encoding 0xcc 0x81).
- Some accented characters decompose to a base letter followed by multiple combining diacritics, such as in the case of Ṹ.
- Note that the filesystem accepts NFC strings (see next point) when creating files and matching filenames literally, and automatically translates them to their NFD equivalent (decomposes them).
- As an aside: a notable critic of HFS+ in general and its use of NFD in particular is Linus Torvalds, as expressed in this article.
Typically, however - such as when you type characters in a terminal or in most editors - NFC (composed Unicode normalization form) is used, where (customary) accented letters are represented by 1 Unicode codepoint:
- In the case of é: single Unicode character U+00E9, UTF-8 encoding 0xc3 0xa9.
- NFD and NFC should be treated as equivalent, but Bash 3.x, as found on macOS, does not do so: when globbing, it takes NFC (and also NFD) input as-is (either as typed in the terminal or as saved by most editors in UTF-8-encoded scripts) and matches it codepoint by codepoint against the filesystem's NFD representation, without recognizing equivalent NFC and NFD representations.
  In effect, that means that accented NFC characters typed in the terminal or as produced by most editors do NOT match their NFD equivalents in the HFS+ filesystem.
- By contrast, specifying literal filenames - without globbing - is not affected: ls école, expressed as NFC, does find the file named école, which is stored in NFD - presumably, because Bash just passes the NFC representation to a system function that does recognize the equivalence.

Read about these Unicode normal (normalization) forms here.
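To make the two forms concrete, here is a minimal sketch that builds both representations of é from the byte values quoted above and dumps them. The variable names are illustrative only. printf octal escapes are used so the snippet also runs in shells without $'...' quoting (octal 303 251 is hex c3 a9, and octal 314 201 is hex cc 81).

```shell
# NFC: é as the single codepoint U+00E9 (UTF-8 bytes c3 a9).
nfc=$(printf '\303\251')
# NFD: base letter e (65) followed by combining acute U+0301 (cc 81).
nfd=$(printf 'e\314\201')

# Dump the raw bytes of each form.
printf 'NFC bytes:'; printf '%s' "$nfc" | od -An -tx1
printf 'NFD bytes:'; printf '%s' "$nfd" | od -An -tx1

# A plain string comparison is byte-wise, so the two forms differ.
if [ "$nfc" = "$nfd" ]; then
  verdict="same bytes"
else
  verdict="different bytes"
fi
echo "$verdict"
```

Both strings typically render as the same glyph in a terminal, which is why the mismatch is easy to overlook.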

In short: Bash should recognize NFD and NFC representations as equivalent, but doesn't, as of the obsolete version that macOS 10.12.1 comes with - Bash 3.2.57.

While the problem has been fixed as of at least Bash 4.3.30 when run on macOS, Apple isn't updating to Bash 4.x versions for licensing reasons (see below for a solution).

See the bottom of this post for a look at the Linux world.

There are workarounds for globbing filenames with accented characters on macOS:

[if feasible] Using Homebrew, install the latest 4.x Bash version and use it instead of the one that comes with macOS: brew install bash.
- Note that if you use such a Bash version (>= 4.3.30), not only are the other workarounds described below no longer necessary, they actually stop working, because Bash then only supports NFC input as part of globbing patterns (but maps it correctly onto NFD equivalents in the filesystem).
[robust, but more elaborate] Use iconv -t UTF-8-MAC to convert your Bash string literal from NFC to NFD so that it matches the filesystem representation:
ls "$(iconv -t UTF-8-MAC <<<'école')."[pu]mdl
- Alternatively, it is possible, but obscure and cumbersome, to use an ANSI C-quoted string to represent the exact NFD UTF-8 byte sequence:
  ls $'e\xcc\x81cole'.[pu]mdl
[simpler, but suboptimal] Represent each accented character as <base-char>?, because, from Bash's perspective, the accented character, as reported by the filesystem, amounts to the base character e followed by another character (the combining diacritic; adjust accordingly for multiple combining diacritics). (This approach is obviously suboptimal, because it won't match just é, but any two-character sequence starting with e):
ls e?cole.[pu]mdl
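Since literal filenames are reported above as unaffected, one further option is to skip globbing and test each candidate extension literally. The sketch below is an assumption layered on that observation, not something proposed in this answer, and it has not been verified on macOS here. find_model is a hypothetical helper name.

```shell
# find_model BASENAME: hypothetical helper, not part of the original answer.
# Prints the first existing BASENAME.pmdl or BASENAME.umdl and returns 0;
# returns 1 if neither file exists. No glob pattern is involved.
find_model() {
  local ext
  for ext in pmdl umdl; do
    if [ -e "$1.$ext" ]; then
      printf '%s\n' "$1.$ext"
      return 0
    fi
  done
  return 1
}

# Usage: the name is passed as typed, in whatever normal form it has.
filename="école"
if model=$(find_model "$filename"); then
  echo "found: $model"
else
  echo "no .pmdl or .umdl file for $filename"
fi
```

The trade-off is that the list of extensions is spelled out explicitly, which is manageable for two extensions but does not generalize to arbitrary patterns.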

The ext filesystem used by many Linux distros stores filenames exactly as specified:

In other words: a file created with an NFC name is stored as such, as is a file with an NFD name.

Therefore, ext considers NFC and NFD distinct forms, because their byte-level representations differ, so it even allows files of the (conceptually) same name that differ only in Unicode normal form - for instance, files named $'e\xcc\x81cole' and $'\xc3\xa9cole' are indistinguishable when printed by ls (école), but are distinct files(!).

Consequently - and appropriately - Bash versions on Linux do not recognize NFC / NFD equivalence, even in versions >= 4.3.30 (unlike on macOS).
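The behavior described above can be checked with a small sketch, assuming a byte-preserving filesystem such as ext4. It creates one file per normal form in a scratch directory, then globs with the NFC form only. All variable names are illustrative.

```shell
# Scratch directory so the demo does not touch real files.
dir=$(mktemp -d)

nfc=$(printf '\303\251')     # é as U+00E9
nfd=$(printf 'e\314\201')    # e followed by U+0301

# Two names that look identical when printed but differ at the byte level.
touch "$dir/${nfc}cole.pmdl" "$dir/${nfd}cole.pmdl"

# Number of directory entries (2 when the filesystem stores names as given).
total=$(ls -1 "$dir" | wc -l | tr -d ' ')

# Count the files matched by a glob whose literal part is in NFC.
nfc_hits=0
for f in "$dir/$nfc"cole.[pu]mdl; do
  if [ -e "$f" ]; then
    nfc_hits=$((nfc_hits + 1))
  fi
done

echo "entries: $total, matched by NFC pattern: $nfc_hits"
rm -rf "$dir"
```

On a normalizing filesystem such as HFS+ the two touch arguments would refer to the same file, so the counts would differ there.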

Caveat: dash, which acts as /bin/sh on Ubuntu, for instance, as of Ubuntu 16.04 is not locale-aware (multi-byte character-encoding aware), at least when globbing: globbing symbol ? matches a single byte rather than a single character (as defined by the active locale's character encoding, as reflected in locale category LC_CTYPE, which is typically UTF-8). Thus, in order to match a single non-ASCII character, you need to know how many bytes the UTF-8 encoding of that character is composed of, and use a ? for each byte; for instance, NFC é (2 bytes) would have to be matched with ?? [1].

This may matter when you use globbing inside scripts whose shebang line is #!/bin/sh.
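A rough way to see byte-wise versus character-wise matching is to ask how many ? wildcards a case pattern needs to match NFC é. The sketch below forces the C locale in a subshell, where matching is byte-wise, and also reports the result for the current locale, which depends on the shell and locale in use. count_qmarks is a hypothetical helper name.

```shell
# count_qmarks STRING: hypothetical helper.
# Reports how many ? wildcards are needed to match the whole of STRING.
count_qmarks() {
  case $1 in
    ?)   echo 1 ;;
    ??)  echo 2 ;;
    ???) echo 3 ;;
    *)   echo other ;;
  esac
}

e_acute=$(printf '\303\251')   # NFC é, 2 bytes in UTF-8

# In the C locale every byte counts as one character.
bytewise=$(LC_ALL=C; count_qmarks "$e_acute")

# In the current locale the answer depends on the shell's multi-byte support.
current=$(count_qmarks "$e_acute")

echo "C locale: $bytewise, current locale: $current"
```

Running the same file with bash in a UTF-8 locale and with dash should make the difference described above visible in the second number.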

In practice, NFD strings are rarely encountered, so with NFC strings used both to create files and match them later by globs, the problem with differing Unicode normal forms that macOS experiences rarely surfaces on Linux.

[1] dash aims to be a fast, POSIX-compliant shell implementation (that is largely confined to POSIX features), but in this case it appears to fall short: the part of the POSIX spec. describing the pattern-matching notation clearly talks about characters, not bytes: A <question-mark> is a pattern that shall match any character.

Support for multi-byte character encodings is described in the section on Character Sets.

Answer 2:

HFS+ (the Apple filesystem) requires Unicode strings to be stored in decomposed form (as opposed to pre-composed characters).

As a result, a character like é (Unicode code point U+00E9) is decomposed into two characters: e (U+0065) and the combining acute accent ´ (U+0301).

You can see this difference by creating a clean empty directory and doing:

$ a='é'
$ echo "$a" >.text
$ touch "$a"
$ ls > .list

And then comparing the output of these two commands:

$ od -vAn -tx1c .text
  c3  a9  0a
 303 251  \n

$ od -vAn -tx1c .list
  65  cc  81  0a
   e 314 201  \n

Which are not equal.

You may try using this pattern in your system:

ls "e$(echo -e '\xcc\x81')cole".[pu]mdl

This simply expresses the fact that é is represented by two characters in the filesystem.

Note that this problem has been resolved in newer Bash versions.

Reference:

How to enter special characters so that bash terminal understands them

Answer 3:

é != é

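The first command uses the precomposed é (U+00E9); the second uses e followed by the combining acute accent (U+0301):

$ echo "école" | xxd
00000000: c3a9 636f 6c65 0a                        ..cole.

$ echo "école" | xxd
00000000: 65cc 8163 6f6c 650a                      e..cole.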

So by this we can see they are different characters:

$ echo -e "\x65\xCC\x81"
é
$ echo -e "\xC3\xA9"
é

You are not using the same character in your filename as set in your variable.

With the precomposed é in the variable, the glob finds nothing:

$ for i in {1..3}; do f="école"; ls "$f."[pu]mdl; echo "$i: $f."[pu]mdl; done
ls: école.[pu]mdl: No such file or directory
1: école.[pu]mdl
ls: école.[pu]mdl: No such file or directory
2: école.[pu]mdl
ls: école.[pu]mdl: No such file or directory
3: école.[pu]mdl

With the decomposed é (e followed by U+0301) in the variable, the glob matches:

$ for i in {1..3}; do f="école"; ls "$f."[pu]mdl; echo "$i: $f."[pu]mdl; done
école.pmdl
1: école.[pu]mdl
école.pmdl
2: école.[pu]mdl
école.pmdl
3: école.[pu]mdl

This error can be difficult to reproduce simply because copying and pasting the character from one place to another can get translated by the editor, shell, etc. completely changing it. It may look like the same character, but it's genuinely different by seemingly indistinguishable details.
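Because the two forms look alike on screen, a byte-level check can help tell which one a string holds. The sketch below uses a hypothetical helper, has_combining_mark, that looks for the UTF-8 lead byte 0xcc, which in valid UTF-8 only occurs in the combining marks U+0300 to U+033F (this range includes the acute accent U+0301). It is a narrow heuristic under that assumption, not a general normalization test.

```shell
# has_combining_mark STRING: hypothetical helper.
# Succeeds if STRING contains the byte 0xcc, the UTF-8 lead byte of the
# combining marks U+0300..U+033F. od prints each byte as two hex digits
# separated by whitespace, so "cc" can only come from a single 0xcc byte.
has_combining_mark() {
  printf '%s' "$1" | od -v -An -tx1 | grep -q 'cc'
}

composed=$(printf '\303\251cole')     # école with é as U+00E9
decomposed=$(printf 'e\314\201cole')  # école with e + U+0301

for s in "$composed" "$decomposed"; do
  if has_combining_mark "$s"; then
    echo "$s: contains a combining mark (decomposed)"
  else
    echo "$s: no combining mark in U+0300..U+033F found"
  fi
done
```

Applied to a variable and to a name taken from ls output, such a check shows whether the two really hold the same form.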

Source: https://stackoverflow.com/questions/40062427/globbing-accented-files-in-bash

Tags

bash

macos

glob

Globbing accented files in Bash
