问题
I have the following shell command:
awk -F'\[|\]' '{print $2}'
What is this command doing? Split into fields using as delimiter [sometext]?
E.g.:
$ echo "this [line] passed to awk" | awk -F'\[|\]' '{print $2}'
line
Editor's note: Only Mawk, as used on Ubuntu by default, produces the output above.
回答1:
The apparent intent is to treat literal [ and ] as field-separator characters, i.e., to split each input record into fields by each occurrence of [ and/or ], which, with the sample line, yields this as field 1 ($1), line as field 2 ($2), and passed to awk as the last field ($3).
This is achieved by a regex (regular expression) that uses alternation (|), either side of which defines a field separator (delimiter): \[ and \] in a regex are needed to represent literal [ and ], because, by default, [ and ] are so-called metacharacters (characters with special syntactical meaning).
Note that awk always interprets the value of the FS variable (-F option) as a regex.
However, the correct form is '\\[|\\]':
$ echo "this [line] passed to awk" | awk -F'\\[|\\]' '{print $2}'
line
That said, a more concise version that uses a character set ([...]) rather than alternation (|) is:
$ echo "this [line] passed to awk" | awk -F'[][]' '{print $2}'
line
Note the careful placement of ] before [ inside the enclosing [...] to make this work, and how the enclosing [...] now have special meaning: they enclose a set of characters, any of which matches.
As for why 2 \ instances are needed in '\\[|\\]':
Taken as a regex in isolation, \[|\] would work:
\[matches literal[\]matches literal]|is an alternation that matches one or the other.
However, Awk's string processing comes first:
It should, due to
\handling in a string, reduce\[|\]to[|]before interpretation as a regex.- Unfortunately, however, Mawk, the default Awk on Ubuntu, for instance, resorts to guesswork in this particular scenario.[1]
[|], interpreted as a regex, would then only match a single, literal|
Thus, the robust and portable way is to use \\ in a string literal when you mean to pass a single \ as part of a regex.
This quote from the relevant section of the GNU Awk manual sums it up well:
To get a backslash into a regular expression inside a string, you have to type two backslashes.
[1] Implementation differences:
Unfortunately, at least 1 major Awk implementation resorts to guesswork in the presence of a single \ before a regex metacharacter inside a string literal.
BSD/macOS Awk and GNU Awk act predictably and GNU Awk also issues a helpful warning when a singly \-prefixed regex metacharacter is found:
# GNU Awk: Predictable string-first processing + a helpful warning.
echo 'a[b]|c' | gawk -F'\[|\]' '{print $2}'
gawk: warning: escape sequence '\[' treated as plain '['
gawk: warning: escape sequence '\]' treated as plain ']'
c
# BSD/macOS Awk: Predictable string-first processing, no warning.
echo 'a[b]|c' | awk -F'\[|\]' '{print $2}'
c
# Mawk: *Guesses* that a *regex* was intended.
# The unambiguous form -F'\\[|\\]' works too, fortunately.
echo 'a[b]|c' | mawk -F'\[|\]' '{print $2}'
b
Optional reading: regex literals inside Awk scripts
Awk supports regex literals enclosed in /.../, the use of which bypasses the double-escaping problem.
However:
- These literals (which are invariably constant) are only available inside an Awk script,
- and, it seems, you can only use them as patterns or function arguments - you cannot store them in a variable.
Therefore, even though /\[|\]/ is in principle equivalent to "\\[|\\]", you can not use the following, because the regex literal cannot be assigned to (special) variable FS:
# !! DOES NOT WORK in any of the 3 major Awk implementations.
# Note that nothing is output, and no error/warning is displayed.
$ echo 'a[b]|c' | awk 'BEGIN { FS=/\[|\]/ } { print $2 }'
# Using a double-escaped *string* to house the regex again works as expected:
$ echo 'a[b]|c' | awk 'BEGIN { FS="\\[|\\]" } { print $2 }'
b
来源:https://stackoverflow.com/questions/44072715/understanding-awk-delimiter-escaping-in-a-regex-based-field-separator