Question
I have a problem using the match function in awk on a string containing special characters. Consider the file test.awk:
{
match($0,"(^.*)kon",a);
print a[1];
}
and a corresponding test file "test.txt" with contents "Testing Håkon" (note the Norwegian character "å"). The file is encoded in ISO-8859-1 and is 14 bytes long. The hex dump of the file, as given by xxd -p test.txt, is
54657374696e672048e56b6f6e0a
From this we can see that the Norwegian character "å" has been encoded as the single byte "e5", which confirms that the file uses ISO-8859-1 encoding.
Running
awk -f test.awk test.txt
gives nothing at the terminal, whereas the correct output should have been "Testing Hå".
The output of running the locale command is:
LANG=en_DK.UTF-8
LANGUAGE=en_US:
LC_CTYPE="en_DK.UTF-8"
LC_NUMERIC="en_DK.UTF-8"
LC_TIME="en_DK.UTF-8"
LC_COLLATE="en_DK.UTF-8"
LC_MONETARY="en_DK.UTF-8"
LC_MESSAGES="en_DK.UTF-8"
LC_PAPER="en_DK.UTF-8"
LC_NAME="en_DK.UTF-8"
LC_ADDRESS="en_DK.UTF-8"
LC_TELEPHONE="en_DK.UTF-8"
LC_MEASUREMENT="en_DK.UTF-8"
LC_IDENTIFICATION="en_DK.UTF-8"
LC_ALL=
which shows that the "LANG" variable is set to a UTF-8 locale.
Answer 1:
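This isn't a problem with awk. Your locale expects UTF-8 encoding but your file uses ISO-8859-1. In UTF-8 the byte e5 is only valid as the start of a three-byte sequence, so when it is followed by "k" it is not a valid character and the regexp fails to match. Either set your locale to match your file, or convert the file to match your locale.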
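One way to confirm the mismatch is to ask iconv whether the file is valid UTF-8. This is a minimal sketch, assuming a POSIX shell and an iconv that exits with a non-zero status on invalid input (GNU iconv does). The file is recreated with printf so the check is self-contained, and the variable name valid_utf8 is made up for the illustration.

```shell
# Recreate the test file from the question; \345 is octal for the byte 0xe5.
printf 'Testing H\345kon\n' > test.txt

# Converting UTF-8 to UTF-8 is a no-op for valid input and fails otherwise.
if iconv -f UTF-8 -t UTF-8 test.txt > /dev/null 2>&1; then
    valid_utf8=yes
else
    valid_utf8=no
fi
echo "valid UTF-8: $valid_utf8"
```

For the file from the question this should report "no".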
Note: the second argument of match() should be a regexp constant rather than a string, and the trailing semicolons are not required:
{
match($0,/(^.*)kon/,a)
print a[1]
}
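If changing the locale for the whole session is not an option, it can be overridden for a single command. The sketch below assumes a POSIX awk and iconv. It uses the C locale so that awk treats every byte as one character, and the portable two-argument match() with RSTART and RLENGTH instead of the gawk-only array argument. The file names out.latin1 and out.utf8 are made up for the illustration.

```shell
# Recreate the 14-byte ISO-8859-1 test file; \345 is octal for 0xe5 ("å").
printf 'Testing H\345kon\n' > test.txt

# In the C locale every byte counts as one character, so ".*" can match 0xe5.
LC_ALL=C awk '{
    if (match($0, /^.*kon/))
        print substr($0, RSTART, RLENGTH - 3)   # cut off the trailing "kon"
}' test.txt > out.latin1

# The result is still ISO-8859-1; convert it for display on a UTF-8 terminal.
iconv -f ISO-8859-1 -t UTF-8 out.latin1 > out.utf8
cat out.utf8
```

On a UTF-8 terminal the last command prints "Testing Hå".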
Answer 2:
I've modified your code as:
{
match($0,"(^.*)kon",a);
print ">>>" a[1] "<<<";
}
The result of running it with GNU Awk 3.1.6 under Windows 7:
>>>Hå<<<
Under Ubuntu running GNU Awk 3.1.8 I get:
>>><<<
To get the desired output, I had to temporarily change the locale settings and translate:
LC_ALL=ISO_8859-1 awk -f test.awk test.txt | iconv -f ISO_8859-1 -t UTF-8
Source: https://stackoverflow.com/questions/16760493/using-special-characters-in-a-string-argument-to-the-awk-match-function-current