Question
I have two Linux machines on which Unix sort seems to behave differently. I believe I've narrowed it down to the treatment of the underscore character.
If I run sort tmp, where tmp contains the following two lines:
aa_d_hh
aa_dh_ey
one machine outputs
aa_d_hh
aa_dh_ey
(i.e. '_' precedes 'h') while the other outputs
aa_dh_ey
aa_d_hh
(i.e. 'h' precedes '_'). I need these machines to behave identically (as I use sort -m later, to merge very large files).
Is there any way I can force sort to behave in one way or the other?
Thanks.
Answer 1:
You can set LC_COLLATE to the traditional sort order just for your command:
env LC_COLLATE=C sort tmp
This won't change the current environment, only the one in which the sort command executes. Both machines should then behave the same way.
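As a minimal sketch of the per-command scoping described above, the following uses the two lines from the question. It sets LC_ALL rather than LC_COLLATE, which is a substitution on my part: LC_ALL takes precedence over LC_COLLATE, so this also works on a machine where LC_ALL happens to be set already. The variable names before, after and sorted are only illustrative.

```shell
# Remember the shell's own LC_ALL (or "unset") before running sort.
before=${LC_ALL-unset}

# The assignment prefix applies to this one sort invocation only.
# In the C locale, comparison is by byte value, and '_' (0x5F) is lower than 'h' (0x68).
sorted=$(printf 'aa_dh_ey\naa_d_hh\n' | LC_ALL=C sort)

# The shell's own setting is untouched afterwards.
after=${LC_ALL-unset}

printf '%s\n' "$sorted"
printf 'LC_ALL before: %s, after: %s\n' "$before" "$after"
```

With byte comparison, aa_d_hh is printed before aa_dh_ey on any machine.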
Answer 2:
The sort order depends on the current value of the environment variable LC_COLLATE. Check your local documentation for 'locale', 'setlocale', etc. Set LC_COLLATE to 'POSIX' on both machines, and the results should match.
Answer 3:
This is likely caused by a difference in locale. In the en_US.UTF-8 locale, underscores (_) sort after letters and numbers, whereas in the POSIX C locale they sort after uppercase letters and numbers, but before lowercase letters.
# LC_COLLATE=C applies to this one command only; the shell's environment is unchanged
$ LC_COLLATE=C sort filename
You can also use sort --debug to show more information about the sorting behavior in general:
$ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar') |
LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
foo0bar
fooabar
fooAbar
foo_bar
$ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar') |
LC_COLLATE=C sort --debug
sort: using simple byte comparison
foo0bar
fooAbar
foo_bar
fooabar
As also shown in another answer, you can use the LC_COLLATE=C prefix form above to force the C collation order for a single command, without modifying your shell environment.
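Since the question mentions merging with sort -m, here is a hedged sketch of applying the same locale to every step of that workflow. The file names part1 and part2 and their contents are made up for illustration, and LC_ALL=C is used instead of LC_COLLATE=C so that an inherited LC_ALL cannot override it. sort -c only checks that a file is already sorted and exits non-zero if it is not.

```shell
# Scratch directory for the illustrative input files.
workdir=$(mktemp -d)

# Sort each part under the same locale that the merge will use.
printf 'aa_dh_ey\naa_d_hh\n' | LC_ALL=C sort > "$workdir/part1"
printf 'aa_da\naa_d_zz\n'    | LC_ALL=C sort > "$workdir/part2"

# Verify that both inputs are sorted under the C locale before merging.
LC_ALL=C sort -c "$workdir/part1"
LC_ALL=C sort -c "$workdir/part2"

# Merge the pre-sorted files; -m assumes its inputs are already in order.
merged=$(LC_ALL=C sort -m "$workdir/part1" "$workdir/part2")
printf '%s\n' "$merged"

rm -rf "$workdir"
```

If the parts were sorted under one locale and merged under another, sort -m could interleave lines in an inconsistent order, which is why the same setting is repeated on every command.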
Answer 4:
The difference is due to your locale. Use the locale command to check the current settings.
There are a number of different locale categories, such as LC_COLLATE, LC_TIME, and LC_MESSAGES. You can change them all by setting the environment variable LC_ALL or LANG, or only the collation (sort) order by setting the environment variable LC_COLLATE. Note that LC_ALL, when set, overrides LC_COLLATE and the other individual categories, while LANG only supplies the default for categories that are not set explicitly. The locale C or POSIX is a basic locale defined by the standard; others include en_US (US English), fr_FR (French), etc.
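The precedence between these variables can be sketched as a small shell function. effective_collate is a hypothetical helper, not a standard command, and the final fallback to C when nothing is set is an assumption (the default is implementation-defined, although C is common).

```shell
# Hypothetical helper: report which locale governs collation.
# Order of precedence: LC_ALL, then LC_COLLATE, then LANG.
# The trailing "C" fallback is an assumption about the system default.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

# Show the full locale settings if the locale command is available
# (its output differs from machine to machine).
if command -v locale >/dev/null 2>&1; then
    locale
fi

printf 'Collation is governed by: %s\n' "$(effective_collate)"
```

Running this on both machines is a quick way to see whether they differ in LC_ALL, LC_COLLATE, or LANG.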
Source: https://stackoverflow.com/questions/1184268/unix-sort-treatment-of-underscore-character