Unix sort treatment of underscore character

我与影子孤独终老i posted on 2019-12-20 12:07:05

Question


I have two Linux machines on which Unix sort seems to behave differently. I believe I've narrowed it down to the treatment of the underscore character.

If I run sort tmp, where tmp contains the following two lines:

aa_d_hh
aa_dh_ey

one machine outputs

aa_d_hh
aa_dh_ey

(i.e. '_' precedes 'h') while the other outputs

aa_dh_ey
aa_d_hh

(i.e. 'h' precedes '_'). I need these machines to behave identically (as I use sort -m later, to merge very large files).

Is there any way I can force sort to behave in one way or the other?

Thanks.


Answer 1:


You can set LC_COLLATE to traditional sort order just for your command:

env LC_COLLATE=C sort tmp

This doesn't change the current environment, only the one in which the sort command executes. Both machines should then produce the same order.
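To make this concrete, here is a minimal sketch that recreates the two-line file from the question and sorts it with byte-order collation. The temporary file and the variable names are hypothetical. It uses LC_ALL rather than LC_COLLATE because LC_ALL, when set, takes precedence over LC_COLLATE, so this form also works on a machine where LC_ALL is already exported.

```shell
# Recreate the two-line input from the question in a temporary file
# (the path is hypothetical, chosen by mktemp).
tmp=$(mktemp)
printf '%s\n' 'aa_dh_ey' 'aa_d_hh' > "$tmp"

# LC_ALL overrides LC_COLLATE, so setting it here guarantees byte-order
# comparison for this one command regardless of the surrounding environment.
sorted=$(LC_ALL=C sort "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

In the C locale '_' (byte 95) compares lower than 'h' (byte 104), so this prints aa_d_hh first and aa_dh_ey second.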




Answer 2:


Sort order depends on the current value of the environment variable LC_COLLATE (unless LC_ALL is set, which overrides it). Check your local documentation for 'locale', 'setlocale', etc. Set LC_COLLATE to 'POSIX' on both machines, and the results should match.
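Since the question mentions merging with sort -m, the following sketch shows one way to keep the sort step and the merge step on the same collation. The chunk files and the csort wrapper are hypothetical names introduced for illustration; the only assumption about the real workflow is that each input is sorted before being merged.

```shell
# Hypothetical small files standing in for the large files in the question.
workdir=$(mktemp -d)
printf '%s\n' 'aa_dh_ey' 'aa_d_hh' > "$workdir/chunk1"
printf '%s\n' 'aa_da' 'aa_Dz' > "$workdir/chunk2"

# Hypothetical wrapper: every sort invocation goes through byte-order collation.
csort() { LC_ALL=C sort "$@"; }

# Sort each chunk, then merge. sort -m assumes its inputs are already sorted
# under the same rules it is using, so both steps must share one locale.
csort "$workdir/chunk1" > "$workdir/chunk1.sorted"
csort "$workdir/chunk2" > "$workdir/chunk2.sorted"
merged=$(csort -m "$workdir/chunk1.sorted" "$workdir/chunk2.sorted")
printf '%s\n' "$merged"

rm -rf "$workdir"
```

If the chunks were sorted under one locale and merged under another, sort -m could interleave lines incorrectly, which is why the wrapper is used for both steps.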




Answer 3:


This is likely caused by a difference in locale. In the en_US.UTF-8 locale, underscores (_) sort after letters and numbers, whereas in the POSIX C locale they sort after uppercase letters and numbers, but before lowercase letters.

# LC_COLLATE=C applies only to this command; the shell's own environment is unchanged
$ LC_COLLATE=C sort filename
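The C-locale ordering described above follows from plain ASCII byte values. As a small illustration (the variable name is hypothetical), this prints the numeric value of one digit, one uppercase letter, the underscore, and one lowercase letter:

```shell
# A leading single quote in a printf %d argument yields the numeric value
# of the character that follows it.
byte_values=$(for c in 0 A _ a; do printf '%s=%d\n' "$c" "'$c"; done)
printf '%s\n' "$byte_values"
```

This prints 0=48, A=65, _=95 and a=97, one per line, which matches the order digits, uppercase, underscore, lowercase.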

You can also use sort --debug to show more information about the sorting behavior in general:

$ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar') |
      LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
foo0bar
fooabar
fooAbar
foo_bar

$ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar') | 
      LC_COLLATE=C sort --debug
sort: using simple byte comparison
foo0bar
fooAbar
foo_bar
fooabar

As another answer also shows, prefixing the command with LC_COLLATE=C, as in the first example above, forces C collation for that single command without modifying your shell environment.




Answer 4:


The difference is due to your locale. Use the locale command to check the current settings.

There are a number of different locale categories, such as LC_COLLATE, LC_TIME, and LC_MESSAGES. You can override them all by setting the environment variable LC_ALL, set a default for all of them with LANG, or change only the collation (sort) order by setting the environment variable LC_COLLATE. The locale C or POSIX is a basic locale defined by the standard; others include en_US (US English), fr_FR (French), etc.
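To compare the two machines, it can help to see which variable actually decides the collation order. The sketch below is a hypothetical helper, not a standard tool: it applies the POSIX precedence (LC_ALL first, then LC_COLLATE, then LANG) to the current shell variables. The final fallback to C is an assumption; the default when nothing is set is implementation-defined.

```shell
# Hypothetical helper: report which locale name governs collation,
# following the precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    # ${VAR:-fallback} skips variables that are unset or empty.
    # The last fallback, C, is an assumption about the system default.
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

Running this, or simply the locale command, on both machines and comparing the output should show where the settings differ.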



Source: https://stackoverflow.com/questions/1184268/unix-sort-treatment-of-underscore-character
