Unix sort treatment of underscore character

Anonymous (unverified), submitted 2019-12-03 02:56:01

Question:

I have two Linux machines on which Unix sort seems to behave differently. I believe I've narrowed it down to the treatment of the underscore character.

If I run sort tmp, where tmp contains the following two lines:

aa_d_hh
aa_dh_ey

one machine outputs

aa_d_hh
aa_dh_ey

(i.e. '_' precedes 'h') while the other outputs

aa_dh_ey
aa_d_hh

(i.e. 'h' precedes '_'). I need these machines to behave identically, as I later use sort -m to merge very large files.

Is there any way I can force sort to behave in one way or the other?

Thanks.

Answer 1:

You can set LC_COLLATE to traditional sort order just for your command:

env LC_COLLATE=C sort tmp

This won't change the current environment, only the one in which the sort command executes. With this, both machines should behave the same way.



Answer 2:

sort order depends on the current value of the environment variable LC_COLLATE. Check your local documentation for 'locale', 'setlocale', etc. Set LC_COLLATE to 'POSIX' on both machines, and the results should match.
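For the sort -m use case in the question, the initial sorts and the merge all need the same collation. The following is a minimal sketch, assuming two hypothetical input files, part1.txt and part2.txt, created here only for illustration. It sets LC_ALL rather than LC_COLLATE as a precaution, because LC_ALL, when set, overrides every individual LC_* variable:

```shell
# Hypothetical sample inputs, standing in for the real (large) files.
printf 'aa_dh_ey\naa_d_hh\n' > part1.txt
printf 'aa_d_zz\naa_da_ey\n' > part2.txt

# Sort each file with byte-wise (C locale) collation.
LC_ALL=C sort part1.txt > part1.sorted
LC_ALL=C sort part2.txt > part2.sorted

# Merge with the same collation; sort -m assumes its inputs are already
# sorted under the rules it is currently using.
LC_ALL=C sort -m part1.sorted part2.sorted > merged.txt

# Check that the merged file is sorted (prints nothing and exits 0 if so).
LC_ALL=C sort -c merged.txt
```

If the per-file sorts and the merge run on different machines, the same LC_ALL=C prefix would need to be used in each place.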



Answer 3:

The difference is due to your locale. Use the locale command to check the current settings.

There are a number of different locale categories, such as LC_COLLATE, LC_TIME, and LC_MESSAGES. You can change them all by setting the environment variable LC_ALL or LANG, or only the collation (sort) order by setting the environment variable LC_COLLATE. The locale C or POSIX is a basic locale defined by the standard; others include en_US (US English), fr_FR (French), etc.
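As a rough illustration of how these variables interact, the sketch below inspects the settings and then sorts the two lines from the question under explicit overrides. It relies on the standard precedence (LC_ALL overrides LC_COLLATE, which overrides LANG); the en_US.UTF-8 locale name is an assumption and may not be installed on every machine:

```shell
# Show every locale category as seen by the current shell.
locale

# LC_ALL, when set, takes precedence over the individual LC_* variables,
# so the reported collation locale here should be C, not en_US.UTF-8.
LC_ALL=C LC_COLLATE=en_US.UTF-8 locale | grep LC_COLLATE

# LC_COLLATE takes precedence over LANG, so this sorts in byte order
# even though LANG names another locale.
printf 'aa_dh_ey\naa_d_hh\n' | LANG=en_US.UTF-8 LC_COLLATE=C sort

# LC_ALL=C forces byte order regardless of the other variables.
printf 'aa_dh_ey\naa_d_hh\n' | LC_ALL=C sort
```

Running the first command on both machines is a quick way to see which variable differs between them.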



Answer 4:

This is likely caused by a difference in locale. In the en_US.UTF-8 locale, underscores (_) sort after letters and numbers, whereas in the POSIX C locale, which compares raw byte values, they sort after numbers and uppercase letters but before lowercase letters.
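The C-locale ordering follows directly from the character codes. A small sketch, using a hypothetical helper named char_code (not part of any standard tool) built on the POSIX printf rule that a leading quote in a numeric argument yields the character's code:

```shell
# char_code is a hypothetical helper: print the numeric code of one character.
char_code() {
    printf '%d' "'$1"
}

# Digits, uppercase letters, underscore, lowercase letters, in code order.
for ch in 0 9 A Z _ a z; do
    printf '%s = %s\n' "$ch" "$(char_code "$ch")"
done
```

The underscore (95) falls between 'Z' (90) and 'a' (97), which is why aa_d_hh sorts before aa_dh_ey under byte comparison.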

# won't change LC_COLLATE=C after execution
$ LC_COLLATE=C sort filename

You can also use sort --debug to show more information about the sorting behavior in general:

$ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar') | LC_COLLATE=en_US.UTF-8 sort --debug
sort: using en_US.UTF-8 sorting rules
foo0bar
fooabar
fooAbar
foo_bar

$ (echo 'foo_bar'; echo 'fooAbar'; echo 'foo0bar'; echo 'fooabar') | LC_COLLATE=C sort --debug
sort: using simple byte comparison
foo0bar
fooAbar
foo_bar
fooabar

As also shown in another answer, the VAR=value prefix used in the commands above forces LC_COLLATE=C for a single command, without modifying your shell environment.


