Output from sort does not appear to be sorted

半世苍凉 提交于 2020-11-30 00:34:49

问题


I have the following text file (sort_test.txt):

PGA_scaffold1__77
PGA_scaffold2__36
PGA_scaffold3__111
PGA_scaffold4__129
PGA_scaffold5__109
PGA_scaffold6__104
PGA_scaffold7__69
PGA_scaffold8__63
PGA_scaffold9__45
PGA_scaffold10__49
PGA_scaffold11__79
PGA_scaffold12__71
PGA_scaffold13__52
PGA_scaffold14__91
PGA_scaffold15__101
PGA_scaffold16__33
PGA_scaffold17__51
PGA_scaffold18__69

When I try to sort the file with the following code, the sort output seems to be out of order (specifically, lines 9 and 10):

IN: awk -F"_" '{print $1"_"$2"_"$3"_"$4}' sort_test.txt | sort

OUT:

PGA_scaffold10__49
PGA_scaffold11__79
PGA_scaffold12__71
PGA_scaffold13__52
PGA_scaffold14__91
PGA_scaffold15__101
PGA_scaffold16__33
PGA_scaffold17__51
PGA_scaffold1__77
PGA_scaffold18__69
PGA_scaffold2__36
PGA_scaffold3__111
PGA_scaffold4__129
PGA_scaffold5__109
PGA_scaffold6__104
PGA_scaffold7__69
PGA_scaffold8__63
PGA_scaffold9__45

Why do lines 9 and 10 seem to be out of order?

Desired output:

PGA_scaffold10__49
PGA_scaffold11__79
PGA_scaffold12__71
PGA_scaffold13__52
PGA_scaffold14__91
PGA_scaffold15__101
PGA_scaffold16__33
PGA_scaffold17__51
PGA_scaffold18__69
PGA_scaffold1__77
PGA_scaffold2__36
PGA_scaffold3__111
PGA_scaffold4__129
PGA_scaffold5__109
PGA_scaffold6__104
PGA_scaffold7__69
PGA_scaffold8__63
PGA_scaffold9__45

If I modify the code to only print the first three fields, the sorting does what I expect:

IN: awk -F"_" '{print $1"_"$2"_"$3}' sort_test.txt | sort

OUT:

PGA_scaffold1_
PGA_scaffold10_
PGA_scaffold11_
PGA_scaffold12_
PGA_scaffold13_
PGA_scaffold14_
PGA_scaffold15_
PGA_scaffold16_
PGA_scaffold17_
PGA_scaffold18_
PGA_scaffold2_
PGA_scaffold3_
PGA_scaffold4_
PGA_scaffold5_
PGA_scaffold6_
PGA_scaffold7_
PGA_scaffold8_
PGA_scaffold9_

So, it appears that there's something about the fourth field that impacts the sorting, but it's not clear why.

The problem is, I need the initial sorting, but with lines 9 and 10 swapped.

Does anyone have any thoughts on why the sorting is happening like this and how I can modify it so that produces the expected output?


回答1:


Thanks so much to @Barmar for pointing me to the Unix & Linux forum!

I managed to indirectly find the answer to my problem in this post:

Is gnu coreutils sort broken?

The solution was to change my locale!

My locale looked like this:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

After running:

$ export LC_COLLATE=C

I was able to get my desired sorting output.



来源:https://stackoverflow.com/questions/58615269/output-from-sort-does-not-appear-to-be-sorted

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!