问题
I have the following text file (sort_test.txt):
PGA_scaffold1__77
PGA_scaffold2__36
PGA_scaffold3__111
PGA_scaffold4__129
PGA_scaffold5__109
PGA_scaffold6__104
PGA_scaffold7__69
PGA_scaffold8__63
PGA_scaffold9__45
PGA_scaffold10__49
PGA_scaffold11__79
PGA_scaffold12__71
PGA_scaffold13__52
PGA_scaffold14__91
PGA_scaffold15__101
PGA_scaffold16__33
PGA_scaffold17__51
PGA_scaffold18__69
When I try to sort the file with the following code, the sort output seems to be out of order (specifically, lines 9 and 10):
IN: awk -F"_" '{print $1"_"$2"_"$3"_"$4}' sort_test.txt | sort
OUT:
PGA_scaffold10__49
PGA_scaffold11__79
PGA_scaffold12__71
PGA_scaffold13__52
PGA_scaffold14__91
PGA_scaffold15__101
PGA_scaffold16__33
PGA_scaffold17__51
PGA_scaffold1__77
PGA_scaffold18__69
PGA_scaffold2__36
PGA_scaffold3__111
PGA_scaffold4__129
PGA_scaffold5__109
PGA_scaffold6__104
PGA_scaffold7__69
PGA_scaffold8__63
PGA_scaffold9__45
Why do lines 9 and 10 seem to be out of order?
Desired output:
PGA_scaffold10__49
PGA_scaffold11__79
PGA_scaffold12__71
PGA_scaffold13__52
PGA_scaffold14__91
PGA_scaffold15__101
PGA_scaffold16__33
PGA_scaffold17__51
PGA_scaffold18__69
PGA_scaffold1__77
PGA_scaffold2__36
PGA_scaffold3__111
PGA_scaffold4__129
PGA_scaffold5__109
PGA_scaffold6__104
PGA_scaffold7__69
PGA_scaffold8__63
PGA_scaffold9__45
If I modify the code to only print the first three fields, the sorting does what I expect:
IN: awk -F"_" '{print $1"_"$2"_"$3}' sort_test.txt | sort
OUT:
PGA_scaffold1_
PGA_scaffold10_
PGA_scaffold11_
PGA_scaffold12_
PGA_scaffold13_
PGA_scaffold14_
PGA_scaffold15_
PGA_scaffold16_
PGA_scaffold17_
PGA_scaffold18_
PGA_scaffold2_
PGA_scaffold3_
PGA_scaffold4_
PGA_scaffold5_
PGA_scaffold6_
PGA_scaffold7_
PGA_scaffold8_
PGA_scaffold9_
So, it appears that there's something about the fourth field that impacts the sorting, but it's not clear why.
The problem is, I need the initial sorting, but with lines 9 and 10 swapped.
Does anyone have any thoughts on why the sorting is happening like this and how I can modify it so that produces the expected output?
回答1:
Thanks so much to @Barmar for pointing me to the Unix & Linux forum!
I managed to indirectly find the answer to my problem in this post:
Is gnu coreutils sort broken?
The solution was to change my locale!
My locale looked like this:
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
After running:
$ export LC_COLLATE=C
I was able to get my desired sorting output.
来源:https://stackoverflow.com/questions/58615269/output-from-sort-does-not-appear-to-be-sorted