What's the difference between the --general-numeric-sort and --numeric-sort options in GNU sort?

野趣味 2020-12-02 11:55

sort provides two kinds of numeric sort. This is from the man page:

   -g, --general-numeric-sort
          compare according to general numeric         


        
3 Answers
  •  余生分开走
    2020-12-02 12:14

    In addition to the accepted answer, which mentions that -g allows scientific notation, I want to show the part that is most likely to cause undesirable behavior.
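
As a quick illustration of the scientific-notation point, here is a minimal sketch assuming GNU sort is installed. The sample values and variable names are made up for the example, and LC_ALL=C is set only to keep the result independent of the local environment.

```shell
# Three made-up values; 1e3 means 1000 in scientific notation.
input='1e3
2
10'

# -n stops parsing at the "e", so 1e3 is compared as 1.
n_sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -n)

# -g parses the exponent, so 1e3 is compared as 1000.
g_sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -g)

printf 'sort -n:\n%s\n' "$n_sorted"   # 1e3, 2, 10
printf 'sort -g:\n%s\n' "$g_sorted"   # 2, 10, 1e3
```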

    With -g:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -g myfile
    baa
    --inf
    --inf  
    --inf- 
    --inf--
    --inf-a
    --nnf
    nnf--
       nnn  
    tnan
    zoo
       naN
    Nana
    nani lol
    -inf
    -inf--
    -11
    -2
    -1
    1
    +1
    2
    +2
    0xa
    11
    +11
    inf
    

    Looking at this zoo, there are three important things to note:

    • Lines that start with NAN (e.g. Nana and nani lol) or -INF (single dash, not --INF) move to the end, but before the digits. Lines that start with INF move to the very end, after the digits, because INF means infinity.

    • NAN, INF, and -INF are matched case-insensitively.

    • Whitespace on either side of NAN, INF, and -INF is always ignored (regardless of LC_CTYPE). For other alphabetic lines, whether surrounding whitespace is ignored depends on the LC_COLLATE locale (e.g. LC_COLLATE=fr_FR.UTF-8 ignores it, but LC_COLLATE=us_EN.UTF-8 does not).
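
The ordering described in these points can be reproduced on a much smaller input. This is a sketch under the assumption that GNU sort is used (its -g mode relies on the C library's number parsing); the six sample lines are invented for the illustration.

```shell
# Mixed-case NaN/INF spellings plus one plain word and two finite numbers.
input='Inf
3
NaN
-INF
abc
-2'

g_sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -g)
printf '%s\n' "$g_sorted"
# Expected order: abc, NaN, -INF, -2, 3, Inf
# (non-numbers first, then NaN, then minus infinity, finite numbers, plus infinity)
```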

    So if you are sorting arbitrary alphanumeric data, you probably don't want -g. If you really need scientific-notation comparison with -g, you probably want to extract the alphabetic and the numeric data and compare them separately.
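
One possible way to do that separation is sketched below. It assumes whole lines are either numbers or text; the regular expression `num_re` and the sample data are hypothetical and would need adjusting for real input.

```shell
data='zoo
1.5e2
nani lol
-3
20'

# Hypothetical pattern for "looks like a decimal number, optionally with an exponent".
num_re='^[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?$'

# Numeric lines go through -g; everything else gets a plain byte-wise sort.
numeric=$(printf '%s\n' "$data" | grep -E "$num_re" | LC_ALL=C sort -g)
other=$(printf '%s\n' "$data" | grep -Ev "$num_re" | LC_ALL=C sort)

printf '%s\n%s\n' "$other" "$numeric"
# nani lol, zoo, -3, 20, 1.5e2
```

With this split, a line such as "nani lol" can no longer be mistaken for NaN, because it never reaches sort -g.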

    If you only need to sort ordinary numbers (e.g. 1, -1) and don't care about how 0x/E/+ are sorted, -n is enough:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -n myfile
    -1000
    -22
    -13
    -11
    -010
    -10
    -5
    -2
    -1
    -0.2
    -0.12
    -0.11
    -0.1
    0x1
    0x11
    0xb
    +1
    +11
    +2
    -a
    -aa
    --aa
    -aaa
    -b
    baa
    BAA
    bbb
    +ignore
    inf
    -inf
    --inf
    --inf  
    --inf- 
    --inf--
    -inf--
    --inf-a
       naN
    Nana
    nani lol
    --nnf
    nnf--
       nnn  
    None         
    uum
    Zero cool
    -zzz
    1
    1.1
    1.234E10
    5
    11
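
The 0x and + handling can be compared side by side on a tiny input. This is only a sketch assuming GNU sort; the four values are made up, and LC_ALL=C makes the tie-breaking between equal keys a plain byte comparison.

```shell
input='3
+1
-2
0xa'

# -n does not recognize a leading "+" or a hex prefix: "+1" and "0xa" both count as 0.
n_sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -n)

# -g reads "+1" as 1 and "0xa" as 10.
g_sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -g)

printf 'sort -n:\n%s\n' "$n_sorted"   # -2, +1, 0xa, 3
printf 'sort -g:\n%s\n' "$g_sorted"   # -2, +1, 3, 0xa
```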
    

    With either -g or -n, be aware of locale effects. You may want to set LC_NUMERIC to en_US.UTF-8, because with fr_FR.UTF-8 negative floating-point numbers are sorted incorrectly:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=fr_FR.UTF-8 sort -n myfile
    -10
    -5
    -2
    -1
    -1.1
    -1.2
    -0.1
    -0.11
    -0.12
    -0.2
    -a
    +b
    middle
    -wwe
    +zoo
    1
    1.1
    

    With LC_NUMERIC=en_US.UTF-8:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=en_US.UTF-8 sort -n myfile
    -10
    -5
    -2
    -1.2
    -1.1
    -1
    -0.2
    -0.12
    -0.11
    -0.1
    -a
    +b
    middle
    -wwe
    +zoo
    1
    1.1
    

    Or use LC_NUMERIC=us_EN.UTF-8 to group lines starting with +, - or a space together with the alphabetic lines:

    $ LC_COLLATE=fr_FR.UTF-8 LC_NUMERIC=us_EN.UTF-8 sort -n myfile
    -0.1
        a
        b
     a
     b
    +b
    +zoo
    -a
    -wwe
    middle
    1
    

    You probably want to specify the locale explicitly when using sort in a script that is meant to be portable.
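
A minimal sketch of pinning the locale in a script follows. The helper name `sort_numeric` and the sample numbers are hypothetical. It uses LC_ALL=C, which takes precedence over LC_NUMERIC and LC_COLLATE, and the fr_FR.UTF-8 comparison only runs if `locale -a` reports that locale as installed.

```shell
nums='-1
-1.2
-0.1
-1.1'

# Hypothetical helper: numeric sort with every locale category pinned to C,
# so "." is always the decimal point regardless of the caller's environment.
sort_numeric() {
  LC_ALL=C sort -n
}

pinned=$(printf '%s\n' "$nums" | sort_numeric)
printf 'pinned to C:\n%s\n' "$pinned"   # -1.2, -1.1, -1, -0.1

# For comparison, try fr_FR.UTF-8 only when it is actually installed.
if locale -a 2>/dev/null | grep -qiE '^fr_FR\.utf-?8$'; then
  printf 'fr_FR.UTF-8:\n'
  printf '%s\n' "$nums" | LC_ALL=fr_FR.UTF-8 sort -n
fi
```

Setting the variable on the single command, rather than exporting it for the whole script, leaves the locale of other commands untouched.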
