Linux join utility complains about input file not being sorted

爱一瞬间的悲伤 2021-01-03 01:35

I have two files:

file1 has the format:

field1;field2;field3;field4

(file1 is initially unsorted)

file2 has the format:

3 Answers
  • 2021-01-03 01:53

    To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post researching a slightly different problem).

    • When using join, the input files must be sorted by the join field ONLY, otherwise you may see the warning reported by the OP.

    There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:

    • If you do specify a field, it's easy to forget that you must also specify a stop field - even if you target only 1 field - because sort uses the remainder of the line if only a start field is specified; e.g.:

      • sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
      • sort -t, -k1,1 ... # Field 1 only
    • If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.

      • However, if field values can be prefix substrings of each other, sorting whole lines will NOT (necessarily) result in the same sort order as just sorting by the 1st field:
      • sort ... # NOT always the same as 'sort -k1,1'! see below for example

    Pitfall example:

    #!/usr/bin/env bash
    
    # Input data: fields separated by '^'.
    # Note that, when properly sorting by field 1, the order should
    # be "nameA" before "nameAA" (followed by "nameZ").
    # Note how "nameA" is a substring of "nameAA".
    read -r -d '' input <<EOF
    nameA^other1
    nameAA^other2
    nameZ^other3
    EOF
    
    # NOTE: "WRONG" below refers to deviation from the expected outcome
    #       of sorting by field 1 only, based on mistaken assumptions.
    #       The commands do work correctly in a technical sense.
    
    echo '--- just sort'
    sort <<<"$input" | head -n 1 # WRONG: 'nameAA' comes first
    
    echo '--- sort FROM field 1'
    sort -t^ -k1 <<<"$input" | head -n 1 # WRONG: 'nameAA' comes first
    
    echo '--- sort with field 1 ONLY'
    sort -t^ -k1,1 <<<"$input" | head -n 1 # ok, 'nameA' comes first
    

    Explanation:

    • When sorting is NOT limited to the first field, it is the relative sort order of the characters ^ and A (at character position 6) that matters in this example. In other words, the field separator is compared against data, which is the source of the problem: ^ (ASCII 94) has a HIGHER ASCII value than A (ASCII 65) and therefore sorts after A, so the line starting with nameAA^ sorts BEFORE the one starting with nameA^.
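
The comparison described above can be checked directly. The following is a minimal sketch, assuming a POSIX shell and forcing the C locale so that sort compares raw bytes; the variable names are only illustrative.

```shell
# Code points of the two characters that meet at character position 6.
caret_code=$(printf '%d' "'^")   # the field separator
a_code=$(printf '%d' "'A")       # the data character
echo "^ = $caret_code, A = $a_code"

# Whole-line sort in the C locale: the separator is compared against data,
# so the line starting with nameAA^ ends up first.
first_line=$(printf '%s\n' 'nameA^other1' 'nameAA^other2' | LC_ALL=C sort | head -n 1)
echo "$first_line"
```

Restricting the key with -t^ -k1,1 keeps the separator out of the comparison altogether.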

    • Note: It is possible for problems to surface on one platform, but be masked on another, based on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:

      • sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with ASCII values)
      • sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: sorts - before ,[1]

    [1] I don't know why - if somebody knows, please tell me. Test with sort <<<$'-\n,'
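
One way to take the platform and locale out of the picture, assuming plain byte order is acceptable for the data, is to force the C locale for the sort. This is a sketch rather than an explanation of the Ubuntu behavior noted above; join compares keys according to the locale as well, so it should be run with the same setting.

```shell
# Byte-order sort: ',' (0x2C) precedes '-' (0x2D) whatever the system locale is.
first_char=$(printf '%s\n' '-' ',' | LC_ALL=C sort | head -n 1)
echo "$first_char"
```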

  • 2021-01-03 01:57

    ... or GNU sort is just as buggy as every other GNU command.

    Try to sort Gi1/0/11 against Gi1/0/1 and you'll never get a regular textual sort suitable for join input, because someone added extra intelligence to sort, which will happily apply numeric or human-numeric sorting automagically in such cases, without even offering a flag to force the regular behavior.

    What is suitable for humans is seldom suitable for scripting.
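
As far as the documented behavior goes, GNU sort applies numeric, human-numeric, or version comparison only when -n, -h, or -V is passed; surprising orderings of strings containing punctuation more often come from locale-aware collation. The sketch below works under that assumption: it forces the C locale and then verifies the result with sort -c. The interface names are hypothetical, since the answer does not show its actual input.

```shell
# Hypothetical interface names, sorted as plain bytes.
sorted=$(printf '%s\n' 'Gi1/0/11' 'Gi1/0/2' 'Gi1/0/1' | LC_ALL=C sort)
printf '%s\n' "$sorted"

# sort -c exits with status 0 only if its input is already in order.
if printf '%s\n' "$sorted" | LC_ALL=C sort -c; then
  check=ok
else
  check=unsorted
fi
echo "$check"
```

In byte order, Gi1/0/1 comes before Gi1/0/11, which comes before Gi1/0/2. That is not the order a human would pick, but it is consistent, which is what join needs.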

  • 2021-01-03 02:07

    sort -k1 uses all fields starting from field 1 as the key. You need to specify a stop field.

    sort -t\; -k1,1
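
Put together, the sort and join steps might look like the sketch below. The sample rows and the two-column layout of file2 are made up for illustration (the question's description of file2 is not available), and LC_ALL=C is an added assumption so that sort and join agree on the ordering.

```shell
# Scratch directory with hypothetical sample data.
tmp=$(mktemp -d)
printf '%s\n' 'b;2;x;y' 'a;1;x;y' 'ab;3;x;y' > "$tmp/file1"
printf '%s\n' 'ab;three' 'a;one' 'b;two'     > "$tmp/file2"

# Sort both inputs on field 1 ONLY, using the same separator and locale.
LC_ALL=C sort -t';' -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -t';' -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# Join on the first field of each file (the default join field).
joined=$(LC_ALL=C join -t';' "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

With this data, each output line holds the join field, then the remaining fields from file1, then the remaining fields from file2.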
    