properly join two files based on 2 columns in common

栀梦 2020-11-30 01:28

I have two files I'm trying to join/merge based on columns 1 and 2. They look something like this, with file1 (58210 lin

3 answers
  •  南方客 (original poster)
    2020-11-30 01:56

    You can use the join command, but you need to create a single join field in each data table. Assuming that you do have values other than 2L in column 1, this code should work whether or not the two input files are already sorted:

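    # Base name for the temporary files; $$ (the shell's PID) keeps it unique per run.
    tmp=${TMPDIR:-/tmp}/tmp.$$
    # Remove the temporary files on exit (0) or on HUP, INT, QUIT, PIPE, TERM.
    trap "rm -f $tmp.?; exit 1" 0 1 2 3 13 15
    
    # Prefix each line with a combined "col1:col2" key, then sort so join can use it.
    awk '{print $1 ":" $2, $0}' file1 | sort > $tmp.1
    awk '{print $1 ":" $2, $0}' file2 | sort > $tmp.2
    
    # Print file2's four columns, then file1's third column (field 4 once the key is added).
    join -o 2.2,2.3,2.4,2.5,1.4 $tmp.1 $tmp.2
    
    # Clean up and cancel the exit trap so a normal exit is not turned into status 1.
    rm -f $tmp.?
    trap 0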
    

    If you have bash and 'process substitution', or if you know that the data is already sorted appropriately, you can simplify the processing.
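
    As a rough sketch of that simplification, the temporary files and the trap can be replaced by bash process substitution. The function name join_two_cols is hypothetical, and forcing LC_ALL=C is an added assumption so that sort and join agree on ordering; neither comes from the original code.

```bash
# join_two_cols FILE1 FILE2  (hypothetical helper name; requires bash)
join_two_cols() {
    # Each <( ... ) supplies a sorted, key-prefixed copy of one input file.
    # LC_ALL=C is an assumption added here so sort and join use the same byte order.
    LC_ALL=C join -o 2.2,2.3,2.4,2.5,1.4 \
        <(awk '{print $1 ":" $2, $0}' "$1" | LC_ALL=C sort) \
        <(awk '{print $1 ":" $2, $0}' "$2" | LC_ALL=C sort)
}
```

    Calling join_two_cols file1 file2 should print the same columns as the temporary-file version, with nothing left to clean up afterwards.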


    I'm not entirely sure why your code wasn't working, but I'd probably use a[$1,$2] for the subscripts. That gives you less trouble if some of your column 1 values are purely numeric, because plain concatenation of columns 1 and 2 can make different pairs look identical (for example, 1 and 23 versus 12 and 3 both become 123). That's why the 'key creation' awk scripts use a colon between the fields.
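
    For illustration, here is a minimal awk-only sketch of the a[$1,$2] idea; it is not the code from the question. The comma in the subscript makes awk join the two values with SUBSEP, so the parts of the key stay distinct. The function name join_with_awk and the array name row are hypothetical, and the sketch assumes that file2 is non-empty and that each (column 1, column 2) pair appears at most once in file2.

```bash
# join_with_awk FILE1 FILE2  (hypothetical helper name)
join_with_awk() {
    awk '
        # First file read (FILE2): rebuild the record with single spaces
        # and store it under the two-part key.
        NR == FNR { $1 = $1; row[$1, $2] = $0; next }
        # Second file read (FILE1): print the stored row plus column 3.
        ($1, $2) in row { print row[$1, $2], $3 }
    ' "$2" "$1"
}
```

    Unlike the join version, this needs no sorting, and the output follows the line order of file1 rather than sorted key order.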


    With revised data files as shown:

    file1

    2L      5753   33158
    2L      8813   33158
    2L      7885   33158
    2L      7885   33159
    2L      1279   33158
    2L      5095   33158
    2L      3256   33158
    2L      5372   33158
    2L      7088   33161
    2L      5762   33161
    

    file2

    2L      5095    0.666666666666667       1
    2L      5372    0.5     0.925925925925926
    2L      5762    0.434782608695652       0.580645161290323
    2L      5904    0.571428571428571       0.869565217391304
    2L      5974    0.434782608695652       0.694444444444444
    2L      6353    0.785714285714286       0.84
    2L      7088    0.590909090909091       0.733333333333333
    2L      7885    0.714285714285714       0.864864864864865
    2L      7902    0.642857142857143       0.810810810810811
    2L      8263    0.833333333333333       0.787878787878788
    

    (Unchanged from the question.)

    Output

    2L 5095 0.666666666666667 1 33158
    2L 5372 0.5 0.925925925925926 33158
    2L 5762 0.434782608695652 0.580645161290323 33161
    2L 7088 0.590909090909091 0.733333333333333 33161
    2L 7885 0.714285714285714 0.864864864864865 33158
    2L 7885 0.714285714285714 0.864864864864865 33159
    
