bash: clean outer join of three files, preserving file-membership

Question

Consider the following three files with headers in the first row:

file1:

id name in1
1 jon 1
2 sue 1

file2:

id name in2
2 sue 1
3 bob 1

file3:

id name in3
2 sue 1
3 adam 1

I want to merge these files to get the following output, merged_files:

id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1

This request has several special features that I have not found implemented in a handy way in grep/sed/awk/join etc. Edit: You may assume, for simplicity, that the three files have already been sorted.

Answer 1:

This is very similar to the problem solved in Bash script to find matching rows from multiple CSV files. It's not identical, but it is very similar. (So similar that I only had to remove three sort commands, change the three sed commands slightly, change the file names, change the 'missing' value from no to 0, and change the replacement in the final sed from comma to space.)

The join command, together with sed (and usually sort, but the data here is already sufficiently sorted), is the primary tool needed. Assume that : does not appear in the original data. To record the presence of a row in a file, we want a 1 field in the file (it's almost there); we'll have join supply the 0 when there isn't a match. The space before the 1 at the end of each non-heading line needs to become a :, giving :1, and the last field in the heading likewise needs to be preceded by a :. With : as the only separator, id and name together form a single join field. Then, using bash's process substitution, we can write:

$ sed 's/[ ]\([^ ]*\)$/:\1/' file1 |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2     - <(sed 's/[ ]\([^ ]*\)$/:\1/' file2) |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file3) |
> sed 's/:/ /g'
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0
$

The sed command (used three times) adds the : before the last field in each line of the files. The joins are very nearly symmetric. The -t: specifies that the field separator is the colon; the -a 1 and -a 2 mean that a line with no match in the other file is still included in the output; the -e 0 means that a field missing because of such a non-match is output as 0; and the -o option specifies the output columns. For the first join, -o 0,1.2,2.2 means that the output is the join column (0), which here is the combined id name text, then the second column (the 1) from each of the two files. The second join has 3 columns in its first input, so it specifies -o 0,1.2,1.3,2.2. The argument - on its own means 'read standard input'. The <(...) notation is 'process substitution', where a file name (usually /dev/fd/NN) is provided to the join command, and it contains the output of the command inside the parentheses. The output is then filtered through sed once more to replace the colons with spaces, yielding the desired output.
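
The same pipeline can be extended to more than three files, because each extra join only needs one more 1.N entry in its -o list. The following is a minimal sketch under stated assumptions, not part of the original answer: outer_join is a hypothetical helper name; it uses temporary files rather than process substitution so that each result can be fed back into the next join; and it assumes that the data rows are already sorted and contain no :.

```bash
# Hypothetical helper: outer-join any number of files that share
# "id name" and end in one membership column (in1, in2, ...).
outer_join() {
    mark='s/ \([^ ]*\)$/:\1/'      # put ':' before the last field
    acc=$(mktemp) || return 1
    right=$(mktemp) || return 1
    out=$(mktemp) || return 1
    sed "$mark" "$1" > "$acc"
    shift
    cols=1.2                        # membership columns collected so far
    n=2                             # number of ':'-separated fields in $acc
    for f in "$@"; do
        sed "$mark" "$f" > "$right"
        # LC_ALL=C makes join compare keys byte by byte
        LC_ALL=C join -t: -a 1 -a 2 -e 0 -o "0,$cols,2.2" "$acc" "$right" > "$out"
        mv "$out" "$acc"
        n=$((n + 1))
        cols="$cols,1.$n"           # the new result has one more column
    done
    sed 's/:/ /g' "$acc"
    rm -f "$acc" "$right" "$out"
}

# Sample data from the question, written to a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'id name in1' '1 jon 1' '2 sue 1'  > "$dir/file1"
printf '%s\n' 'id name in2' '2 sue 1' '3 bob 1'  > "$dir/file2"
printf '%s\n' 'id name in3' '2 sue 1' '3 adam 1' > "$dir/file3"

merged=$(outer_join "$dir/file1" "$dir/file2" "$dir/file3")
printf '%s\n' "$merged"
rm -rf "$dir"
```

With the three sample files this should give the same five lines as the session above, with 3 adam ahead of 3 bob.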

The only difference from the desired output is the sequencing of 3 bob after 3 adam; it is not particularly clear on what basis you ordered them in reverse in your desired output. If it is crucial, a means can be devised for resolving the order differently (such as sort -k1,1 -k3,5, except that sorts the label line after the data; there are workarounds for that if necessary).
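
One possible workaround for the label line is to read it off before sorting, so that sort only ever sees the data rows. This is a sketch, not the answer's own method: the r modifier on the flag key is an assumption added here, because an ascending sort on columns 3 to 5 would still put 0 0 1 (adam) ahead of 0 1 0 (bob), and the numeric sort on the id is likewise an assumption.

```bash
# Output of the join pipeline, as shown in the answer.
merged='id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0'

sorted=$(printf '%s\n' "$merged" | {
    IFS= read -r header            # consume only the label line
    printf '%s\n' "$header"        # and pass it through untouched
    # ids ascending (numeric), then the flag columns in descending order
    LC_ALL=C sort -k1,1n -k3,5r
})
printf '%s\n' "$sorted"
```

The shell's read takes exactly one line from the pipe, which is why it is used here instead of head, which may read ahead and swallow data rows.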

Answer 2:

Code for GNU awk:

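# Header line of each file: remember its membership column (in1, in2, ...)
$1 == "id" { v[i++] = $3; next }
# Data line: remember the row and its flag for the current file (index i-1)
{
    b[$1,$2] = $1 " " $2
    c[i-1, $1 " " $2] = $3
}

END {
    printf("id name")
    # Loop by index: the order of "for (x in v)" is unspecified
    for (x = 0; x < i; x++) printf(" %s", v[x])
    printf("\n")
    for (y in b) {                      # row order is unspecified
        printf("%s", b[y])
        for (z = 0; z < i; z++)
            printf(" %s", ((z, b[y]) in c) ? c[z, b[y]] : 0)
        printf("\n")
    }
}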

$ cat file?
id name in1
1 jon 1
2 sue 1
id name in2
2 sue 1
3 bob 1
id name in3
2 sue 1
3 adam 1

$ awk -f prog.awk file?
id name in1 in2 in3
3 bob 0 1 0
3 adam 0 0 1
1 jon 1 0 0
2 sue 1 1 1
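
The rows above come out shuffled because the row loop uses for (y in b), and awk leaves the traversal order of for-in unspecified. A sketch of one way around that, assuming each file has its header on its first line: record every new row in a numbered array as it is first seen, then print by index. The array names (col, row, seen, flag) and the scratch directory are illustrative only.

```bash
# Sample data from the question, written to a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'id name in1' '1 jon 1' '2 sue 1'  > "$dir/file1"
printf '%s\n' 'id name in2' '2 sue 1' '3 bob 1'  > "$dir/file2"
printf '%s\n' 'id name in3' '2 sue 1' '3 adam 1' > "$dir/file3"

result=$(awk '
    # First line of each file: a new membership column
    FNR == 1 { col[++ncol] = $3; next }
    {
        key = $1 " " $2
        # Remember rows in the order they are first seen
        if (!(key in seen)) { seen[key] = 1; row[++nrow] = key }
        flag[key, ncol] = $3
    }
    END {
        printf "id name"
        for (c = 1; c <= ncol; c++) printf " %s", col[c]
        printf "\n"
        for (r = 1; r <= nrow; r++) {
            printf "%s", row[r]
            for (c = 1; c <= ncol; c++)
                printf " %s", (((row[r], c) in flag) ? flag[row[r], c] : 0)
            printf "\n"
        }
    }
' "$dir/file1" "$dir/file2" "$dir/file3")
printf '%s\n' "$result"
rm -rf "$dir"
```

Because bob is first seen in file2 and adam in file3, this ordering happens to match the desired output in the question.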

Answer 3:

This awk script will do what you want:

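# Header line of each file: record its membership column in file order
$1=="id"&&$2=="name"{
    ins[n++] = $3;
    lastin = $3;
}
# Data line: record the id, the name, and the flag for the current file
$1!="id"||$2!="name" {
    ids[$1] = 1;
    names[$2] = 1;
    a[$1,$2,lastin] = $3;
    used[$1,$2] = 1;
}
END {
    printf "id name"
    # Loop by index: the order of "for (i in ins)" is unspecified
    for (k = 0; k < n; k++) {
        printf " %s", ins[k]
    }
    printf "\n"
    # Row order follows awk's unspecified for-in order
    for (id in ids) {
        for (name in names) {
            if ((id,name) in used) {
                printf "%s %s", id, name
                for (k = 0; k < n; k++) {
                    # A missing entry is empty, which %d prints as 0
                    printf " %d", a[id,name,ins[k]]
                }
                printf "\n"
            }
        }
    }
}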

Assuming your files are called list1, list2, etc., and the awk file is script.awk, you can run it like this:

$ cat list* | awk -f script.awk
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1

I am sure there is a much shorter and simpler way to do it, but this is all I could come up with at 1:30 am :)

Answer 4:

I wrote this a while back and posted it online; I am posting it here too so that I can find it the next time I look this up. It's a bit of a kludge, but it supports outer, left, exclusive, and other joins, as well as duplicate handling (remove or multiply).

https://code.google.com/p/ea-utils/source/browse/trunk/clipper/xjoin

TODO: handle headers better, handle streaming input.

Usage: xjoin [options] [:]<operator> <f1> <f2> [...*]

Joins file 1 and file 2 by the first column, suitable
for arbitrarily large files (disk-based sort).

Operator is one of:

# Pasted ops, combines rows:

  in[ner]   return rows in common
  le[ft]    return rows in common, left joined
  ri[ght]   return rows in common, right joined
  ou[ter]   return all rows, outer joined

# Exclusive (not pasted) ops, only return rows from 1 file:

  ex[clude] return only those rows with nothing in common (see -f)
  xl[eft]   return left file rows that are not in right file
  xr[ight]  return right file rows that are not in left file

Common options:

  -1,-2=N     per file, column number to join on (def 1)
  -k=N        set the key column to N (for both files)
  -d    STR   column delimiter (def tab)
  -q    STR   quote char (def none)
  -h    [N]   files have headers (optionally, N is the file number)
  -u    [N]   files may contain duplicate entries, only output first match
  -s    [N]   files are already sorted, don't sort first
  -n          numeric sort key columns
  -p          prefix headers with filename/
  -f          prefix rows with the input file name (op:ex only)

Source: https://stackoverflow.com/questions/17507765/bash-clean-outer-join-of-three-files-preserving-file-membership

Tags

bash

awk

outer-join