Question
Consider the following three files with headers in the first row:
file1:
id name in1
1 jon 1
2 sue 1
file2:
id name in2
2 sue 1
3 bob 1
file3:
id name in3
2 sue 1
3 adam 1
I want to merge these files to get the following output, merged_files:
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1
This request has several special features that I have not found implemented in a handy way in grep/sed/awk/join etc. Edit: You may assume, for simplicity, that the three files have already been sorted.
Answer 1:
This is very similar to the problem solved in Bash script to find matching rows from multiple CSV files. It's not identical, but it is very similar. (So similar that I only had to remove three sort commands, change the three sed commands slightly, change the file names, change the 'missing' value from no to 0, and change the replacement in the final sed from comma to space.)
The join command with sed (usually sort too, but the data is already sufficiently sorted) are the primary tools needed. Assume that : does not appear in the original data. To record the presence of a row in a file, we want a 1 field in the file (it's almost there); we'll have join supply the 0 when there isn't a match. The 1 at the end of each non-heading line needs to become :1, and the last field in the heading also needs to be preceded by the :. Then, using bash's process substitution, we can write:
$ sed 's/[ ]\([^ ]*\)$/:\1/' file1 |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file2) |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file3) |
> sed 's/:/ /g'
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0
$
The sed command (three times) adds the : before the last field in each line of the files. The joins are very nearly symmetric. The -t: specifies that the field separator is the colon; the -a 1 and -a 2 mean that when there isn't a match in a file, the line will still be included in the output; the -e 0 means that if there isn't a match in a file, a 0 is generated in the output; and the -o option specifies the output columns. For the first join, -o 0,1.2,2.2 means the output is the join column (0), then the second column (the 1) from each of the two files. The first input to the second join has 3 columns, so it specifies -o 0,1.2,1.3,2.2. The argument - on its own means 'read standard input'. The <(...) notation is 'process substitution', where a file name (usually /dev/fd/NN) is provided to the join command, and it contains the output of the command inside the parentheses. The output is then filtered through sed once more to replace the colons with spaces, yielding the desired output.
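To see what the options contribute, here is a minimal sketch of the first join on its own. It recreates file1 and file2 from the question in a scratch directory; the helper name mark and the scratch directory are assumptions made for illustration. Header lines are skipped in this sketch so that join only ever sees sorted keys.

```bash
#!/usr/bin/env bash
# Sketch: the sed step and the first join in isolation, data rows only.
set -euo pipefail
export LC_ALL=C                       # same collation for every tool

tmp=$(mktemp -d)                      # scratch directory (assumption)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"

printf '%s\n' 'id name in1' '1 jon 1' '2 sue 1' > file1
printf '%s\n' 'id name in2' '2 sue 1' '3 bob 1' > file2

# mark (hypothetical helper): drop the header line, then put ':' before
# the last field, so "id name" becomes field 1 and the flag field 2.
mark() { tail -n +2 "$1" | sed 's/[ ]\([^ ]*\)$/:\1/'; }

marked=$(mark file1)
printf '%s\n' "$marked"

# Full outer join: -a 1 -a 2 keep unpaired lines from both inputs,
# -e 0 fills in the missing flag, -o selects the key and both flags.
joined=$(mark file1 | join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 - <(mark file2))
printf '%s\n' "$joined"
```

The first printf shows the rows of file1 with the flag split off by a colon; the second shows one row per distinct id and name, with a 0 wherever a row was absent from one of the two files.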
The only difference from the desired output is the sequencing of 3 bob after 3 adam; it is not particularly clear on what basis you ordered them in reverse in your desired output. If it is crucial, a means can be devised for resolving the order differently (such as sort -k1,1 -k3,5r, which sorts the flag columns in descending order, except that it sorts the label line after the data; there are workarounds for that if necessary).
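One possible workaround for the label line, sketched under the assumption that the desired order is ascending id followed by descending membership flags: read the first line separately, print it, and sort only the remaining lines. The helper name sort_body is hypothetical. The flag key carries the r modifier because an ascending sort on columns 3 to 5 would still place 3 adam (0 0 1) ahead of 3 bob (0 1 0).

```bash
#!/usr/bin/env bash
# Sketch: re-sort the merged rows while keeping the header line first.
set -euo pipefail
export LC_ALL=C

# Merged output as printed by the join pipeline in this answer.
merged='id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0'

# sort_body (hypothetical helper): pass the first line through untouched,
# then sort the rest by id (numeric) and by the flag columns, descending.
sort_body() {
    IFS= read -r header
    printf '%s\n' "$header"
    sort -k1,1n -k3,5r
}

sorted=$(printf '%s\n' "$merged" | sort_body)
printf '%s\n' "$sorted"
```

In the full pipeline, the same function could take the place of a plain sort after the final sed.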
Answer 2:
Code for GNU awk:
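{
    # Header line: remember this file's membership column label.
    if ($1=="id") { v[i++]=$3; next }
    b[$1,$2]=$1" "$2          # distinct "id name" pairs
    c[i-1,$1" "$2]=$3         # flag of this pair in the current file
}
END {
    printf ("id name")
    # Index loops keep the columns in file order; "for (x in v)" has no defined order.
    for (x=0; x<i; x++) printf (" %s", v[x]); printf ("\n")
    for (y in b) {            # row order is whatever awk's for-in yields
        printf ("%s", b[y])
        for (z=0; z<i; z++) if (c[z,b[y]]==0) {printf (" 0")} else printf (" %s", c[z,b[y]])
        printf ("\n")
    }
}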
$ cat file?
id name in1
1 jon 1
2 sue 1
id name in2
2 sue 1
3 bob 1
id name in3
2 sue 1
3 adam 1
$ awk -f prog.awk file?
id name in1 in2 in3
3 bob 0 1 0
3 adam 0 0 1
1 jon 1 0 0
2 sue 1 1 1
Answer 3:
This awk script will do what you want:
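# Header line: remember each membership column once, in order of appearance.
$1=="id"&&$2=="name"{
    if (!($3 in ins)) { ins[$3] = 1; order[++n] = $3 }
    lastin = $3;
}
# Data line: record the id, the name, and the flag for the current file.
$1!="id"||$2!="name" {
    ids[$1] = 1;
    names[$2] = 1;
    a[$1,$2,lastin]= $3
    used[$1,$2] = 1;
}
END {
    printf "id name"
    # Index loop keeps the columns in input order; "for (i in ins)" has no defined order.
    for (k = 1; k <= n; k++) {
        printf " %s", order[k]
    }
    printf "\n"
    # Row order follows awk's for-in order, which is unspecified.
    for (id in ids) {
        for (name in names) {
            if ((id,name) in used) {
                printf "%s %s", id, name
                for (k = 1; k <= n; k++) {
                    printf " %d", a[id,name,order[k]]
                }
                printf "\n"
            }
        }
    }
}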
Assuming your files are called list1, list2, etc., and the awk file is script.awk, you can run it like this:
$ cat list* | awk -f script.awk
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1
I am sure there is a much shorter and simpler way to do it, but this is all I could come up with at 1:30 am :)
Answer 4:
I wrote this a while back. I posted it online and am posting it here so that I can find it the next time I look this up. It's a bit of a kludge, but it supports outer, left, exclusive, and other joins, duplicate handling (remove or multiply), and so on.
https://code.google.com/p/ea-utils/source/browse/trunk/clipper/xjoin
TODO: handle headers better, handle streaming input.
Usage: xjoin [options] [:]<operator> <f1> <f2> [...*]
Joins file 1 and file 2 by the first column, suitable
for arbitrarily large files (disk-based sort).
Operator is one of:
# Pasted ops, combines rows:
in[ner] return rows in common
le[ft] return rows in common, left joined
ri[ght] return rows in common, right joined
ou[ter] return all rows, outer joined
# Exclusive (not pasted) ops, only return rows from 1 file:
ex[clude] return only those rows with nothing in common (see -f)
xl[eft] return left file rows that are not in right file
xr[ight] return right file rows that are not in left file
Common options:
-1,-2=N per file, column number to join on (def 1)
-k=N set the key column to N (for both files)
-d STR column delimiter (def tab)
-q STR quote char (def none)
-h [N] files have headers (optionally, N is the file number)
-u [N] files may contain duplicate entries, only output first match
-s [N] files are already sorted, don't sort first
-n numeric sort key columns
-p prefix headers with filename/
-f prefix rows with the input file name (op:ex only)
Source: https://stackoverflow.com/questions/17507765/bash-clean-outer-join-of-three-files-preserving-file-membership