Question
I am trying to combine data from two different files. In each file, some data is linked to an ID. I want to 'combine' both files in the sense that every ID is printed to a new file, with the data from both files correctly matched to its ID. Example:
cat file_1
1.01 data_a
1.02 data_b
1.03 data_c
1.04 data_d
1.05 data_e
1.06 data_f
cat file_2
1.01 data_aa
1.03 data_cc
1.05 data_ee
1.09 data_ii
The desired result is:
cat files_combined
1.01 data_a data_aa
1.02 data_b
1.03 data_c data_cc
1.04 data_d
1.05 data_e data_ee
1.06 data_f
1.09 data_ii
I know how to do it the long, slow way through looping over each ID. Somewhat pseudocode example:
awk -F\\t '{print $1}' file_1 > files_combined
awk -F\\t '{print $1}' file_2 >> files_combined
sort -u -n files_combined > tmp && mv tmp files_combined
count=0
while read line; do
count=$((count+1))
ID=$line
value1=$(grep "$ID" file_1 | awk -F\\t '{print $2}')
value2=$(grep "$ID" file_2 | awk -F\\t '{print $2}')
awk -F\\t -v n="$count" -v v1="$value1" -v v2="$value2" 'NR==n {$2=v1; $3=v2} 1' OFS="\t" files_combined > tmp && mv tmp files_combined
done < files_combined
This does the job for a file with 10 lines, but with 100000 lines it simply takes too long. I'm just looking for that magic awk solution that is there without a doubt.
Solution provided by bob dylan:
join -j 1 -a 1 -a 2 -t $'\t' -o auto file_1 file_2
Answer 1:
Does it have to be awk, or did you choose it because you thought it would be the best or easiest way?
You can do this with join:
$ join -j 1 -a 1 -a 2 -o auto file_1 file_2 | column -t -s' ' -o' '
1.01 data_a data_aa
1.02 data_b
1.03 data_c data_cc
1.04 data_d
1.05 data_e data_ee
1.06 data_f
1.09 data_ii
Edit: following the excellent suggestion from KamilCuk, the column layout of the output can be preserved afterwards by piping the result through column, as in the command above.
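One point worth adding to the join approach: join expects both inputs to be sorted on the join field, and it compares keys as strings. The following is a minimal sketch of that, not part of the original answer. It assumes GNU join (for -o auto) and tab-separated files; the scratch directory, the deliberately unsorted sample data, and the function name combine are made up for illustration.

```shell
#!/bin/sh
# Sketch: sort both inputs on the ID field, then do a full outer join.
# Assumes GNU join (-o auto is a GNU extension) and tab-separated input.
set -eu
export LC_ALL=C                 # make sort and join agree on collation

workdir=$(mktemp -d)            # hypothetical scratch directory for the demo
tab=$(printf '\t')

# Deliberately unsorted sample data, modelled on the question.
printf '1.03\tdata_c\n1.01\tdata_a\n1.02\tdata_b\n' > "$workdir/file_1"
printf '1.09\tdata_ii\n1.01\tdata_aa\n1.03\tdata_cc\n' > "$workdir/file_2"

combine() {
    # join compares keys as strings, so sort lexically on field 1.
    sort -t "$tab" -k1,1 "$1" > "$workdir/sorted_1"
    sort -t "$tab" -k1,1 "$2" > "$workdir/sorted_2"
    # -a 1 -a 2 keeps unpaired lines from both files (full outer join).
    join -t "$tab" -j 1 -a 1 -a 2 -o auto "$workdir/sorted_1" "$workdir/sorted_2"
}

combine "$workdir/file_1" "$workdir/file_2" > "$workdir/files_combined"
cat "$workdir/files_combined"
```

If the files are already sorted on the ID, the two sort calls can be dropped. With -o auto, a field that is missing on one side is left empty, so data from file_2 stays in the third column even when file_1 has no entry for that ID.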
Answer 2:
1st solution: If an ID ($1) can occur more than once in your input files, the following handles that case as well.
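awk '
BEGIN{
  OFS="\t"
}
FNR==NR{                 # first file (file_1): remember its data per ID
  a[$1]=$2
  next
}
$1 in a{                 # ID present in both files
  print $1,a[$1],$2
  c[$1]                  # mark the ID as matched; a[] is kept for repeated IDs
  next
}
{
  b[$1]=$2               # ID found only in file_2
}
END{
  for(i in a){
    if(!(i in c)){
      print i,a[i],""    # only in file_1: empty third column
    }
  }
  for(j in b){
    print j,"",b[j]      # only in file_2: empty second column
  }
}
' file_1 file_2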
2nd solution: Try the following if the order of the output does not matter to you. There is no need to run that many commands; you can simply pass both input files to this code.
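awk '
BEGIN{
  OFS="\t"
}
FNR==NR{                 # first file (file_1): remember its data per ID
  a[$1]=$2
  next
}
$1 in a{                 # ID present in both files
  print $1,a[$1],$2
  delete a[$1]           # matched, so it is not printed again in END
  next
}
{
  b[$1]=$2               # ID found only in file_2
}
END{
  for(i in a){
    print i,a[i],""      # only in file_1: empty third column
  }
  for(j in b){
    print j,"",b[j]      # only in file_2: empty second column
  }
}
' file_1 file_2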
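Because awk does not define the order in which for (i in a) visits array elements, the lines printed from the END block can come out in any order. If sorted output is wanted, one option is to collect every ID first and pipe the result through sort. The sketch below is a variation on the answer's idea, not the answer's own code; it assumes tab-separated input with unique IDs per file, and the scratch directory and the function name combine_sorted are hypothetical.

```shell
#!/bin/sh
# Sketch: merge two ID/value files in one awk pass, then sort by ID.
# Assumes tab-separated input and at most one line per ID in each file.
set -eu
export LC_ALL=C                 # predictable numeric sorting of the IDs

workdir=$(mktemp -d)            # hypothetical scratch directory for the demo
printf '1.01\tdata_a\n1.02\tdata_b\n1.03\tdata_c\n1.04\tdata_d\n1.05\tdata_e\n1.06\tdata_f\n' > "$workdir/file_1"
printf '1.01\tdata_aa\n1.03\tdata_cc\n1.05\tdata_ee\n1.09\tdata_ii\n' > "$workdir/file_2"

combine_sorted() {
    awk '
    BEGIN { FS = OFS = "\t" }
    FNR == NR { a[$1] = $2; ids[$1]; next }   # first file
    { b[$1] = $2; ids[$1] }                   # second file
    END {
        # A missing a[id] or b[id] evaluates to "", giving an empty column.
        for (id in ids)
            print id, a[id], b[id]
    }
    ' "$1" "$2" | sort -t "$(printf '\t')" -k1,1n
}

combine_sorted "$workdir/file_1" "$workdir/file_2" > "$workdir/files_combined"
cat "$workdir/files_combined"
```

Every output line has three tab-separated columns, so data from the second file always lands in the third column. The whole merge stays a single pass over each file, with the sort as the only extra cost.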
Source: https://stackoverflow.com/questions/59308132/matching-data-to-correct-id-from-two-files-in-awk