I want to make pairs of words based on the second column (the identifier). My file is similar to this example:

A ID.1
B ID.2
C ID.1
D ID.1
E ID.2
F ID.3

The result I want is:

A C ID.1
A D ID.1
B E ID.2
C D ID.1

Note that I don't want to obtain the same word pair again in the opposite order. In my real file, some words appear more than once, with different identifiers.

I tried this code, which works but takes a long time (and I don't know whether it contains redundancies):

counter=2
cat filtered_go_annotation.txt | while read f1 f2; do 
tail -n +$counter go_annotation.txt | grep $f2 | awk '{print "'$f1' " $1}'; 
((counter++))
done > go_network2.txt

The 'tail' skips the lines that have already been read, so each line is only compared with the lines after it.
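One thing worth checking in the loop above, shown here as a small sketch on made-up data: `grep $f2` treats the identifier as an unanchored regular expression, so if one identifier is a prefix of another (the `ID.10` line below is an assumption, it is not in the example file), `ID.1` also matches `ID.10`. Comparing the identifier field exactly avoids that.

```shell
# Hypothetical sample with one identifier that is a prefix of another
sample=$(printf '%s\n' 'A ID.1' 'B ID.10' 'C ID.1')

# Unanchored regex, as in `grep $f2`: ID.1 also matches the ID.10 line
loose=$(printf '%s\n' "$sample" | grep -c 'ID.1')

# Exact comparison on the identifier field
exact=$(printf '%s\n' "$sample" | awk -v id='ID.1' '$2 == id { n++ } END { print n + 0 }')

echo "loose=$loose exact=$exact"
```

On this sample the loose count is 3 and the exact count is 2.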

In two steps:

$ sort -k2 file > file.s
$ join -j2 file.s{,} | awk '!(a[$2,$3]++ + a[$3,$2]++){print $2,$3,$1}'

A C ID.1
A D ID.1
C D ID.1
B E ID.2
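To see what the awk filter is working with, here is a sketch of the intermediate `join` output on the sample data. It assumes a scratch directory created with `mktemp`, spells the join as `-1 2 -2 2` instead of `-j2`, and sets `LC_ALL=C` so that `sort` and `join` agree on collation; none of that is part of the answer above.

```shell
export LC_ALL=C                      # same collation for sort and join
work=$(mktemp -d)                    # scratch directory (hypothetical layout)
printf '%s\n' 'A ID.1' 'B ID.2' 'C ID.1' 'D ID.1' 'E ID.2' 'F ID.3' > "$work/file"
sort -k2 "$work/file" > "$work/file.s"

# Self-join on field 2: each word is paired with every word sharing its
# identifier, including itself and in both orders
joined=$(join -1 2 -2 2 "$work/file.s" "$work/file.s")
printf '%s\n' "$joined" | head -n 4

# Keep a pair only the first time it appears in either order;
# self-pairs such as "A A" are dropped because the same counter is hit twice
pairs=$(printf '%s\n' "$joined" | awk '!(a[$2,$3]++ + a[$3,$2]++){print $2,$3,$1}')
printf '%s\n' "$pairs"
rm -rf "$work"
```

The self-join produces 14 lines for the 6 input lines (9 + 4 + 1), so the intermediate result grows with the square of each group's size before the filter trims it down.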

Awk solution:

awk '{ a[$2] = ($2 in a? a[$2] FS : "") $1 }
     END {
         for (k in a) {
             len = split(a[k], items);
             for (i = 1; i <= len; i++)
                 for (j = i+1; j <= len; j++)
                     print items[i], items[j], k 
         }
     }' filtered_go_annotation.txt

The output:

A C ID.1
A D ID.1
C D ID.1
B E ID.2
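Note that the order in which `for (k in a)` visits the identifiers is unspecified, so the groups may come out in a different order with another awk. If a stable order matters, one option (a sketch, not part of the answer above) is to pipe the result through `sort`; the words within each pair keep their input order either way, because `split` numbers them from 1.

```shell
sorted_pairs=$(printf '%s\n' 'A ID.1' 'B ID.2' 'C ID.1' 'D ID.1' 'E ID.2' 'F ID.3' |
awk '
    { a[$2] = ($2 in a ? a[$2] FS : "") $1 }   # collect the words of each identifier
    END {
        for (k in a) {                         # visiting order is unspecified
            len = split(a[k], items)
            for (i = 1; i <= len; i++)
                for (j = i + 1; j <= len; j++)
                    print items[i], items[j], k
        }
    }' | LC_ALL=C sort -k3,3 -k1,1 -k2,2)      # order by identifier, then by words
printf '%s\n' "$sorted_pairs"
```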

With GNU awk for sorted_in and true multi-dimensional arrays:

$ cat tst.awk
{ vals[$2][$1] }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in vals) {
        for (j in vals[i]) {
            for (k in vals[i]) {
                if (j != k) {
                    print j, k, i
                }
            }
            delete vals[i][j]
        }
    }
}

$ awk -f tst.awk file
A C ID.1
A D ID.1
C D ID.1
B E ID.2

I wonder if this would work (in GNU awk):

$ awk '
($2 in a) && !($1 in a[$2]) {  # if ID.x is found in a and A not in a[ID.X]
    for(i in a[$2])            # loop all existing a[ID.x] 
        print i,$1,$2          # and output combination of current and all previous matching
}
{
    a[$2][$1]                  # hash to a
}' file
A C ID.1
A D ID.1
C D ID.1
B E ID.2
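The `a[$2][$1]` arrays of arrays are a GNU awk extension. A sketch of the same streaming idea for other awks, assuming a space-separated list per identifier plus a set keyed on identifier and word (`list` and `seen` are names I made up), could look like this. The extra `A ID.1` line at the end of the sample is also an assumption, added to show that a repeated line does not produce duplicate pairs.

```shell
stream_pairs=$(printf '%s\n' 'A ID.1' 'B ID.2' 'C ID.1' 'D ID.1' 'E ID.2' 'F ID.3' 'A ID.1' |
awk '
    !(($2, $1) in seen) {                  # skip a word already recorded for this identifier
        n = split(list[$2], prev, " ")     # words seen earlier under the same identifier
        for (i = 1; i <= n; i++)
            print prev[i], $1, $2          # pair the current word with each earlier one
        list[$2] = list[$2] " " $1
        seen[$2, $1] = 1
    }')
printf '%s\n' "$stream_pairs"
```

Because each pair is printed as soon as its second word is read, nothing has to wait for an END block, and the pairs come out in input order.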

If your input is large, it may be faster to solve it in steps, e.g.:

# Create temporary directory for generated data
mkdir workspace; cd workspace

# Split original file
awk '{ print $1 > $2 }' ../infile

# Find all combinations
perl -MMath::Combinatorics \
     -n0777aE              \
     '
       $c=Math::Combinatorics->new(count=>2, data=>[@F]);
       while(@C = $c->next_combination) { 
         say join(" ", @C) . " " . $ARGV
       }
     ' *

Output:

C D ID.1
C A ID.1
D A ID.1
B E ID.2

Perl solution using regex backtracking:

perl -n0777E '/^([^ ]*) (.*)\n(?:.*\n)*?([^ ]*) (\2)\n(?{say"$1 $3 $2"})(?!)/mg' foo.txt

For the flags, see perl -h (or perldoc perlrun).
^([^ ]*) (.*)\n : matches a line containing at least one space; the first capturing group is the text to the left of the first space, the second is the text to its right.
(?:.*\n)*? : lazily matches (without capturing) zero or more lines, so the following pattern is tried before more lines are consumed.
([^ ]*) (\2)\n : like the first match, but uses the backreference \2 to match a line with the same key.
(?{say"$1 $3 $2"}) : embedded code that prints the captured groups.
(?!) : always fails, which forces the regex engine to backtrack and try the next combination.

Note that it could be shortened a bit:

perl -n0777E '/^(\S+)(.+)[\s\S]*?^((?1))(\2)$(?{say"$1 $3$2"})(?!)/mg' foo.txt

Yet another awk solution, making use of the redefinition of $0. This makes RomanPerekhrest's solution a bit shorter:

{a[$2]=a[$2] FS $1}
END { for(i in a) { $0=a[i]; for(j=1;j<NF;j++)for(k=j+1;k<=NF;++k) print $j,$k,i} }
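The trick relies on awk re-splitting the record whenever `$0` is assigned, which also discards the leading separator left behind by `a[$2] FS $1`. A minimal demonstration, independent of any input file (the literal string is just an example value):

```shell
# Assigning to $0 in END re-splits it into fields; with the default FS,
# leading blanks are ignored, so " A C D" yields exactly three fields
resplit=$(awk 'END { $0 = " A C D"; print NF, $1, $NF }' /dev/null)
echo "$resplit"   # prints: 3 A D
```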

Source: https://stackoverflow.com/questions/50565688/making-pairs-of-words-based-on-one-column
