Making pairs of words based on one column

Posted by 两盒软妹~` on 2019-12-06 08:44:43

In two steps, using sort and join:

$ sort -k2 file > file.s
$ join -j2 file.s{,} | awk '!(a[$2,$3]++ + a[$3,$2]++){print $2,$3,$1}'

A C ID.1
A D ID.1
C D ID.1
B E ID.2
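
To see why the awk filter is needed, it can help to look at what the self-join emits before filtering. This is only a sketch: the five-line sample input is an assumption (the post does not show the file), and it uses the POSIX spelling -1 2 -2 2 in place of -j2 and explicit file names in place of the bash-only file.s{,}.

```shell
# Hypothetical sample input; the original post does not show the file.
tmp=$(mktemp -d)
printf '%s\n' 'A ID.1' 'B ID.2' 'C ID.1' 'D ID.1' 'E ID.2' > "$tmp/file"

# Sort on the key column, because join expects input sorted on the join field.
sort -k2,2 "$tmp/file" > "$tmp/file.s"

# Self-join on column 2: every ordered pair per key, self-pairs included.
raw=$(join -1 2 -2 2 "$tmp/file.s" "$tmp/file.s")
printf '%s\n' "$raw" | head -n 4

# Keep a pair only the first time it is seen in either order. A self-pair
# such as "A A" bumps the same counter twice, so the sum is 1 and it is dropped.
pairs=$(printf '%s\n' "$raw" | awk '!(a[$2,$3]++ + a[$3,$2]++) { print $2, $3, $1 }')
printf '%s\n' "$pairs"
rm -rf "$tmp"
```

With this input the raw join has 13 lines (3×3 for ID.1 plus 2×2 for ID.2), and the filter reduces them to the four unordered pairs.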

Awk solution:

awk '{ a[$2] = ($2 in a? a[$2] FS : "") $1 }
     END {
         for (k in a) {
             len = split(a[k], items);
             for (i = 1; i <= len; i++)
                 for (j = i+1; j <= len; j++)
                     print items[i], items[j], k 
         }
     }' filtered_go_annotation.txt

The output:

A C ID.1
A D ID.1
C D ID.1
B E ID.2

With GNU awk for sorted_in and true multi-dimensional arrays:

$ cat tst.awk
{ vals[$2][$1] }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in vals) {
        for (j in vals[i]) {
            for (k in vals[i]) {
                if (j != k) {
                    print j, k, i
                }
            }
            delete vals[i][j]
        }
    }
}

$ awk -f tst.awk file
A C ID.1
A D ID.1
C D ID.1
B E ID.2

I wonder if this would work (in GNU awk):

$ awk '
($2 in a) && !($1 in a[$2]) {  # if ID.x is found in a and A not in a[ID.X]
    for(i in a[$2])            # loop all existing a[ID.x] 
        print i,$1,$2          # and output combination of current and all previous matching
}
{
    a[$2][$1]                  # hash to a
}' file
A C ID.1
A D ID.1
C D ID.1
B E ID.2

If your input is large, it may be faster to solve it in steps, e.g.:

# Create temporary directory for generated data
mkdir workspace; cd workspace

# Split original file
awk '{ print $1 > $2 }' ../infile

# Find all combinations
perl -MMath::Combinatorics \
     -n0777aE              \
     '
       $c=Math::Combinatorics->new(count=>2, data=>[@F]);
       while(@C = $c->next_combination) { 
         say join(" ", @C) . " " . $ARGV
       }
     ' *

Output:

C D ID.1
C A ID.1
D A ID.1
B E ID.2
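
If Math::Combinatorics is not installed, the second step could be approximated with plain awk instead. This is a swapped-in technique rather than the answer's method; the function name pairs_from_split_files and the demo data are made up for illustration, and it assumes one file per key with one item per line, as the split step produces.

```shell
# Hypothetical helper: $1 is a directory holding one file per key
# (file name = key, one item per line).
pairs_from_split_files() {
    for f in "$1"/*; do
        awk -v key="${f##*/}" '
            { item[NR] = $1 }                      # remember items in input order
            END {
                for (i = 1; i < NR; i++)           # each unordered pair once
                    for (j = i + 1; j <= NR; j++)
                        print item[i], item[j], key
            }' "$f"
    done
}

# Hypothetical demo data, split the same way as above (file name = key).
work=$(mktemp -d)
printf '%s\n' 'A ID.1' 'B ID.2' 'C ID.1' 'D ID.1' 'E ID.2' |
    (cd "$work" && awk '{ print $1 > $2 }')
pairs_from_split_files "$work"
rm -rf "$work"
```

Unlike the Perl version, the pairs come out in input order within each key.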

Perl solution using regex backtracking:

perl -n0777E '/^([^ ]*) (.*)\n(?:.*\n)*?([^ ]*) (\2)\n(?{say"$1 $3 $2"})(?!)/mg' foo.txt
  • For the flags, see perl -h: -n reads the input without printing it, -0777 slurps each file whole, and -E runs the code with features such as say enabled.
  • ^([^ ]*) (.*)\n : matches a line containing at least one space; the first capturing group holds the text to the left of the first space, the second the text to its right.
  • (?:.*\n)*? : lazily matches (without capturing) zero or more whole lines, so the pattern that follows is tried before any more lines are consumed.
  • ([^ ]*) (\2)\n : like the first part, but uses the backreference \2 to match a line with the same key.
  • (?{say"$1 $3 $2"}) : embedded code that prints the captured groups.
  • (?!) : always fails, which forces the engine to backtrack and try the next candidate pair.

Note that it could be shortened a bit:

perl -n0777E '/^(\S+)(.+)[\s\S]*?^((?1))(\2)$(?{say"$1 $3$2"})(?!)/mg' foo.txt

Yet another awk solution, this time reassigning $0 so that awk re-splits the collected items into fields. This makes RomanPerekhrest's solution a bit shorter:

{a[$2]=a[$2] FS $1}
END { for(i in a) { $0=a[i]; for(j=1;j<NF;j++)for(k=j+1;k<=NF;++k) print $j,$k,i} }
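
The effect of assigning to $0 can be checked in isolation. A small sketch; the function name resplit_demo and the string " A C D" are made up for illustration, and the leading space mimics the leading FS that a[$2]=a[$2] FS $1 produces:

```shell
resplit_demo() {
    # No input is read; END still runs, and assigning $0 re-splits it
    # into fields. With the default FS the leading blank is ignored.
    awk 'END { $0 = " A C D"; print NF; print $1, $3 }' /dev/null
}
resplit_demo
```

This prints 3 and then A D, which is why the solution above can loop over $j and $k without calling split().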