This is VERY clearly a case for a recursive descent solution:
$ cat tst.awk
function descend(node) {return (map[node] in map ? descend(map[node]) : map[node])}
{ map[$1] = $2 }
END { for (key in map) print key, descend(key) }
$ awk -f tst.awk file
<a> <e>
<b> <e>
<c> <e>
<d> <e>
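The same lookup can also be written as a loop instead of a recursive call. This is only a minimal sketch: `resolve` and the `resolve_chains` wrapper are names made up for this example, and like the script above it assumes the input contains no cycles.

```shell
resolve_chains() {
    awk '
    # Follow map[] from node until reaching a value that is not itself a key.
    # Assumes the input has no cycles, otherwise this loop never ends.
    function resolve(node) {
        while (map[node] in map)
            node = map[node]
        return map[node]
    }
    { map[$1] = $2 }
    END { for (key in map) print key, resolve(key) }
    ' "$@"
}

printf '%s\n' '<a> <b>' '<d> <e>' '<b> <c>' '<c> <e>' | resolve_chains
```

As with the recursive version, the output order comes from `for (key in map)` and is therefore unspecified.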
If infinite recursion in your input is a possibility, here's an approach that prints, as the 2nd field, the last node before the recursion starts, with a "*" next to it so you know it's happening:
$ cat tst.awk
function descend(node, child, descendant) {
    stack[node]
    child = map[node]
    if (child in map) {
        if (child in stack) {
            descendant = node "*"
        }
        else {
            descendant = descend(child)
        }
    }
    else {
        descendant = child
    }
    delete stack[node]
    return descendant
}
{ map[$1] = $2 }
END { for (key in map) print key, descend(key) }
$ cat file
<w> <w>
<x> <y>
<y> <z>
<z> <x>
<a> <b>
<d> <e>
<b> <c>
<c> <e>
$ awk -f tst.awk file
<w> <w>*
<x> <z>*
<y> <x>*
<z> <y>*
<a> <e>
<b> <e>
<c> <e>
<d> <e>
If you need the output order to match the input order and/or to print duplicate lines twice, change the bottom 2 lines of the script to:
{ keys[++numKeys] = $1; map[$1] = $2 }
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        print key, descend(key)
    }
}
Perl to the rescue:
#!/usr/bin/perl
use warnings;
use strict;
my (@buff);
sub output {
    my $last = pop @buff;
    print map "$_ $last\n", @buff;
    @buff = ();
}
while (<>) {
    my @F = split;
    output() if @buff and $F[0] ne $buff[-1];  # End of a group.
    push @buff, $F[0] unless @buff;            # Start a new group.
    push @buff, $F[1];
}
output(); # Don't forget to print the last buffer.
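Explanation: Read the input line by line, keeping a list of words that will all be printed with the same second word. If the first word of a line differs from the second word of the previous line, print the buffered output and start a new group. This relies on the lines of each chain being adjacent in the input.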
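For comparison with the awk answers in this thread, the same buffering idea can be sketched in awk. This is a rough port, not the Perl author's code: `emit`, `buf`, `n`, and the `group_adjacent` wrapper are names invented for this example, and it too only handles chains whose lines are adjacent.

```shell
group_adjacent() {
    awk '
    # Print every buffered word except the last, paired with the last one.
    function emit(    i) {
        for (i = 1; i < n; i++)
            print buf[i], buf[n]
        n = 0
    }
    n && $1 != buf[n] { emit() }          # first word breaks the chain: end of a group
    !n                { buf[++n] = $1 }   # start a new group
                      { buf[++n] = $2 }
    END               { emit() }          # print the last buffer
    ' "$@"
}

printf '%s\n' '<a> <b>' '<b> <c>' '<c> <e>' '<d> <e>' | group_adjacent
```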
awk '{i++;a[i]=$1;b[i]=$2;next}
END{
    for(i=1;i in a;i++)
    {
        f=1;
        while (f==1)
        {
            f=0;
            for(j=i+1;j in a;j++)
            {
                if(b[i]==a[j])
                {
                    b[i]=b[j];
                    f=1;
                }
            }
        }
    }
    for(i=1;i in a;i++)
    {
        print a[i],b[i];
    }
}' input.txt
Input:
<a> <b>
<d> <e>
<b> <c>
<c> <e>
Output:
<a> <e>
<d> <e>
<b> <e>
<c> <e>
Input:
<a> <b>
<e> <z>
<b> <e>
Output:
<a> <z>
<e> <z>
<b> <e>
If you need to get
<a> <z>
<e> <z>
<b> <z>
as the output for the second input, change this line:
if(b[i]==a[j])
to:
if(j!=i&&b[i]==a[j])
and this:
for(j=i+1;j in a;j++)
to:
for(j=1;j in a;j++)
Also note that this code assumes the second word of a line never matches a line whose first and second words are identical, e.g.:
<a> <b>
<e> <z>
<b> <b>
In that case the code never terminates.
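One way to guard against that is to cap the number of passes per line at the number of input lines, since a chain without a cycle cannot need more passes than that. The following is only a sketch built on the modified version (the `j!=i` test and `for(j=1;...)`); the names `n`, `hops`, `found`, `capped`, and the `resolve_capped` wrapper are invented for this example, and the trailing "*" marks lines where the cap was hit. It guarantees termination but does not claim to flag every cycle.

```shell
resolve_capped() {
    awk '
    { a[++n] = $1; b[n] = $2 }
    END {
        for (i = 1; i <= n; i++) {
            hops = 0
            do {
                found = 0
                for (j = 1; j <= n; j++)
                    if (j != i && b[i] == a[j]) {
                        b[i] = b[j]
                        found = 1
                    }
            } while (found && ++hops < n)   # give up after n passes
            capped[i] = found               # still matching when we gave up
        }
        for (i = 1; i <= n; i++)
            print a[i], b[i] (capped[i] ? "*" : "")
    }
    ' "$@"
}

printf '%s\n' '<a> <b>' '<e> <z>' '<b> <b>' | resolve_capped
```

With the problematic input above this prints `<a> <b>*`, `<e> <z>`, and `<b> <b>` instead of looping forever.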