awk: preserve row order and remove duplicate strings (mirrors) when generating data

Question

I have two text files:

g1.txt

alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;

g2.txt

Jack to ride.zip;http://alfa.org;
JKr.rui.rar;http://gamma.org;
Nofj ogk.png;http://gamma.org;

I use this command to run my awk script

awk -f ./join2.sh g1.txt g2.txt > "g3.txt"

and I obtain this output

Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;;Jack to ride.zip;http://alfa.org;JKr.rui.rar;http://gamma.org;Nofj ogk.png;http://gamma.org;
alfa beta;www.google.com;

What are the problems?

1. Row order is not preserved. For example, in the output file g3.txt the line alfa beta;www.google.com; comes after the Light Dweller... line, when it should come first, as it does in g1.txt.
2. The Light Dweller... line contains many duplicate (mirror) strings. In g3.txt you can see that

http://alfa.org
http://gamma.org
http://gamma.org

are repeated in the same row.

What output do I want instead? Something like this:

alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;Jack to ride.zip;JKr.rui.rar;Nofj ogk.png;

First: I am trying to implement a function that checks whether a row contains identical strings. For example, in the output row Light Dweller - CR, Technical Metal..., the strings http://alfa.org and http://gamma.org each appear more than once. I don't want this: each string enclosed between ; delimiters should appear once and only once per row.
This rule should apply only to the output file, g3.txt.

Second: the original order of the rows in g1.txt must be maintained in the g3.txt output file. For example, in g1.txt I have

alfa beta ... 
Light Dweller ...

but my script returns them in a different order

Light Dweller ...
alfa beta ...

I want to prevent the rows from being reordered.

My join2.sh script is this

#! /usr/bin/awk  -f

BEGIN {
  OFS=FS=";"
  C=0;
}
{
  if (ARGIND == 1) {
     X = $NF
     T0[$NF] = C++
     $NF = ""
     if (T1[X]) {
        T1[X] = T1[X] $0
     } else {
        T1[X] = $0
     }
  } else {
     X = $NF
     T0[$NF] = C++
     $NF = ""
     if (T2[X]) {
        T2[X] = T2[X] $0
     } else {
        T2[X] = $0
     }
  }
}

END {
  for (X in T0) {
    # concatenate T1[X] and X, since T1[X] ends with ";"
    print T1[X]  X, T2[X]
  }
}

SOLUTION:

Answer 1:

You should process g2.txt first like this:

cat join2.awk

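BEGIN {
  OFS=FS=";"
}
# First file (g2.txt): map each URL to the file names listed for it.
# ARGIND is a gawk extension.
ARGIND == 1 {
   map[$2] = ($2 in map ? map[$2] OFS : "") $1
   next
}
# Second file (g1.txt): append the names mapped to any field of the row.
{
   r = $0
   for (i=1; i<=NF; ++i)
      if ($i in map)
         r = r OFS map[$i]
   $0 = r
}
# Print the (possibly extended) row.
1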

Then use it as:

awk -f join2.awk g2.txt g1.txt

alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;;Jack to ride.zip;JKr.rui.rar;Nofj ogk.png
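The output above still differs slightly from the requested one: it has an empty field (;;) where the trailing separator of the g1.txt row meets the appended names, and it lacks the final ;. A minimal sketch that also covers these points is shown here, under a few assumptions that are not part of the original answer: join3.awk is a hypothetical file name, FNR == NR is used in place of ARGIND == 1 so the script does not depend on gawk (it assumes g2.txt is not empty), empty fields are dropped, and a trailing ; is written back only when the input row had one.

```shell
# Sample inputs copied from the question.
cat > g1.txt <<'EOF'
alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;
EOF
cat > g2.txt <<'EOF'
Jack to ride.zip;http://alfa.org;
JKr.rui.rar;http://gamma.org;
Nofj ogk.png;http://gamma.org;
EOF

# join3.awk is a hypothetical file name used for this sketch.
cat > join3.awk <<'EOF'
BEGIN { FS = OFS = ";" }

# First file (g2.txt): collect the names for each URL, in input order.
FNR == NR {
    map[$2] = ($2 in map ? map[$2] OFS : "") $1
    next
}

# Second file (g1.txt): rebuild each row, emitting every string only once.
{
    trailing = sub(/;$/, "")   # remember and drop a trailing separator
    split("", seen)            # portable way to clear the array
    out = ""
    n = NF
    for (i = 1; i <= n; i++)   # the row's own fields first
        add($i)
    for (i = 1; i <= n; i++)   # then the names mapped to those fields
        if ($i in map) {
            m = split(map[$i], names, FS)
            for (j = 1; j <= m; j++)
                add(names[j])
        }
    print out (trailing ? OFS : "")
}

# Append s to out unless it is empty or already emitted for this row.
function add(s) {
    if (s == "" || s in seen) return
    seen[s] = 1
    out = (out == "" ? s : out OFS s)
}
EOF

awk -f join3.awk g2.txt g1.txt > g3.txt
cat g3.txt
```

Rows are printed as they are read from g1.txt, so their order is kept, and the seen array is reset for every row, so the duplicate check applies within a row only.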

Source: https://stackoverflow.com/questions/64733653/awk-preserve-row-order-and-remove-duplicate-strings-mirrors-when-generating-d

Tags

awk

comparison

batch-processing