awk script for replacing multiple occurrences of a string pattern in the same line in different files with a number matching the string

Posted by 雨燕双飞 on 2020-12-13 03:31:46

Question


I need an awk script that searches for any string inside <>. If it finds one it has not seen before, it should replace it with the current value of an index counter (0 at the beginning) and then increment the counter. If it finds a string inside <> that it already knows, it should look up that string's index and replace the string with it. This should work across multiple files: the counter is reset only at program startup, not when the next file is read.

For example, file_a.txt:

123abc<abc>xyz
efg
<b>ah
a<c>, <abc>
<c>b
(<abc>, <b>)

file_b.txt:

xyz(<c>, <b>)
xyz<b>xy<abc>z

should become

file_a_new.txt:

123abc<0>xyz
efg
<1>ah
a<2>, <0>
<2>b
(<0>, <1>)

file_b_new.txt:

xyz(<2>, <1>)
xyz<1>xy<0>z

What I have so far:

awk 'match($0, /<[^>]+>/) {
   k = substr($0, RSTART, RLENGTH)
   if (!(k in freq))
      freq[k] = n++
   $0 = substr($0, 1, RSTART-1) freq[k] substr($0, RSTART+RLENGTH)
}
{
   print $0 > (FILENAME ".tmp")
}' files

This detects only one <> pattern per line, but a line can contain several. How should I change the code?

Edit: The files should not be edited; new files should be created instead.


Answer 1:


With GNU awk this is simpler: use the <key> pattern as the record separator RS, so that each matched key is available in RT:

awk -v RS='<[^>]+>' '{ ORS="" }  # init ORS to ""
RT {                                        # when RT is set
   if (!(RT in freq))                       # if RT is not in freq array
      freq[RT] = n++                        # save n in freq & increment n
   ORS="<" freq[RT] ">"                     # set ORS to < + index of RT + >
}
{
   print $0 > ("/tmp/" FILENAME)
}' file_{a,b}.txt
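
To see why this works, it can help to print how gawk splits a line when RS is the <key> pattern. The following is only an illustrative sketch, not part of the answer: the sample input string and the record=[...] output format are made up for the demonstration, and it relies on RT and a regular-expression RS, both of which are GNU awk extensions.

```shell
# RT and a regular-expression RS are GNU awk extensions, so this needs gawk.
if command -v gawk >/dev/null 2>&1; then
    # Hypothetical sample input; no trailing newline, to keep the last record tidy.
    printf 'a<c>, <abc>z' |
    gawk -v RS='<[^>]+>' '{ printf "record=[%s] RT=[%s]\n", $0, RT }'
else
    echo 'gawk not found; skipping the demo' >&2
fi
```

Each record is the text between two keys, and RT holds the key that ended it, so the run prints record=[a] RT=[<c>], then record=[, ] RT=[<abc>], then record=[z] RT=[]. RT is empty for the last record, which is why the answer only rewrites ORS inside the RT { ... } block.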



Answer 2:


Using any awk:

$ cat tst.awk
FNR == 1 {
    close(out)
    out = FILENAME ".tmp"
}
{
    head = ""
    tail = $0
    while ( match(tail,/<[^>]+>/) ) {
        tgt = substr(tail,RSTART+1,RLENGTH-2)
        if ( !(tgt in map) ) {
            map[tgt] = cnt++
        }
        head = head substr(tail,1,RSTART) map[tgt]
        tail = substr(tail,RSTART+RLENGTH-1)
    }
    print head tail > out
}

$ head file_*.tmp
==> file_a.txt.tmp <==
123abc<0>xyz
efg
<1>ah
a<2>, <0>
<2>b
(<0>, <1>)

==> file_b.txt.tmp <==
xyz(<2>, <1>)
xyz<1>xy<0>z
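
The question asks for output files named file_a_new.txt and file_b_new.txt, while the script above writes to FILENAME ".tmp". A minimal sketch of one way to get the requested names with any POSIX awk follows. It is a variation for illustration, not the answerer's code: the scratch directory awk_demo, the variable name key, and the fallback of appending "_new" when a name does not end in ".txt" are all assumptions.

```shell
# Hypothetical scratch directory so that no existing files are touched.
mkdir -p awk_demo

# Sample inputs taken from the question.
printf '%s\n' '123abc<abc>xyz' 'efg' '<b>ah' 'a<c>, <abc>' '<c>b' '(<abc>, <b>)' > awk_demo/file_a.txt
printf '%s\n' 'xyz(<c>, <b>)' 'xyz<b>xy<abc>z' > awk_demo/file_b.txt

awk '
FNR == 1 {
    if (out != "") close(out)          # finish the previous output file
    out = FILENAME
    # Insert "_new" before a trailing ".txt"; otherwise just append it.
    if (!sub(/\.txt$/, "_new.txt", out)) out = out "_new"
}
{
    head = ""; tail = $0
    while (match(tail, /<[^>]+>/)) {
        key = substr(tail, RSTART + 1, RLENGTH - 2)   # text between < and >
        if (!(key in map)) map[key] = cnt++            # first sighting: assign next index
        head = head substr(tail, 1, RSTART) map[key] ">"
        tail = substr(tail, RSTART + RLENGTH)
    }
    print head tail > out
}' awk_demo/file_a.txt awk_demo/file_b.txt

cat awk_demo/file_a_new.txt awk_demo/file_b_new.txt
```

Because map and cnt are never reset, indices carry over from file_a.txt to file_b.txt, and the two new files should contain the same lines as the .tmp files shown above.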


Source: https://stackoverflow.com/questions/65024964/awk-script-for-replacing-multiple-occurances-of-string-pattern-in-the-same-line
