How do I use awk to extract data within nested delimiters using non-greedy regexps

借酒劲吻你 2020-12-20 22:22

This question occurs repeatedly in many forms with many different multi-character delimiters and so IMHO is worth a canonical answer.

Given an input file like:

2 Answers

伪装坚强ぢ (original poster)

2020-12-20 23:06

My (current version of) solution approaches the problem from the front so the output is not exactly the same:

 .. 1                   # second
   .. a<2 ..  ..  # first in my approach
 
 .. @{<>}@              # fourth
   .. 4 ..  ..    # third
 
 .. 5 ..          # fifth

If the program traversed the arrays arr and seps backwards, the output would probably be the same, but I ran out of time for now.

Originally written for GNU awk, using split() with four parameters to parse the data.

EDIT: For compatibility with awks other than GNU awk, I added the function gsplit(), a crude replacement for GNU awk's four-parameter split().

$ cat program.awk
{ data=data $0 }                         # append all records to one var
END {
    n=gsplit(data, arr, "</?tag>", seps) # split by every tag (assumes <tag> ... </tag> delimiters)
    for(i=1;i<=n;i++) {                  # atm iterate arrays from front to back
        if(seps[i]=="<tag>")             # if element opening tag
            stack[++j]=seps[i] arr[i+1]  # store tag and wait for closing tag
        else {
            stack[j]=stack[j] (seps[i]==prev ? arr[i] : "")
            print stack[j--] seps[i] 
        } 
        prev = seps[i]
    }
}

# elementary gnu awk split compatible replacement
function gsplit(str, arr, pat, seps,    i) {
    delete arr; delete seps; i=0
    while(match(str, pat)) {
        arr[++i]=substr(str,1,(RSTART-1))
        seps[i]=substr(str,RSTART,RLENGTH)
        str=substr(str,(RSTART+RLENGTH))
    }
    arr[++i]=substr(str,1)
    return i
}
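
To see what gsplit() hands back, here is a minimal sketch that runs it on a single string. The sample string and the `<tag>` / `</tag>` delimiters are assumptions made for illustration only; the function body is the one from program.awk. It shows that arr[i] holds the text before the i-th separator and seps[i] holds the separator itself, with one more arr element than seps elements.

```shell
out=$(awk '
# same crude split() replacement as in program.awk
function gsplit(str, arr, pat, seps,    i) {
    delete arr; delete seps; i=0
    while (match(str, pat)) {
        arr[++i] = substr(str, 1, RSTART-1)      # text before the separator
        seps[i]  = substr(str, RSTART, RLENGTH)  # the separator that matched
        str      = substr(str, RSTART+RLENGTH)   # continue after the separator
    }
    arr[++i] = substr(str, 1)                    # trailing text, no separator after it
    return i
}
BEGIN {
    # hypothetical sample input, not taken from the question
    n = gsplit("a<tag>b</tag>c", arr, "</?tag>", seps)
    for (i = 1; i <= n; i++)
        printf "arr[%d]=\"%s\" seps[%d]=\"%s\"\n", i, arr[i], i, seps[i]
}')
printf '%s\n' "$out"
```

The last line printed has an empty seps entry, which is why the main loop in program.awk also sees one empty separator at i == n.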

Run it:

$ awk -f program.awk file
 .. a<2 .. 
 .. 1  .. 
 .. 4 .. 
 .. @{<>}@  .. 
 .. 5 ..
