How do I use awk to extract data within nested delimiters using non-greedy regexps?

借酒劲吻你 2020-12-20 22:22

This question occurs repeatedly in many forms with many different multi-character delimiters and so IMHO is worth a canonical answer.

Given an input file like:

2 Answers
  •  伪装坚强ぢ
    2020-12-20 23:06

    My (current version of) solution approaches the problem from the front so the output is not exactly the same:

     .. 1                   # second
       .. a<2 ..  ..  # first in my approach
     
     .. @{<>}@              # fourth
       .. 4 ..  ..    # third
     
     .. 5 ..          # fifth
    

    If the program traversed the arrays arr and seps backwards, the output would probably be the same, but I have run out of time for now.

    This uses GNU awk, for its four-argument split(), which parses the data and also returns the separators.

    EDIT: For compatibility with awks other than GNU awk, I added a function gsplit(), which is a crude replacement for GNU awk's split().

    $ cat program.awk
    { data=data $0 }                         # append all records to one var
    END {
        n=gsplit(data, arr, "", seps) # split by every tag
        for(i=1;i<=n;i++) {                  # atm iterate arrays from front to back
            if(seps[i]=="")             # if element opening tag
                stack[++j]=seps[i] arr[i+1]  # store tag and wait for closing tag
            else {
                stack[j]=stack[j] (seps[i]==prev ? arr[i] : "")
                print stack[j--] seps[i] 
            } 
            prev = seps[i]
        }
    }
    
    # elementary gnu awk split compatible replacement
    function gsplit(str, arr, pat, seps,    i) {
        delete arr; delete seps; i=0
        while(match(str, pat)) {
            arr[++i]=substr(str,1,(RSTART-1))
            seps[i]=substr(str,RSTART,RLENGTH)
            str=substr(str,(RSTART+RLENGTH))
        }
        arr[++i]=substr(str,1)
        return i
    }
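
To make the role of gsplit() concrete, here is a minimal sketch of what it returns for one small string. The delimiters `<<` and `>>` and the wrapper name `gsplit_demo` are hypothetical, chosen only for illustration, since the delimiter pattern is not shown in the listing above. The sketch also clears the arrays with `split("", arr)` instead of `delete arr`, on the assumption that this is the more portable idiom.

```shell
# Sketch only: "<<" and ">>" are hypothetical delimiters, not the ones from the question.
gsplit_demo() {
    awk '
    # Split str on regex pat; fields go to arr, matched separators to seps.
    function gsplit(str, arr, pat, seps,    i) {
        split("", arr); split("", seps); i = 0   # portable way to clear arrays
        while (match(str, pat)) {
            arr[++i] = substr(str, 1, RSTART - 1)    # text before the separator
            seps[i]  = substr(str, RSTART, RLENGTH)  # the separator itself
            str      = substr(str, RSTART + RLENGTH) # continue after it
        }
        arr[++i] = str                               # trailing text
        return i
    }
    BEGIN {
        n = gsplit("a<<b>>c", arr, "<<|>>", seps)
        for (i = 1; i <= n; i++)
            printf "arr[%d]=%s seps[%d]=%s\n", i, arr[i], i, seps[i]
    }'
}
gsplit_demo
```

There is always one separator fewer than there are fields, so the last seps entry printed is empty. Note that the pattern must not match the empty string, otherwise the while loop never advances.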
    

    Run it:

    $ awk -f program.awk file
     .. a<2 .. 
     .. 1  .. 
     .. 4 .. 
     .. @{<>}@  .. 
     .. 5 .. 
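
For readers who want to try the approach end to end, the following is a hedged sketch of the same front-to-back loop run on a one-line input. The delimiters `<<` and `>>`, the sample input, and the wrapper name `nested_demo` are all assumptions made for illustration and are not the data from the question. The loop stops at `i < n` here because seps holds one entry fewer than arr.

```shell
# Sketch only: hypothetical delimiters "<<" (open) and ">>" (close).
nested_demo() {
    printf '%s\n' 'x<<1<<a>>2>>y' | awk '
    function gsplit(str, arr, pat, seps,    i) {
        split("", arr); split("", seps); i = 0
        while (match(str, pat)) {
            arr[++i] = substr(str, 1, RSTART - 1)
            seps[i]  = substr(str, RSTART, RLENGTH)
            str      = substr(str, RSTART + RLENGTH)
        }
        arr[++i] = str
        return i
    }
    { data = data $0 }                      # append all records to one var
    END {
        n = gsplit(data, arr, "<<|>>", seps)
        for (i = 1; i < n; i++) {           # seps has n-1 entries
            if (seps[i] == "<<")            # opening delimiter: push it with its text
                stack[++j] = seps[i] arr[i+1]
            else {                          # closing delimiter: finish and pop
                stack[j] = stack[j] (seps[i] == prev ? arr[i] : "")
                print stack[j--] seps[i]
            }
            prev = seps[i]
        }
    }'
}
nested_demo
```

As in the answer's own output, the innermost block is printed first and the outer block is printed without the nested block it contained.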
    
