How do I use awk to extract data within nested delimiters using non-greedy regexps

前端 未结 2 1929
借酒劲吻你
借酒劲吻你 2020-12-20 22:22

This question occurs repeatedly in many forms with many different multi-character delimiters and so IMHO is worth a canonical answer.

Given an input file like:

相关标签:
2条回答
  • 2020-12-20 23:06

    My (current version of) solution approaches the problem from the front so the output is not exactly the same:

    <foo> .. 1                   # second
      <foo> .. a<2 .. </foo> ..  # first in my approach
    </foo> 
    <foo> .. @{<>}@              # fourth
      <foo> .. 4 .. </foo> ..    # third
    </foo> 
    <foo> .. 5 .. </foo>         # fifth
    

    if the program would traverse the arrays arr and seps backwards, the output would be the same (probably), but I just ran out of time temporarily.

    In Gnu awk (for using split with four params to parse the data).

    EDIT For compatibility with others than Gnu awk I added function gsplit() which is a crude Gnu awk split replacement.

    $ cat program.awk
    { data=data $0 }                         # append all records to one var
    END {
        n=gsplit(data, arr, "</?foo>", seps) # split by every tag
        for(i=1;i<=n;i++) {                  # atm iterate arrays from front to back
            if(seps[i]=="<foo>")             # if element opening tag
                stack[++j]=seps[i] arr[i+1]  # store tag ang wait for closing tag
            else {
                stack[j]=stack[j] (seps[i]==prev ? arr[i] : "")
                print stack[j--] seps[i] 
            } 
            prev = seps[i]
        }
    }
    
    # elementary gnu awk split compatible replacement
    function gsplit(str, arr, pat, seps,    i) {
        delete arr; delete seps; i=0
        while(match(str, pat)) {
            arr[++i]=substr(str,1,(RSTART-1))
            seps[i]=substr(str,RSTART,RLENGTH)
            str=substr(str,(RSTART+RLENGTH))
        }
        arr[++i]=substr(str,1)
        return i
    }
    

    Run it:

    $ awk -f program.awk file
    <foo> .. a<2 .. </foo>
    <foo> .. 1  .. </foo>
    <foo> .. 4 .. </foo>
    <foo> .. @{<>}@  .. </foo>
    <foo> .. 5 .. </foo>
    
    0 讨论(0)
  • 2020-12-20 23:10

    The main challenge is that since awk only supports greedy matching you can't write any variation of <foo>.*</foo> that will stop at the first </foo> on the line instead of the last </foo>. The solution is to convert each start and end string into a single character that cannot appear in the input so you can write x[^xy]*y where x and y are those start/end characters but how do you choose a character that can't appear in the input? You don't - you make one:

    $ cat nonGreedy.awk
    {
        $0 = encode($0)
        while ( match($0,/({[^{}]*})/) ) {
            print decode(substr($0,RSTART,RLENGTH))
            $0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
        }
    }
    function encode(str) {
        gsub(/@/,"@A",str)
        gsub(/{/,"@B",str); gsub(/}/,"@C",str)
        gsub(/<foo>/,"{",str); gsub(/<\/foo>/,"}",str)
        return str
    }
    function decode(str) {
        gsub(/}/,"</foo>",str); gsub(/{/,"<foo>",str)
        gsub(/@C/,"}",str); gsub(/@B/,"{",str)
        gsub(/@A/,"@",str)
        return str
    }
    
    $ awk -f nonGreedy.awk file
    <foo> .. a<2 .. </foo>
    <foo> .. 1  .. </foo>
    <foo> .. 4 .. </foo>
    <foo> .. @{<>}@  .. </foo>
    <foo> .. 5 .. </foo>
    

    The above works by you picking any character that can't appear JUST IN THE START/END STRINGS (note it doesn't have to be a character that can't appear in the input at all, just not in those strings), in this case I'm choosing @, and appending an A after each occurrence of it in the input. At this point every occurrence of @A represents an @ character and there are guaranteed to be no occurrences of @B or @ followed by anything else anywhere in the input.

    Now we can pick 2 other characters that we want to use to represent the start/end strings, in this case I'm choosing { and }, and convert them to some @-prefixed strings like @B and @C and at this point every occurrence of @B represents a { character and @C represents a } character and there are no {s or }s anywhere in the input.

    Now all that's left to do to find the strings we want to extract is convert every start string <foo> to the start character we've chosen, {, and every end string </foo> to the end character } and then we can use a simple regexp of {[^{}]*} to represent a non-greedy version of <foo>.*</foo>.

    As we find each string we just unwind the conversions we did above in reverse order (note you must unwind the substitutions to each matched string in exactly the reverse order you applied them to the whole record) so { goes back to <foo> and @B goes back to {, and @A goes back to @, etc. and we have the original text for that string.

    The above will work in any awk. If your start/end strings contain RE metacharacters then you'd have to escape those or use a while(index(substr())) loop instead of gsub() to replace them.

    Note that if you do use gawk and the labels aren't nested then you can keep the 2 functions exactly as above and change the rest of the script to just:

    BEGIN { FPAT="{[^{}]*}" }
    {
        $0 = encode($0)
        for (i=1; i<=NF; i++) {
            print decode($i)
        }
    }
    

    Obviously you don't really need to put the encode/decode functionality in separate functions, I just separated that out here to make that functionality explicit and separate from the loop that uses it for clarity.

    For another example of when/how to apply the above approach, see https://stackoverflow.com/a/40540160/1745001.

    0 讨论(0)
提交回复
热议问题