Is there a field that stores the exact field separator FS used when in a regular expression, equivalent to RT for RS?

Question

Section 4.1.2 of the GNU Awk manual, "Record Splitting with gawk", says:

When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.

This variable RT is very useful in some cases.
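As a quick illustration of the quoted behaviour, here is a minimal sketch that prints each record next to the text RT captured for it. The function name show_rt is only an illustrative choice, and the example assumes gawk is installed, since RT is a gawk extension.

```shell
# Print each record together with the exact text that terminated it (gawk only).
show_rt() {
    gawk -v RS='[;|]' '{ printf "record=[%s] RT=[%s]\n", $0, RT }'
}

# The input has no trailing newline, so the last record ends at end of input.
printf 'a;b|c' | show_rt
# record=[a] RT=[;]
# record=[b] RT=[|]
# record=[c] RT=[]
```

The last record is terminated by the end of the input rather than by a match of RS, so RT is empty there.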

Similarly, we can set a regular expression as the field separator. For example, here we allow it to be either ";" or "|":

$ gawk -F';' '{print NF}' <<< "hello;how|are you"
2  # there are 2 fields, since ";" appears once
$ gawk -F'[;|]' '{print NF}' <<< "hello;how|are you"
3  # there are 3 fields, since ";" appears once and "|" also once

However, if we want to pack the data again, we don't have a way to know which separator appeared between two fields. So if in the previous example I want to loop through the fields and print them together again by using FS, it prints the whole expression in every case:

$ gawk -F'[;|]' '{for (i=1;i<=NF;i++) printf ("%s%s", $i, FS)}' <<< "hello;how|are you"
hello[;|]how[;|]are you[;|]  # a literal "[;|]" shows in the place of FS

Is there a way to "repack" the fields using the specific separator that split each one of them, similar to what RT allows for records?

(The examples in this question are deliberately simple; they are only meant to illustrate the point.)

Answer 1:

Is there a way to "repack" the fields using the specific field separator used to split each one of them

Yes, with GNU awk's split(), which accepts an optional 4th argument: an array that receives the text matched by the separator regex between each pair of fields:

s="hello;how|are you"
awk 'split($0, flds, /[;|]/, seps) {for (i=1; i in seps; i++) printf "%s%s", flds[i], seps[i]; print flds[i]}' <<< "$s"

hello;how|are you

A more readable version:

s="hello;how|are you"
awk 'split($0, flds, /[;|]/, seps) {
   for (i=1; i in seps; i++)
      printf "%s%s", flds[i], seps[i]
   print flds[i]
}' <<< "$s"

Note the 4th parameter, seps, in split(): it is an array filled with the text matched by the regular expression given as the 3rd parameter, i.e. /[;|]/, so seps[i] is the separator found between flds[i] and flds[i+1].

Of course, this is not as short and simple as the record-level equivalent with RS, ORS and RT, which can be written as:

awk -v RS='[;|]' '{ORS = RT} 1' <<< "$s"
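The same seps array also helps when a field has to be changed before repacking. The following is a minimal sketch under that assumption, not part of the original answer: it upper-cases the second field and rebuilds the line with the original separators. The function name upper_second is hypothetical, and gawk is required for the 4-argument split().

```shell
# Upper-case field 2 and rebuild the line with the separators that were
# actually seen (gawk only: uses the 4th argument of split()).
upper_second() {
    gawk '{
        n = split($0, flds, /[;|]/, seps)
        flds[2] = toupper(flds[2])
        out = ""
        for (i = 1; i <= n; i++)
            out = out flds[i] seps[i]   # seps[n] is unset here, so it adds ""
        print out
    }'
}

printf 'hello;how|are you\n' | upper_second
# hello;HOW|are you
```

Assigning to $2 directly would instead rebuild $0 with OFS, which loses the distinction between ";" and "|".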

Answer 2:

As @anubhava mentions, gawk has split() (and patsplit() which is to FPAT as split() is to FS - see https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions) to do what you want. If you want the same functionality with a POSIX awk then:

$ cat tst.awk
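# Split str on the regex fs: fields go into flds[1..nf], and the text that
# followed flds[N] goes into seps[N]. nf is a local variable.
function getFldsSeps(str,flds,fs,seps,  nf) {
    delete flds
    delete seps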

    if ( fs == " " ) {
        fs = "[[:space:]]+"
        if ( match(str,"^"fs) ) {
            seps[0] = substr(str,RSTART,RLENGTH)
            str = substr(str,RSTART+RLENGTH)
        }
    }

    while ( match(str,fs) ) {
        flds[++nf] = substr(str,1,RSTART-1)
        seps[nf]   = substr(str,RSTART,RLENGTH)
        str = substr(str,RSTART+RLENGTH)
    }

    if ( str != "" ) {
        flds[++nf] = str
    }

    return nf
}

{
    print
    nf = getFldsSeps($0,flds,FS,seps)
    for (i=0; i<=nf; i++) {
        printf "{%d:[%s]<%s>}%s", i, flds[i], seps[i], (i<nf ? "" : ORS)
    }
}

Note the specific handling above of the case where the field separator is " " because that means 2 things different from all other field separator values:

Fields are actually separated by chains of any white space, and
Leading white space is to be ignored when populating $1 (or flds[1] in this case), so that white space, if it exists, must be captured in seps[0] for our purposes, since every seps[N] is associated with the flds[N] that precedes it.
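The two points above can be checked directly with any POSIX awk. This is a small sketch: the helper name describe_fields is hypothetical, and the regex FS '[ \t]+' is used only as a contrast with the default FS.

```shell
# Report the field count and first field; any extra arguments (such as -F)
# are passed straight through to awk.
describe_fields() {
    awk "$@" '{ printf "NF=%d $1=[%s]\n", NF, $1 }'
}

line='   hello   how  are_you'

# Default FS (" "): leading blanks are skipped and runs of blanks separate fields.
printf '%s\n' "$line" | describe_fields
# NF=3 $1=[hello]

# Regex FS: the leading blanks now act as a separator after an empty $1.
printf '%s\n' "$line" | describe_fields -F'[ \t]+'
# NF=4 $1=[]
```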

For example, running tst.awk on these 3 input files:

$ head file{1..3}
==> file1 <==
hello;how|are you

==> file2 <==
hello how are_you

==> file3 <==
    hello how are_you

we'd get the following output where each field is displayed as the field number then the field value within [...] then the separator within <...>, all within {...} (note that seps[0] is populated IFF the FS is " " and the record starts with white space):

$ awk -F'[,|]' -f tst.awk file1
hello;how|are you
{0:[]<>}{1:[hello;how]<|>}{2:[are you]<>}

$ awk -f tst.awk file2
hello how are_you
{0:[]<>}{1:[hello]< >}{2:[how]< >}{3:[are_you]<>}

$ awk -f tst.awk file3
    hello how are_you
{0:[]<    >}{1:[hello]< >}{2:[how]< >}{3:[are_you]<>}
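For the patsplit() route mentioned at the start of this answer, a minimal sketch might look like the following. It assumes gawk, describes the fields with the pattern /[^;|]+/ (anything that is not a separator), and uses the hypothetical function name repack_patsplit.

```shell
# Rebuild each line from the pieces patsplit() found (gawk only).
# flds[i] holds the text matching the field pattern, and seps[i] holds the
# text that followed flds[i]; seps[0] holds any text before the first field.
repack_patsplit() {
    gawk '{
        n = patsplit($0, flds, /[^;|]+/, seps)
        printf "%s", seps[0]
        for (i = 1; i <= n; i++)
            printf "%s%s", flds[i], seps[i]
        print ""
    }'
}

printf 'hello;how|are you\n' | repack_patsplit
# hello;how|are you
```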

Answer 3:

An alternative to split() is to use match() to find the field separators and store them in an array:

awk -F'[;|]' '{
    str=$0                                  # Work on a copy of the line
    cnt=0; delete map                       # Forget separators from the previous record
    while (match(str,FS)) {                 # Loop through each match of the field separator
      map[++cnt]=substr(str,RSTART,RLENGTH) # Store the matched separator text
      str=substr(str,RSTART+RLENGTH)        # Continue with the text after the match
    }
    for (i=1;i<=NF;i++) {
      printf "%s%s",$i,map[i]               # Print each field followed by the separator that came after it
    }
    printf "\n"
   }' <<< "hello;how|are you"

Source: https://stackoverflow.com/questions/65560326/is-there-a-field-that-stores-the-exact-field-separator-fs-used-when-in-a-regular

Tags

awk

gnu