In awk, why are "" and "\n\n" treated the same for the RS parameter?

Submitted by 99封情书 on 2021-02-08 09:32:18

Question


Here are the contents of the file:

Person Name
123 High Street
(222) 466-1234

Another person
487 High Street
(523) 643-8754

These two commands give the same result:

$ awk 'BEGIN{FS="\n"; RS="\n\n"} {print $1, $3}' file_contents
$ awk 'BEGIN{FS="\n"; RS=""} {print $1, $3}' file_contents

The result given in both cases is:

Person Name (222) 466-1234
Another person (523) 643-8754

RS="\n\n" actually makes sense, but why is RS="" also treated the same way?


Answer 1:


They aren't treated the same.

  • RS="" invokes paragraph mode in all awks: the input is split into records separated by runs of one or more empty lines, and a newline is added to the field separators if the existing FS is a single character. (Note: the POSIX standard is incorrect in this area, as it implies \n would be added to any FS, but that's not the case; see https://lists.gnu.org/archive/html/bug-gawk/2019-04/msg00029.html.)
  • RS="\n\n" works in GNU awk, where it sets the record separator to exactly two consecutive newlines (i.e. a single blank line) and does not affect FS. In most other awks the 2nd \n is ignored. (More than 1 char in a RS is undefined behavior per POSIX, so they COULD do anything, but ignoring the extra characters is by far the most common implementation.)
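As a minimal sketch of the paragraph-mode bullet, the snippet below should behave the same in any POSIX-style awk. The sample data and the variable names `sample` and `paragraph_out` are made up for illustration. It uses FS="\n" as in the question, with three blank lines between the blocks:

```shell
# Hypothetical sample: two blocks separated by three blank lines.
sample=$(printf 'Person Name\n123 High Street\n\n\n\nAnother person\n487 High Street\n')

# RS="" selects paragraph mode, so the whole run of blank lines
# counts as one record separator and no empty record appears.
paragraph_out=$(printf '%s\n' "$sample" |
  awk 'BEGIN { FS = "\n"; RS = "" } { print NR ": " $1 " | " $2 }')

printf '%s\n' "$paragraph_out"
```

Only two records are printed here. Replacing RS="" with RS="\n\n" is deliberately left out of the sketch, because the result of that depends on which awk you run.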

Look what happens when you have 3 blank lines between your 2 blocks of text and use a FS other than \n (e.g. ,):

$ cat file
Person Name
123 High Street
(222) 466-1234



Another person
487 High Street
(523) 643-8754


$ gawk 'BEGIN{FS=","; RS=""} {print NR, NF, "<" $0 ">\n"}' file
1 3 <Person Name
123 High Street
(222) 466-1234>

2 3 <Another person
487 High Street
(523) 643-8754>


$ gawk --posix 'BEGIN{FS=","; RS=""} {print NR, NF, "<" $0 ">\n"}' file
1 3 <Person Name
123 High Street
(222) 466-1234>

2 3 <Another person
487 High Street
(523) 643-8754>


$ gawk 'BEGIN{FS=","; RS="\n\n"} {print NR, NF, "<" $0 ">\n"}' file
1 1 <Person Name
123 High Street
(222) 466-1234>

2 0 <>

3 1 <Another person
487 High Street
(523) 643-8754>


$ gawk --posix 'BEGIN{FS=","; RS="\n\n"} {print NR, NF, "<" $0 ">\n"}' file
1 1 <Person Name>

2 1 <123 High Street>

3 1 <(222) 466-1234>

4 0 <>

5 0 <>

6 0 <>

7 1 <Another person>

8 1 <487 High Street>

9 1 <(523) 643-8754>

10 0 <>

Note the different values for NR and NF and different $0 contents being printed.




Answer 2:


Because the POSIX awk specification says so:

If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
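The part of that wording about leading and trailing blank lines can be checked with a short sketch. The input is invented for illustration and `record_count` is a hypothetical variable name. It assumes a POSIX-style awk on the PATH:

```shell
# Two leading blank lines, one blank line between the blocks,
# and two trailing blank lines (made-up input).
record_count=$(printf '\n\nfirst block\n\nsecond block\n\n\n' |
  awk 'BEGIN { RS = "" } END { print NR }')

# Paragraph mode should report 2 records, with no empty record
# at the beginning or end of the input.
printf '%s\n' "$record_count"
```

The sketch only counts records. It does not exercise the claim about <newline> always being a field separator, which is the point on which awk implementations differ.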



Source: https://stackoverflow.com/questions/57851531/in-awk-why-are-and-n-n-treated-the-same-for-the-rs-parameter
