How can I preserve an embedded TAB character?

Question

EDIT 2019-Oct-11 - Simple example
- removed previous example

I want awk to respect a TAB character embedded in $0 as content when it reprocesses the $0 input record after a field value change ($1, $2, ...).

Here is a short example. In the output below, "t @ 48", for example, means there is a TAB at position 48 in the $0 data record. Note that "\t" is expanded to a TAB, chr(9), as the initial processing of the input (labeled raw).

Example output:

 $ ./tmp.awk   tmp.input 

raw $0:      '    line with spaces here     a tab between AAA\tBBB', t @ 0, NF = 8, len = 52.
$1:          'line', len = 4.
unescape $0: '    line with spaces here     a tab between AAA   BBB', t @ 48, NF = 9, len = 51.

$1 = $1, $0: 'line with spaces here a tab between AAA BBB', t @ 0, NF = 9, len = 43.

unescape $0: '    line with spaces here     a tab between AAA   BBB', t @ 48, NF = 9, len = 51.
$1 = "", $0: ' with spaces here a tab between AAA BBB', t @ 0, NF = 9, len = 39.

final $0:    ' with spaces here a tab between AAA BBB', t @ 0, NF = 9, len = 39.

When "\t" is expanded and $0 updated, awk correctly re-splits the record and gives 9 fields (no longer 8).

Input record is:

line with spaces here     a tab between AAA\tBBB

Desired result:

The end goal is to be able to remove the content of field $1 while preserving all the formatting and spacing, as shown:

 $0:  '     with spaces here     a tab between AAA  BBB', t @ 44, NF = 8, len = 47.

Only the characters of $1, which is "line", are removed. The TAB between "AAA" and "BBB" is kept. I have shown one less field (NF = 8). Awk itself appears to retain the empty $1 cell, so NF = 9 would also be acceptable.

The problem appears after the line labeled "$1 = $1", when we change the value of $1.

{
     :
    print "    unescape $0: '" $0 "', t @ " index( $0, "\t" ) ", NF = " NF ", len = " length( $0 ) ".";

    $1 = $1;  # force record to be reconstituted

    print "    $1 = $1, $0: '" $0 "', t @ " index( $0, "\t" ) ", NF = " NF ", len = " length( $0 ) ".";
}

Output:

unescape $0: '    line with spaces here     a tab between AAA   BBB', t @ 48, NF = 9, len = 51.
$1 = $1, $0: 'line with spaces here a tab between AAA BBB', t @ 0, NF = 9, len = 43.

Note that I still have 9 fields on this line, but there is NO LONGER a TAB character, and the multiple spaces after "here" have been removed. These formatting changes are undesirable for this use case.

I get this result consistently, NO matter what values I enter for the field separator FS (even a line feed) and OFS. Actually, changing OFS makes things much worse.

The behaviour was not anticipated. However, after some comments, it might be that this is prescribed no matter what.

Sample awk script:

{
    print "";
    print "raw $0:      '" $0 "', t @ " index( $0, "\t" ) ", NF = " NF ", len = " length( $0 ) ".";
    print "$1:          '" $1 "', len = " length( $1 ) ".";

    gsub(/\\t/, "\t", $0);      #  expand any embedded TAB-s
    print "unescape $0: '" $0 "', t @ " index( $0, "\t" ) ", NF = " NF ", len = " length( $0 ) ".";
    preserve = $0;

    print "";
    $1 = $1;  # force record to be reconstituted
    print "$1 = $1, $0: '" $0 "', t @ " index( $0, "\t" ) ", NF = " NF ", len = " length( $0 ) ".";

    print "";
    $0 = preserve;
    print "unescape $0: '" $0 "', t @ " index( $0, "\t" ) ", NF = " NF ", len = " length( $0 ) ".";

    $1 = "";   

    print "$1 = \"\", $0: '" $0 "', t @ " index( $0, "\t" ) ", NF = " NF ", len = " length( $0 ) ".";
    print "";

    print "final $0:    '" $0 "', t @ " index( $0, "\t" ) ", NF = " NF ", len = " length( $0 ) ".";
    print "";

}

Questions ...

How can I get the desired behaviour, meaning no edit of the record when a field is removed?
- If that is not possible, is there a method which preserves the integrity and spacing of the 'current' $0 record?
- For example, I was looking for an array that maps all the fields to the $0 record, but didn't find it.
How is it possible to preserve the TAB in the example?
Can this editing of the $0 record be prevented?

Characters have been deleted. Examination shows that awk has edited out (deleted) the repeated spaces and the TAB.

The single space is not the culprit, it would appear to be the reconstitution or manufacture of the $0 record.

Reference:

The relevant passages from the GNU Awk User's Guide:

Fields are normally separated by whitespace sequences (spaces, TABs, and newlines), not by single spaces. Two spaces in a row do not delimit an empty field. The default value of the field separator FS is a string containing a single space, " ".

I get that an FS of a single space is special. However, even when I use a strange FS like "W" or "\n", characters are still deleted from $0 following the $1 = $1 rebuild step.

Conclusion: the FS is not used when reprocessing $0.

Assigning a new string to $0 has worked as expected. The number of fields goes up because awk recognised the TAB character. I must point out that awk did not delete the TAB in this case (as desired).

Changing Fields (GNU Awk User's Guide):

Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current values of the fields and OFS. To do this, use the seemingly innocuous assignment:

  $1 = $1   # force record to be reconstituted
  print $0  # or whatever else with $0

This forces awk to rebuild the record. It does help to add a comment, as we’ve shown here.

The version used:

gawk -V
GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2018 Free Software Foundation

Ubuntu 19.04

This instruction doesn't warn that the $0 can have 9 characters removed, or even hint that $0 will be affected.

Other unexplained aspects:

Is there an explanation for this?
Is this gawk only, or is it common across alternative awks?

Personally, I was very happy when $0 did not change. There are many times when I want awk for its ability to structure data while preserving the unstructured source for output.

Looking forward to your thoughts.

Answer 1:

Here is my attempt to answer your question.

1st Answer (why tabs are NOT getting preserved): What does $1=$1 mean in awk? When we do $1=$1 for a line, we are asking awk to rebuild that line. What does this actually mean? It means the fields are joined again using OFS (the output field separator), whose default value is a single space. Here is an example:

Let's say we have the following Input_file:

cat Input_file
a       b       c       d e

1st Scenario: When I run the first code without setting any OFS value, see what happens:

awk '1' Input_file
a       b       c       d e

It prints the line as it appears in Input_file with NO changes.

2nd Scenario: Now let's set OFS to \t and run the program:

awk 'BEGIN{OFS="\t"};1' Input_file
a       b       c       d e

You can see there is still NO change in the output, even though we have set OFS="\t".

3rd Scenario: Now let's set OFS="\t" and also rebuild the line:

awk 'BEGIN{OFS="\t"} {$1=$1} 1' Input_file
a       b       c       d       e

You can see that a TAB now appears between d and e. Why? Because when we asked awk to rebuild the line, it took OFS into consideration and applied it between all of the line's fields, so the TAB came into existence.

From the awk man page:

Assigning a value to an existing field causes the whole record to be rebuilt when $0 is referenced. Similarly, assigning a value to $0 causes the record to be resplit, creating new values for the fields.
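
The rebuild described in the man page can be observed directly. The following is only a minimal sketch, not anything from the original post: the helper name rebuild_demo is made up for illustration, and it assumes a POSIX awk and the default FS. It prints a record before and after a forced rebuild, with OFS taken from the first argument.

```sh
# rebuild_demo is a hypothetical helper: it prints each record before and
# after a forced rebuild, using the OFS passed as the first argument.
rebuild_demo() {
  awk -v OFS="$1" '{
    print "before: [" $0 "]"
    $1 = $1                    # assigning any field marks $0 for rebuilding
    print "after:  [" $0 "]"   # the fields are now joined with OFS
  }'
}

# Three spaces and a TAB go in as separators.
printf 'a   b\tc\n' | rebuild_demo '-'
```

With this input, the "after" line should read [a-b-c]: the original separators are not stored anywhere, so the rebuilt record can only contain OFS between fields.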

2nd Answer (how to preserve tabs and spaces as they are in a line): Take the same Input_file as above. Let's say you want to replace the character e without inserting a TAB between d and e. We can simply do a substitution on the line; since no field is assigned, the line is not rebuilt and no TAB is inserted between d and e:

cat Input_file
a       b       c       d e
awk 'BEGIN{OFS="\t"}{sub(/e/,"f")}1' Input_file
a       b       c       d f
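
The same idea can be carried over to the record in the question: edit the text of $0 directly and never assign to a field. The following is a minimal sketch under stated assumptions, not a definitive implementation. The helper name set_field_keep_spacing and its parameters k and val are made up for illustration, fields are assumed to be separated by runs of spaces and TABs (the default FS behaviour), and only POSIX awk functions (gsub, match, substr) are used.

```sh
# set_field_keep_spacing K VALUE (hypothetical helper):
# replace field K with VALUE while copying every separator verbatim.
# Note: awk interprets escape sequences in VALUE because it is passed with -v.
set_field_keep_spacing() {
  awk -v k="$1" -v val="$2" '{
    gsub(/\\t/, "\t")                  # expand a literal \t into a real TAB, as in the question
    rest = $0; out = ""; n = 0
    while (match(rest, /[^ \t]+/)) {   # next run of non-blank characters is the next field
      n++
      out = out substr(rest, 1, RSTART - 1)                      # separator before field n, unchanged
      out = out (n == k ? val : substr(rest, RSTART, RLENGTH))   # field n, or its replacement
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest                     # rest holds any trailing whitespace
  }'
}

# Blank out the first field of the sample record from the question.
printf '    line with spaces here     a tab between AAA\\tBBB\n' |
  set_field_keep_spacing 1 ''
```

With k = 1 and an empty value, the sample record keeps its leading blanks, the five spaces after "here", and the TAB between AAA and BBB, which corresponds to the desired result in the question (t @ 44, len = 47). Because $0 is only ever assigned as a whole string, awk re-splits it but never rebuilds it from the fields.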

3rd Answer (about assigning a value to the whole line itself): Let's look at these examples.

awk 'BEGIN{OFS="\t"} {$0="1 2 3 4 5"} 1' Input_file
1 2 3 4 5

We can see that assigning a new value to the whole line didn't set TAB as the separator, since no rebuild of the line happened. Now let's see what happens when the line is rebuilt.

awk 'BEGIN{OFS="\t"} {$0="1 2 3 4 5";$1=$1} 1' Input_file
1       2       3       4       5
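
The other direction, re-splitting after $0 changes, can be checked by watching NF. This is a small sketch with a made-up helper name (resplit_demo), assuming a POSIX awk and the default FS: it records NF, replaces the literal two-character sequence \t in $0 with a real TAB, and prints both field counts.

```sh
# resplit_demo is a hypothetical helper: it prints NF before and after the
# literal sequence \t in $0 is replaced by a real TAB character.
resplit_demo() {
  awk '{
    before = NF
    gsub(/\\t/, "\t")   # modifies $0, so awk re-splits it using the current FS
    print before, NF
  }'
}

# One field on input; the expanded TAB then acts as a separator.
printf 'AAA\\tBBB\n' | resplit_demo
```

For this input the two counts should be 1 and 2, and the TAB itself stays in $0, because a re-split only recomputes the fields and does not rewrite the record.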

I hope I understood your question correctly; if you have any more queries, feel free to comment on this post. I have also tested with this sample file that the length of the Input_file line does not change. You need to provide samples in your question so that others can understand it better.

Source: https://stackoverflow.com/questions/58315036/how-can-i-preserve-an-embedded-tab-character

Tags

Linux

awk