I want to print the text inside " ". For example, I have the following strings:
gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
gfggf "kfdjfdgfhbg" "fhfghg" jhgj
jhfjhg "dfgdf" fgf
fgfdg "dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd" hgjghj
And I want to print only the following:
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"
I have tried awk with the following regular expression:
awk '{for(i = 1; i <= NF; i++) if($i ~ /^\"[A-Za-z.$]*([A-Za-z.$][[:space:]]*[A-Za-z.$])*\"$/) print $i}' sample.txt
but it prints everything before space and actually does not recognize the spaces I have defined in my regular expression. My current output is:
"jkfgh"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj
As you can see, only the ones without any space are printed correctly.
I have also tried [[:blank:]], \t and ' ', but none of them worked.
I would appreciate it if someone could tell me how to change this regular expression to include spaces.
You are just getting those without any space because you loop through fields and they are space separated. Thus, you need to change the approach to something handling the spaces differently. Assuming there are no nested quotes, you can use for example:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "\"%s\"", $i; print ""}' file
That is, use " as the field separator and print the even-numbered fields.
The same can be written more elegantly by reusing FS:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s", FS, $i, FS; print ""}' file
Note in the previous approaches the output has no space in between fields. If you need it, you can use:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
The trick (i>NF-2?"\n":" ") prints each field together with the separator that should follow it: a newline after the last quoted field, a space otherwise. More idiomatically, you can also say (i>NF-2?ORS:OFS), using the default values of ORS (output record separator, newline) and OFS (output field separator, space).
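Spelled out in full, that variant might look like the sketch below. The sample line is inlined with printf only so the snippet runs on its own, the shell variable name out is purely illustrative, and ORS and OFS are assumed to still hold their defaults.

```shell
# Sample line inlined so the snippet is self-contained.
out=$(printf '%s\n' 'gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj' |
  awk -F'"' '{
    # Even-numbered fields are the quoted strings; re-wrap each one in FS.
    # The last one is followed by ORS (newline), the others by OFS (space).
    for (i = 2; i < NF; i += 2)
      printf "%s%s%s%s", FS, $i, FS, (i > NF-2 ? ORS : OFS)
  }')
printf '%s\n' "$out"
```

If ORS or OFS has been reassigned elsewhere in the program, the output changes accordingly, which is the point of using the variables rather than literals.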
Test
$ awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"
The question's title is misleading and based on a fundamental misconception about awk.
The naïve answer is that a space can simply be represented as itself (a literal) in regular expressions in awk.
More generally, you can use [[:space:]] to match any whitespace character, such as a space, a tab or a newline (GNU Awk also supports \s), and [[:blank:]] to match a space or a tab.
However, the crux of the problem is that Awk, by default, splits each input line into fields by whitespace. By definition, then, no input field contains whitespace, and any attempt to match a space in a field value will invariably fail.
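A quick way to see this, as a minimal sketch, is to print every field of one sample line in brackets with the default field splitting. The variable name fields is only for illustration.

```shell
# Print each whitespace-separated field of one sample line in brackets.
fields=$(printf '%s\n' 'gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj' |
  awk '{ for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }')
printf '%s\n' "$fields"
# prints: [gfdg]["jkfgh"]["jkfd][fdgj][fd-"][ghjhgj]
```

The second quoted string arrives as three separate fields, none of which contains a space, so a per-field regex never gets to see one.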
The input at hand has fields that are a mix of unquoted and quoted strings, but POSIX Awk has no support for recognizing quoted strings as fields.
@fedorqui has made a valiant attempt to work around the problem by splitting input into fields by double quotes, but it's no substitute for proper recognition of quoted strings, because it doesn't preserve the true field boundaries.
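One way to read that objection is to dump the fields that the double-quote separator actually produces. The following is only a sketch on a single sample line, with an illustrative variable name split:

```shell
# Number and bracket each field produced when " is the field separator.
split=$(printf '%s\n' 'jhfjhg "dfgdf" fgf' |
  awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$split"
# prints:
# 1=[jhfjhg ]
# 2=[dfgdf]
# 3=[ fgf]
```

The quoted string lands in field 2 without its quotes, while the unquoted tokens show up as odd-numbered fields with stray spaces attached. Several unquoted words in a row would end up lumped into a single field.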
If you have GNU Awk, you can approximate recognition of quoted strings using the special FPAT variable, which, rather than defining a separator to split lines by, allows defining a regex that describes fields (and ignores tokens not recognized as such):
re='[[:alpha:]][[:alpha:] ]*[[:alpha:]]' # aux. shell variable
gawk -v FPAT="\"$re\"|'$re'" '{
for(i=1;i<=NF;++i) printf "%s%s", $i, (i==NF ? "\n" : " ")
}' sample.txt
This will work with single- and double-quoted strings.
Explanation:
- FPAT="\"$re\"|'$re'" defines fields to be either double- or single-quoted strings consisting only of letters and spaces, with a letter on either end (similar to the OP's code). A quoted string containing any other character, such as "jkfd fdgj fd-" in the sample input with its trailing hyphen, is therefore not recognized as a field.
  - Note that this automatically excludes the unquoted tokens on each input line: they will not be reflected in $1, ... and NF.
  - Therefore, the loop for(i=1;i<=NF;++i) is already limited to enumerating only the matching fields.
Note that the restrictions placed on the contents of the quoted strings in this case happen to sidestep a limitation inherent in this approach, namely the inability to deal with escaped nested quotes (of the same type).
If this limitation is acceptable, you can use the following idiom to tokenize input that is a mix of barewords (unquoted tokens) and quoted strings:
gawk -v "FPAT=[^[:blank:]]+|\"[^\"]*\"|'[^']*'" ...
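Where gawk (and thus FPAT) is not available, a similar extraction can be sketched in POSIX awk by scanning each line with match(). This is a swapped-in technique rather than the FPAT idiom. It handles double quotes only and, like the FPAT approach, assumes there are no escaped quotes inside the strings. The names quoted, line and out are illustrative.

```shell
quoted=$(printf '%s\n' 'fgfdg "dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd" hgjghj' |
  awk '{
    line = $0; out = ""
    # Repeatedly find the next quoted run; match() sets RSTART and RLENGTH.
    while (match(line, /"[^"]*"/)) {
      out = out (out == "" ? "" : " ") substr(line, RSTART, RLENGTH)
      line = substr(line, RSTART + RLENGTH)   # continue after this match
    }
    if (out != "") print out                  # skip lines without quotes
  }')
printf '%s\n' "$quoted"
# prints: "dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"
```

Because the whole line is scanned rather than individual fields, the default whitespace splitting never gets in the way, and any character other than a double quote is allowed inside the strings.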
Source: https://stackoverflow.com/questions/29512854/how-to-define-a-space-in-a-regular-expression-in-awk