Extract text between two strings on different lines

Anonymous (unverified), posted 2019-12-03 02:38:01

Question:

I have a big email file with the following random hosts:

......
HOSTS: test-host,host2.domain.com,
host3.domain.com,another-testing-host,host.domain.
com,host.anotherdomain.net,host2.anotherdomain.net,
another-local-host, TEST-HOST

DATE: August 11 2015 9:00
.......

The hosts are always delimited with a comma, but they can be split across one, two, or more lines (I can't control this; it's what email clients do, unfortunately).

So I need to extract all the text between the string "HOSTS:" and the string "DATE:", wrap it, and replace the commas with new lines, like this:

test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST

So far I came up with this, but I lose everything that's on the same line with "HOSTS":

sed '/HOST/,/DATE/!d;//d' ${file} | tr -d '\n' | sed -E "s/,\s*/\n/g" 

Answer 1:

Something like this might work for you:

sed -n '/HOSTS:/{:a;N;/DATE/!ba;s/[[:space:]]//g;s/,/\n/g;s/.*HOSTS:\|DATE.*//g;p}' "$file" 

Breakdown:

-n                       # Disable automatic printing
/HOSTS:/ {               # Match a line containing literal HOSTS:
  :a;                    # Label used for branching (goto)
  N;                     # Append the next line to the pattern space
  /DATE/!ba              # As long as literal DATE is not matched, go to :a
  s/[[:space:]]//g;      # Remove all whitespace, including newlines
  s/,/\n/g;              # Replace each comma with a newline
  s/.*HOSTS:\|DATE.*//g; # Remove everything up to and including literal HOSTS:,
                         # and everything from literal DATE onward
  p                      # Print the pattern space
}
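To try the command without a real mailbox, the sketch below recreates the sample from the question in a temporary file and runs the same sed program on it. It assumes GNU sed (for `\|`, for `\n` in the replacement, and for `;` after a label); the function name extract_hosts and the temporary file are illustrative only and are not part of the answer.

```shell
#!/bin/sh
# extract_hosts is a hypothetical wrapper around the sed program above.
# Assumes GNU sed.
extract_hosts() {
    sed -n '/HOSTS:/{:a;N;/DATE/!ba;s/[[:space:]]//g;s/,/\n/g;s/.*HOSTS:\|DATE.*//g;p}' "$1"
}

# Recreate the sample from the question, including the awkward line breaks.
sample=$(mktemp)
cat > "$sample" <<'EOF'
......
HOSTS: test-host,host2.domain.com,
host3.domain.com,another-testing-host,host.domain.
com,host.anotherdomain.net,host2.anotherdomain.net,
another-local-host, TEST-HOST

DATE: August 11 2015 9:00
.......
EOF

extract_hosts "$sample"   # prints one host per line
rm -f "$sample"
```

Because all whitespace is deleted before the commas are converted, a host name that was broken across two lines (host.domain. / com) is joined back together.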


Answer 2:

This awk one-liner may help:

awk -v RS='HOSTS: *|DATE:' 'NR==2{gsub(/\n/,"");gsub(/,/,"\n");print}' input 


Answer 3:

Another awk, with tr:

$ awk '/^HOSTS:/{$1="";p=1} /^DATE:/{p=0} p' file | tr -d ' \n' | tr ',' '\n'; echo ""
test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST


Answer 4:

Here is another sed script, that might work for you:

script.sed

/HOSTS:/,/DATE/ {
    /DATE/! H;                        # append to the hold space
    /DATE/ { g;                       # copy the hold space into the pattern space
             s/([\n ])|(HOSTS:)//g;   # remove unwanted strings
             s/,/\n/g;                # replace commas with newlines
             p;                       # print
    }
}

Use it this way: sed -nrf script.sed yourfile.

The middle block is applied to lines that are in the range between HOSTS: and DATE. In the middle block, lines that do not match DATE are appended to the hold space, and the line matching DATE triggers the longer action.
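What ends up in the hold space can be made visible with sed's `l` command, which prints the pattern space with newlines shown as `\n` and a `$` at the end. The sketch below is a cut-down variant of the script, not the answer's exact code, run on a made-up three-line input; the function name show_collected is hypothetical.

```shell
# Hypothetical helper: collect the HOSTS..DATE block in the hold space,
# then dump it with `l` instead of cleaning it up and printing it.
show_collected() {
    sed -n -e '/HOSTS:/,/DATE/{' \
           -e '/DATE/!H' \
           -e '/DATE/{' -e 'g' -e 'l' -e '}' \
           -e '}'
}

printf '%s\n' 'HOSTS: a,' 'b' 'DATE: today' | show_collected
# prints: \nHOSTS: a,\nb$
```

`H` always puts a newline in front of the line it appends, which explains the leading `\n` in the dump and why the real script strips newlines along with the HOSTS: label.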



Answer 5:

Perl to the rescue!

perl -ne '
    if (my $l = (/^HOSTS:/ .. /^DATE:/)) {
        chomp;
        s/^HOSTS:\s+// if 1 == $l;
        s/DATE:.*// if $l =~ /E/;
        s/,\s*/\n/g;
        print;
    }' input-file > output-file

The flip-flop operator .. returns a number, in this case indicating the line number in the current block. We can therefore easily remove the HOSTS: from the first line (1 == $l). The last line can be recognised by the E0 appended to the number, that's how we remove the DATE:...



Answer 6:

cat ${file} | awk 'BEGIN {A=0;} /^HOST/ {A=1;} /^DATE/ {A=0} {if (A==1) print;}' | tr -d '\n' | sed -E "s/,\s*/\n/g" | sed -e 's/^HOSTS\s*:\s*//'


Answer 7:

$ awk 'sub(/^HOSTS: /,""){rec=""} /^DATE/{gsub(/ *, */,"\n",rec); print rec; exit} {rec = rec $0}' file
test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST

