RegEx - find a word inside a specific section of a file

Posted by 隐身守侯 on 2021-01-28 06:26:29

Question


I am trying to set up an alarm in a piece of weather software to look at a forecast for my area and tell me if the word "severe" appears in the upcoming forecast. I am looking at the following text file (shortened down a bit):

000
FPUS55 KBOU 301529
ZFPBOU

Zone Forecast Product for Northeast Colorado
National Weather Service Denver/Boulder CO
929 AM MDT Sat Jun 30 2018

COZ042-044-010615-
Northeast Weld County-Morgan County-
including Briggsdale, Grover, Pawnee Buttes, Raymer, Stoneham,
Brush, Fort Morgan, Goodrich, and Wiggins
929 AM MDT Sat Jun 30 2018

.REST OF TODAY...Chance of thunderstorms early in the afternoon.
Thunderstorms likely late in the afternoon. Some thunderstorms
may be severe with large hail. Highs 68 to 74. Northeast winds 10
to 15 mph with gusts to around 25 mph. Chance of thunderstorms 70
percent.
.TONIGHT...Mostly cloudy with a 30 percent chance of
thunderstorms in the evening, then mostly clear after midnight.
Some thunderstorms may be severe. Lows near 50. North winds 10 to
15 mph with gusts to around 25 mph in the evening becoming light.
.SUNDAY...Mostly sunny. Warmer. Highs in the 80s.
.SUNDAY NIGHT...Mostly clear. Lows in the mid to upper 50s. South
winds 10 to 15 mph.
.MONDAY...Mostly sunny. Highs near 90.
.MONDAY NIGHT AND TUESDAY...Partly cloudy with a 10 percent
chance of thunderstorms. Lows near 60. Highs in the lower to mid
90s.
.TUESDAY NIGHT AND Independence Day...Partly cloudy. Lows near
60. Highs in the 90s.
.WEDNESDAY NIGHT AND THURSDAY...Partly cloudy with a 10 percent
chance of thunderstorms. Lows near 60. Highs in the lower to mid
90s.
.THURSDAY NIGHT...Partly cloudy with a 30 percent chance of
thunderstorms. Lows near 60.
.FRIDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs in the lower to mid 90s.

$$

COZ048>051-010615-
Logan County-Washington County-Sedgwick County-Phillips County-
including Crook, Merino, Sterling, Peetz, Akron, Cope,
Last Chance, Otis, Julesburg, Ovid, Sedgwick, Amherst, Haxtun,
and Holyoke
929 AM MDT Sat Jun 30 2018

.REST OF TODAY...Chance of showers and slight chance of
thunderstorms early in the afternoon. Showers likely and chance
of thunderstorms late in the afternoon. Highs in the lower 70s.
North winds 10 to 20 mph. Chance of precipitation 60 percent.
.TONIGHT...Mostly cloudy with a 50 percent chance of
thunderstorms in the evening, then mostly clear after midnight.
Some thunderstorms may be severe. Lows in the lower to mid 50s.
North winds 10 to 15 mph with gusts to around 25 mph in the
evening becoming light.
.SUNDAY...Mostly sunny. Highs in the mid 80s.
.SUNDAY NIGHT...Mostly clear. Lows near 60. South winds 10 to
15 mph.
.MONDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs in the lower 90s. South winds 10 to 15 mph.
.MONDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows near 60.
.TUESDAY...Partly cloudy. Highs in the mid 90s.
.TUESDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows in the lower to mid 60s.
.INDEPENDENCE DAY...Partly cloudy. Highs in the mid 90s.
.WEDNESDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows in the lower to mid 60s.
.THURSDAY...Partly cloudy with a chance of rain showers and
slight chance of thunderstorms. Highs in the lower 90s. Chance of
precipitation 30 percent.
.THURSDAY NIGHT...Partly cloudy with a 30 percent chance of
thunderstorms. Lows in the lower to mid 60s.
.FRIDAY...Partly cloudy. Highs in the lower 90s.

$$

COZ046-010615-
North and Northeast Elbert County Below 6000 Feet/North Lincoln
County-
including Agate, Hugo, Limon, and Matheson
929 AM MDT Sat Jun 30 2018

.REST OF TODAY...Mostly cloudy. Chance of rain showers and slight
chance of thunderstorms early in the afternoon. Chance of
thunderstorms late in the afternoon. Some thunderstorms may be
severe late in the afternoon. Highs in the mid 70s. North winds
15 to 25 mph. Chance of precipitation 40 percent.
.TONIGHT...Mostly cloudy with a 50 percent chance of
thunderstorms in the evening, then partly cloudy after midnight.
Lows around 50. North winds 10 to 20 mph in the evening becoming
light.
.SUNDAY...Mostly sunny. Highs in the lower 80s. South winds 10 to
15 mph in the afternoon.
.SUNDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows in the mid to upper 50s. South winds 10 to
15 mph.
.MONDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs near 90. South winds 10 to 15 mph.
.MONDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows in the mid 50s to lower 60s.
.TUESDAY THROUGH INDEPENDENCE DAY...Partly cloudy. Highs in the
lower to mid 90s. Lows in the mid 50s to lower 60s.
.WEDNESDAY NIGHT...Mostly cloudy with a 20 percent chance of
thunderstorms. Lows near 60.
.THURSDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs around 90.
.THURSDAY NIGHT...Partly cloudy with a 30 percent chance of
thunderstorms. Lows near 60.
.FRIDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs in the upper 80s.

$$

So, I want to look inside the group for Washington County, which is the second section of the above forecast. The phrase "Washington County" will always appear in the heading for my county's section of the forecast, and "$$" will always conclude each section of the forecast. As an example, I have figured out that the RegEx expression

Washington County([\D\S]*?)\${2}

will find all of the text in my portion of the forecast. Then, specifically inside my county's portion of the forecast, I'm interested in the "TONIGHT" forecast period. I have figured out that the RegEx expression

\.TONIGHT[\D\S]*?(?=\s\.)

will find the "TONIGHT" forecast period for all of the forecast sections. And, of course, the RegEx expression

severe

will find all of the instances of "severe" throughout the file. Where I am having trouble is trying to put all three together and get a result only when the word "severe" occurs in the "TONIGHT" forecast period inside the "Washington County" forecast section. When I try putting these all together, I find that RegEx will find the words that I'm looking for, but it will reach out into adjacent forecast sections. Is there a way to make this only search between "Washington County" and the very next instance of "$$" to be sure that I don't spill over into the next forecast section and return a false positive?

Many thanks to anybody that can help me with this. I'm pretty new to RegEx, so I just don't have a good idea for how to limit down the area that I am searching.


Answer 1:


You can achieve what you want by using negative lookahead assertions.

For example,

Ab(?!c).

matches Ab followed by any character other than c

Ab((?!c).)+

matches Ab and then keeps matching any character until it hits a c
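
To see the two toy patterns in action, here is a minimal Python sketch. Python's re module is only an assumption here (the question does not say which engine the weather software uses), and the group is written as non-capturing, (?:...), which does not change what is matched.

```python
import re

# "Ab" followed by exactly one character that is not "c".
single = re.compile(r"Ab(?!c).")

# "Ab" followed by one or more characters, stopping just before the first "c".
run = re.compile(r"Ab(?:(?!c).)+")

print(single.search("Abd").group())     # Abd
print(single.search("Abc"))             # None
print(run.search("Abxyzc123").group())  # Abxyz
```

The lookahead is checked before every single character is consumed, which is what keeps the repeated group from running past the stop marker.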


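In your case, we want to keep matching until we hit the $$ on its own line at the end of the section. To do that, we can use Washington County((?!\R\$\$)[\s\S])+. The [\s\S] matches any character, including line breaks, and the (?!\R\$\$) stops the match just before a line break followed by $$. Note that \R (any line-break sequence) is available in PCRE-style engines such as PCRE, Perl, and Java; in an engine without it, \r?\n is a common substitute.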

Expanding that concept out a bit, you can come up with a final expression to match severe only in the .TONIGHT section of your text block.


Solution

Washington County((?!\R\$\$)[\s\S])+\R\.TONIGHT((?!\R\.)[\s\S])+severe

Explanation

Washington County((?!\R\$\$)[\s\S])+\R\.TONIGHT
Match everything in the Washington County block until we hit the TONIGHT section.

((?!\R\.)[\s\S])+
Keep matching from that point forward until we hit a linebreak followed by a period. That would signify that we're leaving the TONIGHT section. We need this part of the regex to limit the query to only matching in the TONIGHT section and not spilling over beyond it.

severe
Match "severe" in the TONIGHT section.
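
As a rough check of the full expression, here is a Python sketch. Several things are assumptions rather than part of the answer: Python's re module has no \R, so \r?\n stands in for it; the helper name severe_tonight is made up; and the sample text is a shortened, hypothetical forecast modeled on the one in the question.

```python
import re

# Assumption: Python's re has no \R escape, so \r?\n is used in its place.
NL = r"\r?\n"


def severe_tonight(forecast, county="Washington County"):
    """Return True if "severe" occurs in the .TONIGHT period of the county's section."""
    pattern = re.compile(
        re.escape(county)
        + rf"(?:(?!{NL}\$\$)[\s\S])+"  # stay inside the county's section
        + rf"{NL}\.TONIGHT"
        + rf"(?:(?!{NL}\.)[\s\S])+"    # stay inside the TONIGHT period
        + "severe"
    )
    return pattern.search(forecast) is not None


# Hypothetical, shortened sample: "severe" appears only in the OTHER section.
sample = """\
COZ048>051-010615-
Logan County-Washington County-
929 AM MDT Sat Jun 30 2018

.REST OF TODAY...Chance of showers.
.TONIGHT...Mostly cloudy. Lows in the lower to mid 50s.
.SUNDAY...Mostly sunny.

$$

COZ046-010615-
North Lincoln County-
929 AM MDT Sat Jun 30 2018

.REST OF TODAY...Mostly cloudy.
.TONIGHT...Some thunderstorms may be severe.
.SUNDAY...Mostly sunny.

$$
"""

# Same sample, but with "severe" moved into Washington County's TONIGHT period.
sample_hit = sample.replace(
    "Mostly cloudy. Lows", "Some thunderstorms may be severe. Lows"
)

print(severe_tonight(sample))      # False
print(severe_tonight(sample_hit))  # True
```

The first call returns False because neither repeated group is allowed to cross its stop marker, so the "severe" in the neighboring section is never reached.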




Answer 2:


You started this well, but when putting the pieces together you have to write two more regexes and replace

[Regex one for the city] [Regex two for the TONIGHT] [RegEx 3 for severe]

with

[Regex one for the city] [Plus one for Any but no city] [Regex two for the TONIGHT] [Plus One for Any but new section] [RegEx 3 for severe]

That's a start ...




Answer 3:


As a practical matter, you can separate this file into blocks separated by \n$$\n as a delimiter. Any of sed, awk, perl etc can do that and then a simple regex against the block will do what you wish.

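Example in awk (a multi-character RS is treated as a regex by gawk and mawk; POSIX leaves that behavior unspecified):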

awk 'BEGIN {RS="\n\\$\\$\n"} /Washington County/ && /severe/ {print $0}' file

That will print the entire block between the two $$ if that block contains both 'Washington County' and 'severe'.

If you wanted to only print the header of the section (the location) and the particular time with 'severe' in it, you can further subdivide into sections like so:

awk 'BEGIN {RS="\n\\$\\$\n"; FS="\n\\."} /Washington County/ && /severe/ {
     print $1; for (i=2;i<=NF;i++) if ($i ~ /severe/) print $i}' file

That prints:

COZ048>051-010615-
Logan County-Washington County-Sedgwick County-Phillips County-
including Crook, Merino, Sterling, Peetz, Akron, Cope,
Last Chance, Otis, Julesburg, Ovid, Sedgwick, Amherst, Haxtun,
and Holyoke
929 AM MDT Sat Jun 30 2018

TONIGHT...Mostly cloudy with a 50 percent chance of
thunderstorms in the evening, then mostly clear after midnight.
Some thunderstorms may be severe. Lows in the lower to mid 50s.
North winds 10 to 15 mph with gusts to around 25 mph in the
evening becoming light.
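
The same split-first idea can be sketched without awk. The Python below is only an illustration under stated assumptions: the helper name severe_periods is made up, the text is assumed to use plain \n line endings, and the sample is a shortened, hypothetical forecast rather than the real file.

```python
def severe_periods(forecast, county="Washington County"):
    """Return the forecast periods that mention "severe" in the county's section."""
    hits = []
    # Sections are separated by a line containing only "$$".
    for block in forecast.split("\n$$\n"):
        if county not in block:
            continue
        # The first piece is the section header; the rest are forecast periods,
        # each of which started with a "." at the beginning of a line.
        header, *periods = block.split("\n.")
        hits.extend(p for p in periods if "severe" in p)
    return hits


# Hypothetical, shortened sample modeled on the forecast in the question.
sample = (
    "COZ048>051-010615-\n"
    "Logan County-Washington County-\n"
    "\n"
    ".REST OF TODAY...Chance of showers.\n"
    ".TONIGHT...Mostly cloudy.\n"
    "Some thunderstorms may be severe.\n"
    ".SUNDAY...Mostly sunny.\n"
    "\n$$\n\n"
    "COZ046-010615-\n"
    "North Lincoln County-\n"
    "\n"
    ".REST OF TODAY...Some thunderstorms may be severe.\n"
    ".TONIGHT...Mostly cloudy.\n"
    "\n$$\n"
)

# Alarm condition from the question: "severe" in the TONIGHT period only.
print(any(p.startswith("TONIGHT...") for p in severe_periods(sample)))  # True
```

Because each section is isolated before any searching happens, a match cannot spill into a neighboring section, and the final check narrows the result to the TONIGHT period.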



Source: https://stackoverflow.com/questions/51117398/regex-find-a-word-inside-a-specific-section-of-a-file
