How to match a paragraph using regex

試著忘記壹切 posted on 2019-11-27 14:25:29

Question


I have been struggling with Python regex for a while, trying to match paragraphs within a text, without success. I need to obtain the start and end positions of the paragraphs.

An example of a text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. 

Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.

Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.

In this example case, I would want to separately match all the paragraphs starting with Lorem, Stet and Ipsum respectively (without the empty lines). Does anyone have any idea how to do this?


Answer 1:


You can split on double-newline like this:

paragraphs = re.split(r"\n\n", DATA)

Edit: To capture the paragraphs as matches, so you can get their start and end points, do this:

for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print(match.start(), match.end())

# Prints:
# 0 214
# 215 298
# 299 589
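
If the trailing newline should stay out of the reported span, and the text may use either \n or \r\n line endings, the same finditer idea can be sketched roughly as below. The name paragraph_spans and the pattern are illustrative assumptions, not part of the original answer, and a "blank line" is assumed to be completely empty (no spaces or tabs).

```python
import re

# A paragraph is a run of non-empty lines joined by single line breaks.
# Assumption: blank lines contain no spaces or tabs.
PARAGRAPH_RE = re.compile(r'[^\r\n]+(?:\r?\n[^\r\n]+)*')

def paragraph_spans(text):
    """Return (start, end) offsets of each paragraph, blank lines excluded."""
    return [(m.start(), m.end()) for m in PARAGRAPH_RE.finditer(text)]

sample = "first line\nsecond line\n\nnext paragraph\n"
for start, end in paragraph_spans(sample):
    # Slicing with the offsets gives back the paragraph text.
    print(start, end, repr(sample[start:end]))
```

Because the offsets come straight from the match objects, sample[start:end] always gives back exactly the paragraph text.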



Answer 2:


Using split is one way; you can also do it with a regular expression match, like this:

paragraphs = re.findall(r'(.+?\n\n|.+?$)', TEXT, re.DOTALL)

The .+? is a lazy match: it matches the shortest substring that still lets the whole regex match. A greedy .+ would instead run as far as it can, up to the last blank line in the text.

So basically we want a sequence of characters (.+?) that ends with a blank line (\n\n) or with the end of the string ($). The re.DOTALL flag makes the dot match newlines too, because a paragraph may span several lines with no blank line in between. Note that re.findall returns every paragraph, whereas re.search would return only the first one, and that each match ending in \n\n includes those two newline characters.
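
To make the lazy versus greedy difference concrete, here is a small sketch on a made-up three-paragraph string (the sample text is an assumption for illustration only):

```python
import re

text = "one\n\ntwo\n\nthree"

# Lazy: each alternative stops at the first blank line (or end of string).
lazy = re.findall(r'(.+?\n\n|.+?$)', text, re.DOTALL)

# Greedy: .+ grabs as much as possible, so it runs to the last blank line.
greedy = re.findall(r'(.+\n\n|.+$)', text, re.DOTALL)

print(lazy)    # ['one\n\n', 'two\n\n', 'three']
print(greedy)  # ['one\n\ntwo\n\n', 'three']
```

If the trailing newlines are unwanted, calling .rstrip('\n') on each match is probably the simplest cleanup.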




Answer 3:


What is the newline symbol? Supposing it is '\r\n', if you want to match the paragraphs starting with Lorem, you can do something like this:

import re
# From a line that starts with "Lorem" up to the next blank line or the end of the text.
pattern = re.compile(r'^Lorem.*?(?=\r\n\r\n|\Z)', re.DOTALL | re.MULTILINE)
text = '...'    # your source text
matchlist = pattern.findall(text)

matchlist will contain all the paragraphs that start with Lorem. The other two words work the same way.
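
If the line-ending style is not known in advance, one possible sketch accepts both \n and \r\n and only matches the word at the very start of a paragraph. The helper name paragraphs_starting_with is hypothetical, and blank lines are assumed to be completely empty.

```python
import re

def paragraphs_starting_with(word, text):
    """Return the paragraphs of `text` that begin with `word` (hypothetical helper)."""
    pattern = re.compile(
        r'(?:\A|(?<=\n\n)|(?<=\n\r\n))'   # start of text, or right after a blank line
        + re.escape(word) + r'\b'         # the literal word, as a whole word
        + r'.*?(?=(?:\r?\n){2}|\s*\Z)',   # stop before a blank line or the end
        re.DOTALL,
    )
    # No capturing groups, so findall returns the full matches.
    return pattern.findall(text)

crlf_text = "Lorem a\r\nLorem b\r\n\r\nStet c\r\nLorem x\r\n\r\nLorem d\r\n"
print(paragraphs_starting_with("Lorem", crlf_text))
```

A Lorem that merely starts a line inside another paragraph (like "Lorem x" above) is skipped, because it is not preceded by a blank line.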




Answer 4:


Try

^(.+?)\n\s*\n

or

^(.+?)\r\n\s*\r\n

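Just make sure the text ends with a blank line (append extra newlines if needed); otherwise the last paragraph has nothing to match \n\s*\n against and is skipped. In Python these patterns also need re.DOTALL, so that . can span the line breaks inside a paragraph, and re.MULTILINE, so that ^ matches at the start of every paragraph and not only the first.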
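
A minimal Python sketch of the first pattern, assuming \n line endings; the helper find_paragraphs and the padding step are illustrative choices, not part of the answer:

```python
import re

# re.DOTALL lets . cross line breaks; re.MULTILINE lets ^ match at each line start.
PARAGRAPH = re.compile(r'^(.+?)\n\s*\n', re.DOTALL | re.MULTILINE)

def find_paragraphs(text):
    """Return paragraph texts, padding the input so the last one matches too."""
    padded = text.rstrip('\n') + '\n\n'   # guarantee a trailing blank line
    return [m.group(1) for m in PARAGRAPH.finditer(padded)]

print(find_paragraphs("A\nB\n\nC\nD"))   # ['A\nB', 'C\nD']
```

Note that once the input is padded, match offsets beyond the original length refer to the padding, so use m.start(1) and m.end(1) if positions are needed.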




Answer 5:


I tried to use the recommended regex with the default Java regex engine. It threw a StackOverflowError several times, so in the end I rewrote the regex and optimized it a little more.

So this is working fine for me in Java:

(?s)(.*?[^\:\-\,])(?:$|\n{2,})

This also handles the end of the document without trailing newlines, and it tries to join a line that ends with ':', '-' or ',' to the next paragraph.

To keep trailing blanks (spaces or tabs) from breaking that feature, I strip them beforehand with the following regex (in java.util.regex the POSIX blank class is written \p{Blank}; the bracket form [[:blank:]] is not supported there):

(?m)\p{Blank}+$
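
For readers following along in Python rather than Java, the same two-step idea can be sketched as below. This is a port under assumptions: the function name split_paragraphs is hypothetical, [ \t] stands in for the blank class, and stripping leading and trailing newlines is an extra step not mentioned in the answer.

```python
import re

def split_paragraphs(text):
    """Split into paragraphs, gluing a paragraph ending in ':', '-' or ',' to the next."""
    # Step 1: strip trailing spaces and tabs from every line.
    cleaned = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    # Assumption (not in the original answer): drop leading/trailing newlines
    # so no stray newline-only match appears at the end.
    cleaned = cleaned.strip('\n')
    # Step 2: a paragraph must end with a character other than ':', '-' or ','.
    pattern = re.compile(r'(.*?[^:\-,])(?:$|\n{2,})', re.DOTALL)
    return [m.group(1) for m in pattern.finditer(cleaned)]

print(split_paragraphs("Items: \n\nfirst\n\nDone."))
# ['Items:\n\nfirst', 'Done.']
```

The "Items:" line is joined with the following paragraph because a match is not allowed to end on a colon.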


Source: https://stackoverflow.com/questions/18568105/how-match-a-paragraph-using-regex
