How to match a paragraph using regex

Submitted by 早过忘川 on 2019-11-29 00:28:33

You can split on double-newline like this:

paragraphs = re.split(r"\n\n", DATA)
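As a minimal sketch of the split approach: the sample text below is hypothetical (DATA in the answer stands for your own input), and the separator is loosened to \n\s*\n on the assumption that you want runs of blank lines, or blank lines containing spaces, to count as a single break.

```python
import re

# Hypothetical sample text; replace with your own input.
DATA = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\n\nThird paragraph.\n"

# A newline, optional whitespace, then another newline marks a paragraph break.
# Stripping first avoids an empty entry caused by the trailing newline.
paragraphs = [p for p in re.split(r"\n\s*\n", DATA.strip()) if p]

for p in paragraphs:
    print(repr(p))
```

With the plain r"\n\n" separator, the three consecutive newlines in this sample would leave a stray leading newline on the third paragraph.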

Edit: To capture the paragraphs as matches, so you can get their start and end points, do this:

for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print(match.start(), match.end())

# Prints:
# 0 214
# 215 298
# 299 589
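Here is a small self-contained sketch of the same idea on a hypothetical sample string, so the offsets differ from the ones printed above. The pattern is simplified slightly (no capture group, no (?s) flag) but matches the same text.

```python
import re

# Hypothetical sample text with three paragraphs.
DATA = "aaa\nbbb\n\nccc\n\nddd"

# One or more non-newline characters, each optionally followed by a single
# newline. A blank line cannot be consumed, so it ends the match.
PARAGRAPH = re.compile(r"(?:[^\n]\n?)+")

spans = [(m.start(), m.end()) for m in PARAGRAPH.finditer(DATA)]
for start, end in spans:
    print(start, end, repr(DATA[start:end]))
```

Each match includes the single newline that ends the paragraph's last line, which is why the next paragraph starts one position after the previous end.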

Splitting is one way; you can also match the paragraphs directly with a regular expression, like this:

paragraphs = re.findall(r'(.+?\n\n|.+?$)', TEXT, re.DOTALL)

The .+? is a lazy match: it matches the shortest substring that still lets the whole regex match. A greedy .+ would instead run on to the last blank line in the text and lump several paragraphs into one match.

So basically we want a sequence of characters (.+?) that ends at a blank line (\n\n) or at the end of the string ($). The re.DOTALL flag makes the dot match newlines as well, which is needed because a paragraph may span several lines with no blank line between them.
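To make the lazy versus greedy difference concrete, here is a short sketch on a hypothetical three-paragraph string. It uses re.findall so that every paragraph is returned, not just the first.

```python
import re

# Hypothetical sample text with three paragraphs.
TEXT = "aaa\nbbb\n\nccc\n\nddd"

# Lazy: each match stops at the first blank line it can reach.
lazy = re.findall(r"(.+?\n\n|.+?$)", TEXT, re.DOTALL)

# Greedy: the first match runs on to the last blank line in the text.
greedy = re.findall(r"(.+\n\n|.+$)", TEXT, re.DOTALL)

print(lazy)    # three items, one per paragraph
print(greedy)  # two items, the first spanning two paragraphs
```

Note that the lazy matches keep their trailing "\n\n"; call rstrip("\n") on each item if you want the bare paragraph text.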

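What is the newline symbol? Suppose it is '\r\n'. If you want to match the paragraphs starting with Lorem, you can do it like this:

pattern = re.compile(r'\r\nLorem.*\r\n')
text = '...'    # your source text
matchlist = re.findall(pattern, text)

matchlist will contain the paragraphs that start with Lorem. The same approach works for the other two words. Note that this pattern needs a '\r\n' in front of Lorem, so it misses a paragraph at the very start of the text, and because . does not match '\n' it captures only the first line of each paragraph.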
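A possible variant, sketched under the assumption that paragraphs are separated by exactly '\r\n\r\n': it anchors Lorem to the start of a paragraph (the start of the text, or right after a blank line) and captures the whole paragraph, not only its first line. The sample text is hypothetical.

```python
import re

# Hypothetical sample text using Windows-style line endings.
text = (
    "Lorem ipsum one\r\nstill one\r\n\r\n"
    "Dolor sit two\r\nLorem inside two\r\n\r\n"
    "Lorem ipsum three\r\n"
)

# Paragraph start: beginning of text, or directly after a blank line.
# Paragraph end: just before the next blank line, or the end of the text.
pattern = re.compile(r"(?:\A|(?<=\r\n\r\n))Lorem.*?(?=\r\n\r\n|\Z)", re.DOTALL)

# Drop the line ending that may remain on the final paragraph.
matchlist = [m.rstrip("\r\n") for m in pattern.findall(text)]
print(matchlist)
```

The line "Lorem inside two" is not reported, because it is not the first line of its paragraph.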

Try

^(.+?)\n\s*\n

or

^(.+?)\r\n\s*\r\n

Just don't forget to append an extra newline at the end of the text.
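The answer does not say which flags to use. In Python, a sketch like the following assumes re.MULTILINE (so ^ matches at the start of every line) and re.DOTALL (so .+? can span several lines). The sample text is hypothetical, and two newlines are appended so that the last paragraph is also followed by a blank line.

```python
import re

# Hypothetical sample; the second separator line contains a space.
text = "aaa\nbbb\n\nccc\n \nddd"

# Assumed flags: MULTILINE for ^, DOTALL so a paragraph may span lines.
pattern = re.compile(r"^(.+?)\n\s*\n", re.MULTILINE | re.DOTALL)

# Append newlines so the final paragraph also ends at a blank line.
paragraphs = pattern.findall(text + "\n\n")
print(paragraphs)
```

Without the appended newlines, the final paragraph "ddd" would not be matched at all.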

I tried to use the recommended regex with the default Java regex engine. It threw a StackOverflowError several times, so in the end I rewrote the regex and optimized it a little.

This works fine for me in Java:

(?s)(.*?[^\:\-\,])(?:$|\n{2,})

This also handles the end of the document when there are no trailing newlines, and it joins a line that ends with ':', '-' or ',' to the next paragraph.

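To keep trailing blanks (spaces or tabs) from breaking the behaviour described above, I strip them beforehand with the following regex:

(?m)[ \t]+$

In POSIX notation this class is written [[:blank:]]; java.util.regex does not support that bracket syntax and spells it \p{Blank}, which is equivalent to [ \t].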
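For readers working in Python, the two steps can be sketched as below. This is a port under assumptions, not the author's code: split_paragraphs is a hypothetical helper name, the sample text is made up, and the blank class is written as [ \t] because Python's re module has no POSIX bracket classes.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs; a line ending in ':', '-' or ','
    stays attached to the paragraph that follows it."""
    # Step 1: strip trailing spaces and tabs from every line.
    cleaned = re.sub(r"(?m)[ \t]+$", "", text)
    # Step 2: lazily match up to a character that is not ':', '-' or ','
    # and that is followed by the end of the text or by 2+ newlines.
    return re.findall(r"(?s)(.*?[^:\-,])(?:$|\n{2,})", cleaned)


# Hypothetical sample: note the trailing space after "Items:".
sample = "Items: \n\nfirst, second\n\nDone."
result = split_paragraphs(sample)
print(result)
```

If the trailing space after "Items:" were left in place, the first paragraph would be cut off right there, which is why the stripping step runs first.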