Question
I have text in this form:
<text>
some text efdg
some text abcd
</text>
I am writing a regex to extract:
some text efdg
some text abcd
Since the text spans multiple lines, I am using <text>\n+ (^+?) \n+<text>, but it is not working. How can this be done?
I also tried r'^.*?', but that doesn't seem to work either.
Here is my input file, followed by my code:
<doc>
<id1>123</id1>
<text>
abc
def
</text></doc><doc>
<id1>1234</id1>
<text>
abcdd
defdd
</text></doc>
for line in f.read().split('</doc>\n'):
    tag = re.findall(r'<id1>\s*(.+)\s*</id1>',line)
    print tag[0]
    texttag = re.findall(r'<text>\s*(.+)\s*</text>',line,re.MULTILINE)
    print texttag
Answer 1:
import re

x = """<text>
some text efdg
some text abcd
</text> """
# [\s\S] matches any character, newlines included; *? stops at the first </text>
print([i for i in re.findall(r"<text>([\s\S]*?)</text>", x)[0].split("\n") if i])
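You can capture everything between the <text> and </text> tags and then split on newlines, dropping empty strings, to get your result. Your pattern fails because . does not match a newline by default, and re.MULTILINE only changes how ^ and $ behave; it does not make . span lines. Using [\s\S] (or the re.DOTALL flag) gets around that.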
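To apply the same idea to the full input file from the question, here is a minimal sketch. It assumes the file contents are already in a string (called `data` here, a name chosen for illustration), and it uses `re.DOTALL` so that `.` matches newlines, as an alternative to `[\s\S]`. The `extract_docs` helper is hypothetical, not something from the question.

```python
import re

# Sample input copied from the question; in practice this would come from f.read().
data = """<doc>
<id1>123</id1>
<text>
abc
def
</text></doc><doc>
<id1>1234</id1>
<text>
abcdd
defdd
</text></doc>
"""

def extract_docs(content):
    """Return a list of (id, lines) pairs, one per <doc> block (hypothetical helper)."""
    results = []
    # re.DOTALL lets '.' match newlines; '.*?' is non-greedy, so each match
    # stops at the first closing tag instead of running on to the last one.
    for doc in re.findall(r"<doc>(.*?)</doc>", content, re.DOTALL):
        doc_id = re.search(r"<id1>\s*(.+?)\s*</id1>", doc)
        text = re.search(r"<text>\s*(.*?)\s*</text>", doc, re.DOTALL)
        if doc_id and text:
            results.append((doc_id.group(1), text.group(1).split("\n")))
    return results

for doc_id, lines in extract_docs(data):
    print(doc_id, lines)
```

Matching on `<doc>...</doc>` directly, rather than splitting the file on `'</doc>\n'`, keeps each id paired with its text even when consecutive blocks share a line, as in `</text></doc><doc>`. This only holds up for simple, well-formed input like the sample.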
Answer 2:
You could achieve this simply with the BeautifulSoup parser.
>>> from bs4 import BeautifulSoup
>>> s = '''<doc>
<id1>123</id1>
<text>
abc
def
</text>
</doc>
<doc> <id1>1234</id1>
<text>
abcdd
defdd
</text>
</doc> '''
>>> soup = BeautifulSoup(s, 'html.parser')
>>> [i.text.strip() for i in soup.find_all('text')]
['abc\ndef', 'abcdd\ndefdd']
Source: https://stackoverflow.com/questions/29483584/regex-to-read-between-multiline-tags