Regex to read between multiline tags?

Submitted by 萝らか妹 on 2021-02-05 08:39:23

Question


I have text in this form:
<text>
some text efdg
some text abcd
</text>

I am writing a regex to extract:
some text efdg
some text abcd

Since the text spans multiple lines, I am using <text>\n+ (^+?) \n+<text>, but it is not working. How can this be done?

I also tried r'^.*?', but that does not seem to work either.

My code is below. The input file is:

<doc>
<id1>123</id1>
<text>
abc
def
</text>
</doc>
<doc> <id1>1234</id1>
<text>
abcdd
defdd
</text>
</doc>

for line in f.read().split('</doc>\n'):

    tag = re.findall(r'<id1>\s*(.+)\s*</id1>',line)  
    print tag[0]
    texttag = re.findall(r'<text>\s*(.+)\s*</text>',line,re.MULTILINE)
    print texttag 

Answer 1:


import re

x = """<text>
some text efdg
some text abcd
</text> """

# [\s\S] matches any character, including newlines; keep only the non-empty lines.
print([i for i in re.findall(r"<text>([\s\S]*?)</text>", x)[0].split("\n") if i])

This captures everything between the tags, then splits the capture on newlines and drops the empty strings to get the individual lines.
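
The pattern in the question fails because `.` does not match a newline by default, and `re.MULTILINE` only changes where `^` and `$` anchor; it has no effect on `.`. The `[\s\S]` class used above sidesteps this, and the `re.DOTALL` flag is an equivalent way to get the same behaviour. Below is a minimal sketch that applies this to the sample input from the question, assuming Python 3. The helper name `extract_docs` and the constant names are hypothetical, chosen only for illustration.

```python
import re

SAMPLE = """<doc>
<id1>123</id1>
<text>
abc
def
</text>
</doc>
<doc> <id1>1234</id1>
<text>
abcdd
defdd
</text>
</doc>
"""

# re.DOTALL lets "." match newlines; the lazy ".*?" stops at the first closing tag.
DOC_RE = re.compile(r"<doc>(.*?)</doc>", re.DOTALL)
ID_RE = re.compile(r"<id1>\s*(.+?)\s*</id1>")
TEXT_RE = re.compile(r"<text>\s*(.*?)\s*</text>", re.DOTALL)


def extract_docs(data):
    """Return a list of (id, [lines]) tuples, one per <doc> block."""
    results = []
    for body in DOC_RE.findall(data):
        id_match = ID_RE.search(body)
        text_match = TEXT_RE.search(body)
        if not id_match or not text_match:
            continue  # skip blocks that lack an id or a text section
        lines = [ln for ln in text_match.group(1).split("\n") if ln]
        results.append((id_match.group(1), lines))
    return results


print(extract_docs(SAMPLE))
```

Matching whole `<doc>` blocks, rather than splitting on `'</doc>\n'`, avoids depending on the exact whitespace after each closing tag.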




Answer 2:


You could also do this with the BeautifulSoup parser instead of a regex.

>>> from bs4 import BeautifulSoup
>>> s = '''<doc>
<id1>123</id1>
<text>
abc
def
</text>
</doc>
<doc> <id1>1234</id1>
<text>
abcdd
defdd
</text>
</doc> '''
>>> soup = BeautifulSoup(s, 'html.parser')
>>> [i.text.strip() for i in soup.find_all('text')]
['abc\ndef', 'abcdd\ndefdd']


Source: https://stackoverflow.com/questions/29483584/regex-to-read-between-multiline-tags
