Parsing unclosed `<br>` tags with BeautifulSoup

Submitted by 元气小坏坏 on 2020-01-02 05:20:14

Question


BeautifulSoup has logic for closing consecutive <br> tags that doesn't do quite what I want it to do. For example,

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup('one<br>two<br>three<br>four')

The HTML would render as

one
two
three
four

I'd like to parse it into a list of strings, ['one','two','three','four']. BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the <br> elements.

>>> bs('br')
[<br>two<br>three<br>four</br></br></br>,
 <br>three<br>four</br></br>,
 <br>four</br>]

Is there a simple way to get the result I want?


Answer 1:


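import bs4 as bs

# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = bs.BeautifulSoup('one<br>two<br>three<br>four', 'html.parser')
# string=True matches every text node (older bs4 releases spell it text=True).
print(soup.find_all(string=True))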

yields

['one', 'two', 'three', 'four']

(Under Python 2 each string is shown with a u prefix, e.g. u'one'.)
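
A variation on the same idea, as a minimal sketch: `stripped_strings` trims each text node and skips whitespace-only ones, which helps when the markup contains newlines or indentation around the `<br>` tags. It assumes bs4 is installed; 'html.parser' is chosen only because it ships with Python, and the names `html` and `parts` are illustrative. Because it reads text nodes only, the way the parser nests the `<br>` tags should not affect the result.

```python
from bs4 import BeautifulSoup

# Same input as in the question, plus stray whitespace around the tags.
html = 'one <br>\n two<br>three<br> four\n'

soup = BeautifulSoup(html, 'html.parser')

# stripped_strings yields every text node with surrounding whitespace removed,
# skipping nodes that are empty after stripping.
parts = list(soup.stripped_strings)
print(parts)
```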

Or, using lxml:

import lxml.html as LH
doc = LH.fromstring('one<br>two<br>three<br>four')
print(list(doc.itertext()))

yields

['one', 'two', 'three', 'four']
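
Both snippets above return one string per text node, so a segment that contains inline markup (for example `<b>two</b> and more`) would come back as several items. If one string per `<br>`-separated segment is wanted, something along these lines might work. This is a sketch under assumptions, not part of the original answer: it swaps in the standard library's `html.parser.HTMLParser` in place of BeautifulSoup or lxml, and `BrSplitter` and `split_on_br` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Hypothetical helper: collect the text found between <br> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = ['']  # text of each segment; the last one is still open

    def handle_starttag(self, tag, attrs):
        # A <br> (or <br/>, routed here by the default handle_startendtag)
        # closes the current segment and opens a new one.
        if tag == 'br':
            self.parts.append('')

    def handle_data(self, data):
        # Text inside inline tags such as <b> is appended to the open segment.
        self.parts[-1] += data


def split_on_br(html):
    parser = BrSplitter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Trim each segment and drop the empty ones.
    return [part.strip() for part in parser.parts if part.strip()]


print(split_on_br('one<br><b>two</b> and more<br>three<br>four'))
```

For the input in the question, `split_on_br('one<br>two<br>three<br>four')` gives the same four-item list as the answers above.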


Source: https://stackoverflow.com/questions/13481408/parsing-unclosed-br-tags-with-beautifulsoup
