Is it possible to use bs4 soup object with lxml?

Submitted by 二次信任 on 2021-01-29 09:35:44

Question


I am trying to use both BS4 and lxml. Rather than parsing the HTML page twice, is there any way to use the soup object in lxml, or vice versa?

self.soup = BeautifulSoup(open(path), "html.parser")

I tried using this object with lxml like this:

 doc = html.fromstring(self.soup)

This throws the error TypeError: expected string or bytes-like object

Is there any way to make this kind of usage work?


Answer 1:


I don't think there is a way without going through a string object.

from bs4 import BeautifulSoup
import lxml.html

html = """
<html><body>
<div>
<p>test</p>
</div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

# Soup to lxml.html: serialize the soup to a string, then re-parse it with lxml
doc = lxml.html.fromstring(soup.prettify())
print(type(doc))
print(lxml.html.tostring(doc))

# lxml.html to soup: serialize the lxml tree to bytes, then re-parse it with bs4
soup = BeautifulSoup(lxml.html.tostring(doc), 'html.parser')
print(type(soup))
print(soup.prettify())

Outputs:

<class 'lxml.html.HtmlElement'>
b'<html>\n <body>\n  <div>\n   <p>\n    test\n   </p>\n  </div>\n </body>\n</html>'
<class 'bs4.BeautifulSoup'>
<html>
 <body>
  <div>
   <p>
    test
   </p>
  </div>
 </body>
</html>
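
One detail worth noting about the round trip above: prettify() re-indents the markup, so the text nodes that lxml sees pick up extra whitespace, as the output shows. The following is a minimal sketch, assuming that exact text content matters, which compares prettify() with str(soup). The names SAMPLE_HTML and p_text_via are made up for this illustration and are not part of either library.

```python
from bs4 import BeautifulSoup
import lxml.html

# Hypothetical sample document, written without extra whitespace.
SAMPLE_HTML = "<html><body><div><p>test</p></div></body></html>"


def p_text_via(serialize):
    """Parse with bs4, serialize the soup, re-parse with lxml, return the <p> text."""
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    doc = lxml.html.fromstring(serialize(soup))
    return doc.xpath("//p")[0].text


# str(soup) hands the markup over without re-indenting it.
exact_text = p_text_via(str)
# prettify() puts each tag and text node on its own indented line first.
pretty_text = p_text_via(lambda s: s.prettify())

print(repr(exact_text))
print(repr(pretty_text))
```

If the whitespace matters, str(soup) is probably the safer choice for the hand-off; otherwise calling strip() on the text, as the later snippet does, works around it.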

Updated in response to comment:

You can use lxml.etree to iterate through the doc object:

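# Soup to lxml.etree
from lxml import etree

# etree.fromstring() is an XML parser, so the markup must be well-formed
doc = etree.fromstring(soup.prettify())
for element in doc.iter():
    # element.text is None when an element has no leading text
    print("%s - %s" % (element.tag, (element.text or "").strip()))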
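
As a possible alternative to serializing to a string, lxml's documentation describes a module, lxml.html.soupparser, whose convert_tree() function builds lxml elements from an existing BeautifulSoup tree. The sketch below assumes that function behaves as documented and returns a list of top-level elements. Because the exact layout of that list is not assumed here, the code searches every returned subtree instead of indexing into it.

```python
from bs4 import BeautifulSoup
from lxml.html import soupparser

soup = BeautifulSoup(
    "<html><body><div><p>test</p></div></body></html>", "html.parser"
)

# convert_tree() walks the existing soup and builds lxml elements from it.
# It returns a list of elements rather than a single root element.
roots = soupparser.convert_tree(soup)

# Look for <p> in every returned subtree; the list layout is not assumed.
paragraphs = [p for root in roots for p in root.iter("p")]
for p in paragraphs:
    print(p.tag, p.text)
```

This still walks the whole document a second time to build the lxml tree, so it skips the serialize-and-reparse step rather than sharing one tree between the two libraries.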


Source: https://stackoverflow.com/questions/52316737/is-it-possible-to-use-bs4-soup-object-with-lxml