Parsing HTML documents using lxml in Python

Question

I just downloaded lxml to parse broken HTML documents. I read through the lxml documentation but could not find how to retrieve just the text of a given HTML document. I would be grateful if someone could help me with this.

Answer 1:

It's very simple:

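from lxml import html

html_document = ...  # Get your document contents here, from a file or wherever

# Parse the (possibly broken) HTML into an element tree
tree = html.fromstring(html_document)
# Collect the text of the root element and all of its descendants
text_document = tree.text_content()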
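
As a self-contained illustration of the snippet above, here is a minimal sketch that feeds lxml a deliberately broken document. The sample markup and the variable names `broken_html` and `text` are made up for this example; only `html.fromstring` and `text_content()` come from the answer.

```python
from lxml import html

# Hypothetical sample input: the <p> tags are never closed and the
# closing </body> and </html> tags are missing.
broken_html = "<html><body><h1>Title</h1><p>First paragraph<p>Second paragraph"

# lxml's HTML parser is lenient and repairs the structure while parsing.
tree = html.fromstring(broken_html)

# text_content() returns the text of the element and its descendants,
# with all markup removed.
text = tree.text_content()
print(text)
```

Note that the text of adjacent elements is joined without any added separator, so you may want to extract text element by element if you need line breaks between blocks.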

If you only want the content of specific blocks (e.g. the body block), you can select them with XPath expressions:

body_tags = tree.xpath('//body')
if body_tags:
    body = body_tags[0]
    text_document = body.text_content()
else:
    text_document = ''
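
The same pattern works for any XPath expression, not just `//body`. The sketch below wraps it in a small helper; the function name `text_of_first`, the sample markup, and the `id` values are hypothetical and only serve to illustrate the idea. It assumes the expression selects elements rather than text nodes or attributes.

```python
from lxml import html


def text_of_first(tree, xpath_expr):
    """Return the text of the first element matching xpath_expr, or ''.

    Hypothetical helper: assumes xpath_expr selects elements.
    """
    matches = tree.xpath(xpath_expr)
    return matches[0].text_content() if matches else ''


# Made-up sample document for demonstration purposes.
sample = (
    "<html><head><title>Demo</title></head><body>"
    "<div id='main'><p>Hello <b>world</b></p></div>"
    "<div id='footer'>Footer text</div>"
    "</body></html>"
)
tree = html.fromstring(sample)

main_text = text_of_first(tree, "//div[@id='main']")        # 'Hello world'
missing_text = text_of_first(tree, "//div[@id='sidebar']")  # '' (no match)
print(repr(main_text), repr(missing_text))
```

Returning an empty string when nothing matches mirrors the `else` branch of the answer's code, so callers do not need to handle an empty result list themselves.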

Source: https://stackoverflow.com/questions/12073781/parsing-html-documents-using-lxml-in-python

Tags

python

lxml
