python [lxml] - cleaning out html tags

Submitted by 别来无恙 on 2019-11-27 22:59:31

I'm not sure whether this method existed when you asked your question, but you can go through:

import lxml.html
document = lxml.html.document_fromstring(html_text)
raw_text = document.text_content()

That should return all the text content of the HTML document, minus all the markup.
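
As a minimal sketch of the approach above: text_content() collects every descendant text node, which can include the bodies of script and style elements, so this example drops those elements first. The sample html_text value is made up for illustration, and removing script/style is an assumption about what "cleaning" should mean here, not part of the original answer.

```python
import lxml.html

# Hypothetical sample input, for illustration only.
html_text = """<html><head><style>p { color: red; }</style></head>
<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"""

document = lxml.html.document_fromstring(html_text)

# Drop <script> and <style> elements so their source code
# does not end up mixed into the extracted text.
for element in document.xpath("//script | //style"):
    element.drop_tree()

# text_content() returns the remaining text with all markup removed.
raw_text = document.text_content().strip()
print(raw_text)
```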

Robert Lujo

The solution from David concatenates the text with no separator:

   import lxml.html
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())

But this one helped me, because it joins the text the way I needed:

   from lxml import etree
   # //text() yields each text node separately, so any separator can be used
   print("\n".join(etree.XPath("//text()")(document)))
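
To make the difference between the two approaches concrete, here is a small sketch on a made-up html_string. Stripping each text node and skipping whitespace-only ones is an extra step assumed here for tidier output; it is not part of the answer above.

```python
import lxml.html

# Hypothetical sample input, for illustration only.
html_string = "<html><body><h1>Title</h1><p>First <b>bold</b> line.</p></body></html>"
document = lxml.html.document_fromstring(html_string)

# text_content() glues adjacent text nodes together with no separator.
glued = document.text_content()
print(glued)

# //text() returns the text nodes one by one, so they can be
# stripped, filtered, and joined with whatever separator is needed.
chunks = [t.strip() for t in document.xpath("//text()") if t.strip()]
separated = "\n".join(chunks)
print(separated)
```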

KushalP

I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

# bs4 (Beautiful Soup 4) is the maintained successor to the old BeautifulSoup package
from bs4 import BeautifulSoup

''.join(BeautifulSoup(page, "html.parser").find_all(string=True))

Here page is your HTML string.
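
A runnable sketch of the same idea, assuming the bs4 package (Beautiful Soup 4) and Python's built-in "html.parser" backend; the sample page value is made up. It also shows get_text(), which may be more convenient when a separator between text nodes is wanted.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, for illustration only.
page = "<html><body><h1>Title</h1><p>First <b>bold</b> line.</p></body></html>"
soup = BeautifulSoup(page, "html.parser")

# Collect every text node and join them with no separator.
joined = "".join(soup.find_all(string=True))
print(joined)

# get_text() does the collecting and joining in one call;
# strip=True trims each piece and skips whitespace-only ones.
separated = soup.get_text(separator="\n", strip=True)
print(separated)
```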

Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.
