python [lxml] - cleaning out html tags

Anonymous (unverified), submitted 2019-12-03 01:20:02

Question:

    import sys

    from lxml.html.clean import Cleaner

    def clean(text):
        try:
            cleaner = Cleaner(scripts=True, embedded=True, meta=True,
                              page_structure=True, links=True, style=True,
                              remove_tags=['a', 'li', 'td'])
            cleaned = cleaner.clean_html(text)
            # Report how many characters the cleaning step removed or added
            print(len(cleaned) - len(text))
            return cleaned
        except Exception:
            print('Error in clean_html')
            print(sys.exc_info())
            return text

I put together the above (ugly) code as one of my first forays into Python land. I'm trying to use the lxml Cleaner to clean a couple of HTML pages so that in the end I'm left with just the text and nothing else. Try as I might, the above doesn't work as expected: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), particularly links, which aren't getting removed despite the arguments I pass in remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.
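For readers comparing the Cleaner options: per the lxml documentation, links=True targets <link> elements (not <a> anchors), remove_tags drops only the tags themselves and pulls their content up into the parent, and kill_tags drops the whole subtree. The sketch below illustrates the "unwrap" versus "drop the subtree" distinction with lxml's core etree.strip_tags and etree.strip_elements, used here as stand-ins for the Cleaner options. The helper names unwrap_tags and drop_subtrees and the SAMPLE string are made up for illustration.

```python
import lxml.html
from lxml import etree

# Hypothetical sample markup, not taken from the question.
SAMPLE = '<div><p>Read <a href="/x">the docs</a> now.</p><script>var a = 1;</script></div>'


def unwrap_tags(html_text, *tags):
    """Remove the named tags but keep their text and children in place."""
    root = lxml.html.fragment_fromstring(html_text)
    etree.strip_tags(root, *tags)
    return lxml.html.tostring(root, encoding="unicode")


def drop_subtrees(html_text, *tags):
    """Remove the named elements together with everything inside them."""
    root = lxml.html.fragment_fromstring(html_text)
    # with_tail=False keeps the text that follows each removed element
    etree.strip_elements(root, *tags, with_tail=False)
    return lxml.html.tostring(root, encoding="unicode")


print(unwrap_tags(SAMPLE, "a"))              # link text survives, <a> tag is gone
print(drop_subtrees(SAMPLE, "a", "script"))  # link text and script body are gone
```

If the goal is plain text only, the second behaviour is usually what you want for scripts and styles, while the first is what you want for anchors whose visible text should be kept.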

Answer 1:

I'm not sure whether this method existed when you asked your question, but you can go through

    import lxml.html

    document = lxml.html.document_fromstring(html_text)
    raw_text = document.text_content()

That should return all the text content in the HTML document, minus all the markup.
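One caveat: text_content() also returns the text inside <script> and <style> elements, since that is text content too. A minimal sketch that drops those subtrees first, assuming you only want the visible text (the function name html_to_text and the sample string are illustrative):

```python
import lxml.html
from lxml import etree


def html_to_text(html_text):
    """Return the document's text with script and style bodies left out."""
    document = lxml.html.document_fromstring(html_text)
    # Remove the elements and their contents; with_tail=False keeps
    # any text that follows them in the parent element.
    etree.strip_elements(document, "script", "style", with_tail=False)
    return document.text_content()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>alert(1)</script></body></html>"
)
print(html_to_text(sample))
```

Note that document_fromstring raises an error on an empty string, so you may want to guard against empty input before calling it.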



Answer 2:

The solution from David concatenates the text with no separator:

    import lxml.html

    document = lxml.html.document_fromstring(html_string)
    # internally does: etree.XPath("string()")(document)
    print(document.text_content())

but this one worked for me, concatenating the text the way I needed:

    from lxml import etree

    print("\n".join(etree.XPath("//text()")(document)))
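Joining every text node this way also emits whitespace-only nodes and script or style bodies. A small variation that filters those out, as a sketch (the function name text_lines and the sample markup are made up for illustration):

```python
import lxml.html


def text_lines(html_text):
    """Join the non-empty text nodes with newlines, skipping script/style."""
    document = lxml.html.document_fromstring(html_text)
    nodes = document.xpath("//text()[not(ancestor::script or ancestor::style)]")
    # Strip each chunk and drop the ones that were only whitespace
    chunks = (node.strip() for node in nodes)
    return "\n".join(chunk for chunk in chunks if chunk)


sample = (
    "<html><body><h1>Title</h1>\n"
    "<p>First <b>bold</b> paragraph.</p>"
    "<script>var x;</script></body></html>"
)
print(text_lines(sample))
```

Keep in mind that every text node gets its own line, so inline elements such as <b> split a sentence across several lines. If that is not what you want, join with a space instead of a newline.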


Answer 3:

I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

    # BeautifulSoup 3 import; with BeautifulSoup 4 use: from bs4 import BeautifulSoup
    from BeautifulSoup import BeautifulSoup

    ''.join(BeautifulSoup(page).findAll(text=True))

Where page is your string of html.

Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.


