Question:

    import sys

    # Note: since lxml 5.2 this module is provided by the separate
    # lxml_html_clean package.
    from lxml.html.clean import clean_html, Cleaner

    def clean(text):
        try:
            cleaner = Cleaner(scripts=True, embedded=True, meta=True,
                              page_structure=True, links=True, style=True,
                              remove_tags=['a', 'li', 'td'])
            cleaned = cleaner.clean_html(text)
            # Show how many characters the cleaning step removed or added
            print(len(cleaned) - len(text))
            return cleaned
        except Exception:
            print('Error in clean_html')
            print(sys.exc_info())
            return text

I put together the above (ugly) code as one of my first forays into Python land. I'm trying to use the lxml Cleaner to clean up a couple of HTML pages, so that in the end I am left with just the text and nothing else. But try as I might, the above doesn't appear to work: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), particularly links, which aren't getting removed despite the arguments I pass in remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.

Answer 1:

Not sure if this method existed at the time you asked your question, but you can go through:

    import lxml.html

    document = lxml.html.document_fromstring(html_text)
    raw_text = document.text_content()

That should return all the text content in the HTML document, minus all the markup.
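One caveat worth illustrating: text_content() also returns the text inside script and style elements. A minimal sketch of one way to deal with that is below; the helper name html_to_text and the sample string are made up for illustration, and dropping the elements with drop_tree() first is just one possible approach.

```python
import lxml.html


def html_to_text(html_text):
    """Return the text of an HTML string without scripts, styles, or tags."""
    document = lxml.html.document_fromstring(html_text)
    # text_content() would otherwise include the text inside <script> and
    # <style>, so drop those elements first (their tail text is kept).
    for element in document.xpath("//script | //style"):
        element.drop_tree()
    return document.text_content()


# Hypothetical sample page, used only to exercise the helper
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello <a href='https://example.com'>world</a>!</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

The link text survives because only the a tag itself is markup; its text is part of the document's text content.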

Answer 2:

The solution from David concatenates the text with no separator:

    import lxml.html

    document = lxml.html.document_fromstring(html_string)
    # internally does: etree.XPath("string()")(document)
    print(document.text_content())

But this one helped me, because it joins the text nodes the way I needed:

    from lxml import etree

    print("\n".join(etree.XPath("//text()")(document)))
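Joining every text node can produce blank lines, since //text() also yields whitespace-only nodes. A possible refinement, sketched here with a hypothetical helper text_lines and a made-up sample, strips each piece and skips the empty ones. Note that //text() still includes script and style text if the page has any.

```python
import lxml.html


def text_lines(html_string):
    """Join the non-empty text nodes of an HTML string, one per line."""
    document = lxml.html.document_fromstring(html_string)
    # //text() yields every text node in document order, including
    # whitespace-only nodes between tags, so strip and filter them.
    pieces = (piece.strip() for piece in document.xpath("//text()"))
    return "\n".join(piece for piece in pieces if piece)


# Hypothetical sample, used only to exercise the helper
sample = "<html><body><h1>Title</h1><p>First <b>bold</b> line</p></body></html>"
print(text_lines(sample))
```

Because each text node becomes its own line, inline elements such as b split a sentence across lines; joining with a space instead of "\n" may read better for running text.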

Answer 3:

I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

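    # Beautiful Soup 4 (the bs4 package); the old BeautifulSoup 3 module
    # does not run on Python 3.
    from bs4 import BeautifulSoup

    ''.join(BeautifulSoup(page, 'html.parser').find_all(string=True))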

Where page is your string of html.
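For comparison, here is a small sketch of the same idea using get_text(), assuming the current bs4 package (Beautiful Soup 4) and Python's built-in html.parser. The helper name soup_text and the sample page are invented for illustration.

```python
from bs4 import BeautifulSoup


def soup_text(page, separator="\n"):
    """Return the text of an HTML string, one stripped fragment per line."""
    soup = BeautifulSoup(page, "html.parser")
    # strip=True trims each text fragment and skips the empty ones;
    # separator controls how the fragments are joined.
    return soup.get_text(separator=separator, strip=True)


# Hypothetical sample, used only to exercise the helper
page = "<html><body><h1>Title</h1><p>First <b>bold</b> line</p></body></html>"
print(soup_text(page))
```

Passing separator=" " gives a single line of running text instead of one fragment per line.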

Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.

Source: python [lxml] - cleaning out html tags

Tags

lxml

python