import sys
from lxml.html.clean import clean_html, Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True,
                          remove_tags=['a', 'li', 'td'])
        print(len(cleaner.clean_html(text)) - len(text))
        return cleaner.clean_html(text)
    except:
        print 'Error in clean_html'
        print sys.exc_info()
        return text
I put together the above (ugly) code as one of my initial forays into Python land. I'm trying to use the lxml Cleaner to clean out a couple of HTML pages, so that in the end I am just left with the text and nothing else. But try as I might, the above doesn't appear to work: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), and in particular links, which aren't getting removed despite the args I use in remove_tags and links=True.
Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go with HTML parsing in Python.
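For reference, a minimal sketch (with a made-up snippet) of the Cleaner options involved; going by the lxml docs, remove_tags only drops the tags themselves and keeps their text, while kill_tags is the option that drops the content as well:

from lxml.html.clean import Cleaner

snippet = '<div>Hello <a href="http://example.com">world</a></div>'  # made-up sample input
# remove_tags=['a'] drops the <a> element but keeps its text
print(Cleaner(remove_tags=['a']).clean_html(snippet))   # <div>Hello world</div>
# kill_tags=['a'] drops the <a> element together with its text
print(Cleaner(kill_tags=['a']).clean_html(snippet))     # <div>Hello </div>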
Not sure if this method existed around the time you asked your question, but if you go with:
import lxml.html

document = lxml.html.document_fromstring(html_text)
raw_text = document.text_content()
That should return you all the text content in the html document, minus all the markup.
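For example, a quick sketch with a made-up snippet:

import lxml.html

html_text = '<html><body><h1>Title</h1><p>Hello <a href="#">world</a></p></body></html>'  # made-up sample input
document = lxml.html.document_fromstring(html_text)
print(document.text_content())   # prints: TitleHello world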
The solution from David concatenates the text with no separator:
import lxml.html
document = lxml.html.document_fromstring(html_string)
# internally does: etree.XPath("string()")(document)
print document.text_content()
but this one helped me, concatenating the way I needed:
from lxml import etree
print "\n".join(etree.XPath("//text()")(document))
I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:
from BeautifulSoup import BeautifulSoup
''.join(BeautifulSoup(page).findAll(text=True))
Where page is your string of HTML.
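For example, a quick sketch with a made-up page (this uses the BeautifulSoup 3 import shown above; with BeautifulSoup 4 the import would be from bs4 import BeautifulSoup):

from BeautifulSoup import BeautifulSoup

page = '<html><body><p>Hello <a href="#">world</a></p></body></html>'  # made-up sample input
print(''.join(BeautifulSoup(page).findAll(text=True)))   # prints: Hello world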
Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.
Source: https://stackoverflow.com/questions/2950131/python-lxml-cleaning-out-html-tags