Python [lxml] - cleaning out HTML tags

梦谈多话 2020-12-06 05:15
from lxml.html.clean import clean_html, Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_


        
3 Answers
  • 2020-12-06 05:52

    I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

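    # Beautiful Soup 4 import; the older `from BeautifulSoup import BeautifulSoup`
    # form only works with Beautiful Soup 3 on Python 2.
    from bs4 import BeautifulSoup
    
    # Collect every text node in the parsed page and join them with no separator.
    ''.join(BeautifulSoup(page, "html.parser").find_all(string=True))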
    

    Where page is your string of html.

    Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.

  • 2020-12-06 06:01

    The solution from David concatenates the text with no separator:

       import lxml.html
       # html_string is your HTML source as a string
       document = lxml.html.document_fromstring(html_string)
       # internally does: etree.XPath("string()")(document)
       print(document.text_content())
    

    But this one helped me, because it joins the text nodes with the separator I needed:

       from lxml import etree
       # "//text()" returns each text node separately, so any separator can be used
       print("\n".join(etree.XPath("//text()")(document)))
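To make the difference between the two approaches concrete, here is a minimal sketch that runs both on the same input. The sample markup and the names `flat`, `pieces`, and `joined` are made up for illustration; dropping whitespace-only nodes is an extra step that the answer above does not include.

```python
import lxml.html

# Hypothetical sample input, used only for illustration.
html_string = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p><p>Second.</p></body></html>"
)
document = lxml.html.document_fromstring(html_string)

# text_content() concatenates all text nodes with no separator.
flat = document.text_content()

# xpath("//text()") yields the text nodes one by one, so they can be
# stripped, filtered, and joined with any separator.
pieces = [t.strip() for t in document.xpath("//text()") if t.strip()]
joined = "\n".join(pieces)

print(repr(flat))
print(joined)
```

Note that joining on `//text()` also puts inline fragments such as the contents of `<b>` on their own lines, which may or may not be what you want.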
    
  • 2020-12-06 06:02

    I'm not sure whether this method existed when you asked your question, but you can use:

    document = lxml.html.document_fromstring(html_text)
    raw_text = document.text_content()
    

    That should return all the text content in the HTML document, minus all the markup.
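One caveat: `text_content()` also includes the text inside `<script>` and `<style>` elements, since those are ordinary text nodes as far as the tree is concerned. A minimal sketch of one way to drop them first, using `etree.strip_elements` rather than the `Cleaner` from the question; the helper name `html_to_text` and the sample markup are hypothetical:

```python
import lxml.html
from lxml import etree


def html_to_text(html_text):
    """Return the text of an HTML document without script/style contents."""
    document = lxml.html.document_fromstring(html_text)
    # Remove <script> and <style> elements together with their contents;
    # with_tail=False keeps any text that directly follows them.
    etree.strip_elements(document, "script", "style", with_tail=False)
    return document.text_content()


# Hypothetical sample input, used only for illustration.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>"
)
print(html_to_text(sample))
```

The text is still concatenated with no separator, so combine this with the `//text()` approach if you need line breaks between fragments.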
