Python [lxml] - cleaning out HTML tags

梦谈多话 2020-12-06 05:15
from lxml.html.clean import clean_html, Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_


        
3 Answers
  •  一生所求
    2020-12-06 05:52

    I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

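    # Beautiful Soup 4 (the bs4 package); the older "from BeautifulSoup import ..." form is Beautiful Soup 3
    from bs4 import BeautifulSoup

    # Collect every text node and join them, which drops all tags
    ''.join(BeautifulSoup(page, 'html.parser').find_all(string=True))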
    

    Here, page is the string containing your HTML.
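
    If installing a third-party parser is not an option, the same idea (keep the text nodes, discard the tags) can be sketched with the standard library's html.parser module. This is a swapped-in technique, not Beautiful Soup or lxml; the names TextExtractor and strip_tags are hypothetical, and skipping the contents of script and style elements is an assumption about what "cleaning" should mean here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: their contents are not wanted

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(page):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

    Calling strip_tags(page) returns a plain string. Unlike the one-liner above, this sketch also drops inline JavaScript and CSS, which otherwise show up as text.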

    Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.
