Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  暖寄归人
    2020-11-22 04:16

    If you need more speed and can accept less accuracy, you could use raw lxml.

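    import lxml.html as lh
    # Note: on lxml 5.2+ the cleaner lives in the separate lxml_html_clean
    # package, which must be installed for this import to succeed.
    from lxml.html.clean import clean_html

    def lxml_to_text(html):
        # Parse the markup into an element tree.
        doc = lh.fromstring(html)
        # Strip scripts, comments and other potentially unsafe markup.
        doc = clean_html(doc)
        # Concatenate all remaining text nodes into one string.
        return doc.text_content()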
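
If lxml is not installed, roughly the same "fast but not very accurate" idea can be sketched with only the standard library. This is a swapped-in technique (`html.parser`), not the lxml approach above, and the names `_TextExtractor` and `stdlib_html_to_text` are made up for illustration. It assumes you only want to drop `<script>` and `<style>` contents and collapse whitespace; it makes no attempt to reproduce browser layout.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def stdlib_html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result resembles pasted text.
    return " ".join("".join(parser.parts).split())
```

Like `text_content()`, this simply concatenates text nodes, so adjacent block elements with no whitespace between them in the source will run together.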
    
