Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  说谎 (original poster)
    2020-11-22 04:28

    There is the Pattern library for data mining; its pattern.web module can convert HTML to plain text.

    http://www.clips.ua.ac.be/pages/pattern-web

    You can even decide what tags to keep:

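    from pattern.web import URL, plaintext

    # Download the page, then strip the markup, keeping only the listed tags and attributes.
    s = URL('http://www.clips.ua.ac.be').download()
    s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
    print(s)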
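
If installing Pattern is not an option, roughly the same idea can be sketched with only the standard library. This is not Pattern's implementation, just a minimal illustration built on html.parser; the names TextExtractor and html_to_text are made up for this example. It drops script and style contents and puts each text fragment on its own line, which is cruder than what a browser copy-paste would give.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(html_to_text(sample))
```

Unlike plaintext(), this sketch has no equivalent of the keep argument: every tag is discarded and only the text survives.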
    
