Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 Answers
  •  生来不讨喜
    2020-11-22 04:19

    Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient and exists only in Python 2, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console:

    # Python 2 only: htmllib was removed in Python 3
    from htmllib import HTMLParser, HTMLParseError
    from formatter import AbstractFormatter, DumbWriter
    p = HTMLParser(AbstractFormatter(DumbWriter()))
    try:
        p.feed('hello<br>there')
        p.close()  # calling close is not usually needed, but let's play it safe
    except HTMLParseError:
        print ':('  # the html is badly malformed (or you found a bug)
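
    Since htmllib is not available in Python 3, the derived-class idea mentioned above can be sketched with the standard library's html.parser instead. This is only a rough illustration, not a replacement for copying text from a browser: the names TextExtractor and html_to_text are made up for this example, and the set of tags treated as line breaks is an assumption you would likely need to tune.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text and skips <script>/<style> contents."""

    SKIPPED = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    LINE_BREAKING = {"br", "p", "div", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.LINE_BREAKING:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks).strip()


def html_to_text(html):
    """Return the text of an HTML string, without script or style contents."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("hello<br>there"))
```

    As with the htmllib version, badly malformed HTML may still produce surprising output, and whitespace is not normalized the way a browser would do it.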
