Python method to extract content (excluding navigation) from an HTML page

无人及你 2021-01-31 23:13

Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content

5 answers
  •  暖寄归人
    2021-01-31 23:56

    What is meaningful and what is not depends on the semantics of the page. If the semantics are crappy, your code won't "guess" what is meaningful. I use readability, which you linked in the comment, and on many of the pages I try to read it returns no result at all, let alone a decent one.

    If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.

    If you want to do it, go with a regexp on

    , or parse the DOM.
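
For the "parse the DOM" route, here is a minimal sketch using only the standard library's `html.parser`. It assumes the page uses semantic tags for its non-content areas; the tag list `SKIP_TAGS` and the names `ContentExtractor` and `extract_content` are made up for illustration and are not taken from readability or any other library. As noted above, it will not help on pages that put everything in tables or plain `div`s.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold navigation or other non-content.
# Adjust the set for the pages you actually deal with.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collect text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped elements we are currently inside
        self.chunks = []     # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_content(html):
    """Return the page text outside SKIP_TAGS, one fragment per line."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `extract_content(page_source)` on a page with a `<nav>` menu and a `<footer>` should return only the text outside those elements. It is a heuristic filter, not a guess at what is "meaningful".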
