How can I extract only the main textual content from an HTML page?

旧巷少年郎 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with s

9 answers
  •  感动是毒
    2021-01-31 05:39

    Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
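
    To make "removing clutter" concrete, here is a minimal toy sketch of one common heuristic: drop blocks that are very short or consist mostly of link text. This is not Boilerpipe's implementation. The class name, the regex-based block splitting, and both thresholds are assumptions made up for illustration, and the regexes only cope with flat, non-nested blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy link-density filter. Hypothetical illustration, NOT Boilerpipe's algorithm. */
class LinkDensitySketch {

    // Assumed threshold: drop a block if more than a third of its words are link text.
    private static final double MAX_LINK_DENSITY = 0.33;
    // Assumed threshold: drop blocks with fewer words than this.
    private static final int MIN_WORDS = 5;

    // Naive patterns; they assume flat, non-nested block elements.
    private static final Pattern BLOCK =
            Pattern.compile("(?is)<(p|div|li|td|h[1-6])\\b[^>]*>(.*?)</\\1>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    /** Counts whitespace-separated words; an empty string has zero words. */
    public static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    /** Replaces tags with spaces and collapses runs of whitespace. */
    public static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    /** Returns the text of blocks that are long enough and not dominated by links. */
    public static List<String> mainBlocks(String html) {
        List<String> kept = new ArrayList<>();
        Matcher block = BLOCK.matcher(html);
        while (block.find()) {
            String inner = block.group(2);
            String text = stripTags(inner);
            int words = countWords(text);
            int linkWords = 0;
            Matcher anchor = ANCHOR.matcher(inner);
            while (anchor.find()) {
                linkWords += countWords(stripTags(anchor.group(1)));
            }
            // The length check runs first, so the division never sees words == 0.
            if (words >= MIN_WORDS && (double) linkWords / words <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html =
                "<div><a href='/'>Home</a> <a href='/about'>About</a> <a href='/c'>Contact</a></div>"
              + "<p>Boilerplate removal keeps the long, link-poor paragraphs of a page.</p>"
              + "<p><a href='/x'>Read more related stories here today</a></p>";
        // Only the middle paragraph survives: the menu is too short, the teaser is all link text.
        mainBlocks(html).forEach(System.out::println);
    }
}
```

    A real extractor works on a parsed document rather than regexes and combines more features, so treat this only as an intuition for why navigation menus and "read more" teasers get filtered out.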

    There are a few ways to feed HTML into Boilerpipe and get the extracted text back.

    You can use a URL (a java.net.URL, which Boilerpipe fetches for you):

    String text = ArticleExtractor.INSTANCE.getText(url);
    

    You can use a String that already holds the HTML:

    String text = ArticleExtractor.INSTANCE.getText(myHtml);
    

    You can also pass a java.io.Reader, which lets you feed in HTML from almost any source, such as a file or a network stream.
