How can I extract only the main textual content from an HTML page?

旧巷少年郎 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with s

9 Answers
  •  無奈伤痛
    2021-01-31 05:43

    You can use a library such as Goose, which works best on articles and news pages. You can also look at the JavaScript code of the Readability bookmarklet, which performs a similar extraction to Goose.
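
To give a rough feel for the kind of heuristic such extractors rely on, here is a minimal sketch that uses only the Python standard library. It is not Goose's or Readability's actual algorithm: it simply keeps `<p>` elements that are long enough and not dominated by link text. The names `ParagraphCollector` and `extract_main_text`, and the thresholds `MIN_CHARS` and `MAX_LINK_DENSITY`, are assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Regions that rarely hold main content (an assumption, not an exhaustive list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
MIN_CHARS = 40          # hypothetical threshold: shorter paragraphs are dropped
MAX_LINK_DENSITY = 0.5  # hypothetical threshold: drop link-heavy paragraphs


class ParagraphCollector(HTMLParser):
    """Collects (text, link_chars, raw_chars) for each <p> outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self.link_depth = 0   # > 0 while inside an <a> element
        self.in_paragraph = False
        self.chunks = []
        self.link_chars = 0
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self._flush()  # an unclosed previous <p> ends here
            self.in_paragraph = True
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.chunks.append(data)
            if self.link_depth:
                self.link_chars += len(data)

    def _flush(self):
        """Stores the current paragraph, if any, and resets the buffers."""
        if self.in_paragraph:
            raw = "".join(self.chunks)
            text = " ".join(raw.split())  # collapse runs of whitespace
            if text:
                self.paragraphs.append((text, self.link_chars, len(raw)))
        self.in_paragraph = False
        self.chunks = []
        self.link_chars = 0


def extract_main_text(html):
    """Returns the kept paragraphs joined by blank lines."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # handle a final <p> that was never closed
    kept = [
        text
        for text, link_chars, raw_chars in collector.paragraphs
        if len(text) >= MIN_CHARS and link_chars / raw_chars <= MAX_LINK_DENSITY
    ]
    return "\n\n".join(kept)
```

Calling `extract_main_text(html)` on a page's source returns the surviving paragraphs as plain text. On pages that consist mostly of links, as described in the question's update, this filter would discard almost everything, so the thresholds would need tuning or a different approach.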
