How can I extract only the main textual content from an HTML page?

旧巷少年郎 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with s

9 answers
  •  自闭症患者
    2021-01-31 05:30

    You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:

    TagSoup

    HtmlUnit
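As a rough illustration of what the simplest kind of scraper does, here is a minimal sketch that uses only Python's standard `html.parser` module. It is not how Boilerpipe, TagSoup, or HtmlUnit work internally. It keeps every visible text node and drops the contents of `script`, `style`, and similar tags, so it does not try to tell an article apart from navigation links. The names `VisibleTextExtractor` and `extract_visible_text` are made up for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping elements that are never rendered."""

    # Tags whose contents should not appear in the extracted text.
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []     # cleaned text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and ignore empty fragments.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_visible_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p { color: red; }</style></head>"
        "<body><script>var x = 1;</script>"
        "<p>Hello <b>world</b></p><a href='/next'>Next page</a></body></html>"
    )
    print(extract_visible_text(sample))
```

Because link text is kept along with everything else, this kind of extraction may suit pages that consist mostly of links, whereas a boilerplate-removal tool would likely discard them.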
