Text extraction with Java HTML parsers

Posted by 你 on 2019-11-28 12:55:03

Question


I want to use an HTML parser that does the following in a nice, elegant way:

  1. Extract text (this is most important)
  2. Extract links, meta keywords
  3. Reconstruct original doc (optional but nice feature to have)

From my investigation so far, Jericho seems to fit. Are there any other open source libraries you would recommend?


Answer 1:


I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.

I would recommend CyberNekoHtml. CyberNekoHtml can do all of the things you mentioned. It is very easy to extract a list of all elements, and their attributes, for example. It would be possible to traverse the DOM tree building each element back into HTML if you wanted to reconstruct the page.
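
As a rough sketch of the traversal described above: CyberNekoHtml's DOM parser hands back a standard `org.w3c.dom.Document`, so the extraction logic can be written against the W3C DOM API alone. To keep the example runnable without the library on the classpath, it assumes well-formed XHTML input and builds the `Document` with the JDK's own XML parser as a stand-in for the HTML parser. The class name `DomExtractSketch` and its method names are made up for illustration and are not part of any library.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomExtractSketch {

    // Stand-in for the HTML parser (assumption): only handles well-formed XHTML.
    static Document parseWellFormed(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Returns the first element with the given tag name, ignoring case, or null.
    static Element firstElement(Document doc, String tag) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.getTagName().equalsIgnoreCase(tag)) {
                return e;
            }
        }
        return null;
    }

    // 1. Visible text: text content of <body>, with whitespace collapsed.
    static String text(Document doc) {
        Element body = firstElement(doc, "body");
        Element root = (body != null) ? body : doc.getDocumentElement();
        return root.getTextContent().replaceAll("\\s+", " ").trim();
    }

    // 2a. Links: the href attribute of every <a> element, in document order.
    static List<String> links(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.getTagName().equalsIgnoreCase("a") && e.hasAttribute("href")) {
                hrefs.add(e.getAttribute("href"));
            }
        }
        return hrefs;
    }

    // 2b. Meta keywords: content of the first <meta name="keywords">, or "".
    static String metaKeywords(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.getTagName().equalsIgnoreCase("meta")
                    && "keywords".equalsIgnoreCase(e.getAttribute("name"))) {
                return e.getAttribute("content");
            }
        }
        return "";
    }

    // 3. Reconstruction: serialize the DOM tree back to markup.
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Demo</title>"
                + "<meta name=\"keywords\" content=\"java, html, parser\"/></head>"
                + "<body><p>Hello <a href=\"http://example.com/\">world</a>!</p></body></html>";
        Document doc = parseWellFormed(page);
        System.out.println(text(doc));
        System.out.println(links(doc));
        System.out.println(metaKeywords(doc));
        System.out.println(serialize(doc));
    }
}
```

Tag names are compared case-insensitively because an HTML parser may normalize element-name case differently from the input. With a real HTML parser, only `parseWellFormed` would need to be swapped out; the serialized output is equivalent markup, not necessarily byte-for-byte identical to the original document.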

There's a list of open source Java HTML parsers here: http://java-source.net/open-source/html-parsers




Answer 2:


I would definitely go for jsoup.

It is a very elegant library and does exactly what you need.

See Example Here




Answer 3:


I ended up using HtmlCleaner http://htmlcleaner.sourceforge.net/ for something similar. It's really easy to use and was quick for what I needed.



Source: https://stackoverflow.com/questions/2609948/text-extraction-with-java-html-parsers
