HTML data extract in Java

Submitted by 微笑、不失礼 on 2019-12-02 04:30:51
ryanprayogo

As described in this post, you should not be using regex to parse HTML.

Use an XML/HTML parser instead.

Assuming the HTML is well formed, you can parse it using HtmlUnit.

You could also write your own regular expression to process the page if there is just a single table, but I would strongly recommend against it: a regular expression might give strange results if the page later added more tables. With HtmlUnit you can verify that the page has only a single table before you start parsing, or simply target the table you want.
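The "check that there is exactly one table, then read it" idea can be sketched without any third-party dependency. Note that this is not HtmlUnit: it swaps in the JDK's built-in DOM parser, which only works if the page is well-formed XHTML. The class name `SingleTableExtractor` and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: assumes the input is well-formed XHTML.
public class SingleTableExtractor {

    /** Returns the cell text of the page's only table, row by row. */
    public static List<List<String>> extractRows(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Refuse to continue if the page layout is not what we expect.
        NodeList tables = doc.getElementsByTagName("table");
        if (tables.getLength() != 1) {
            throw new IllegalStateException(
                    "Expected exactly one table, found " + tables.getLength());
        }

        List<List<String>> rows = new ArrayList<>();
        NodeList trs = ((Element) tables.item(0)).getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes between cells
                }
                String name = child.getNodeName();
                if (name.equals("td") || name.equals("th")) {
                    cells.add(child.getTextContent().trim());
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><table>"
                + "<tr><th>Name</th><th>Qty</th></tr>"
                + "<tr><td>Apple</td><td>3</td></tr>"
                + "</table></body></html>";
        System.out.println(extractRows(page)); // prints [[Name, Qty], [Apple, 3]]
    }
}
```

If the page ever gains a second table, the method fails loudly instead of silently returning the wrong data, which is the main advantage over a regular expression here. Real-world pages are often not well formed, in which case a tolerant HTML parser is the better choice.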

http://htmlcleaner.sourceforge.net/

http://jsoup.org/

http://jericho.htmlparser.net/docs/index.html

are well-known HTML parsers for Java. You can use any of them.
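As a rough illustration of what this looks like with one of them, here is a minimal jsoup sketch. It assumes the jsoup jar is on the classpath; the class name `JsoupTableExample` and the sample markup are made up for illustration, and it simply takes the first table on the page.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; requires the jsoup library.
public class JsoupTableExample {

    /** Returns the cell text of the first table in the page, row by row. */
    public static List<List<String>> extractRows(String html) {
        Document doc = Jsoup.parse(html);

        // select() takes a CSS selector; first() returns null if nothing matched.
        Element table = doc.select("table").first();
        if (table == null) {
            throw new IllegalStateException("No table found in the page");
        }

        List<List<String>> rows = new ArrayList<>();
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Closing </td> and </tr> tags are omitted on purpose: jsoup is
        // designed to cope with this kind of real-world markup.
        String page = "<table><tr><th>Name<th>Qty<tr><td>Apple<td>3</table>";
        System.out.println(extractRows(page));
    }
}
```

Unlike an XML parser, jsoup does not require the markup to be well formed, so it is usually the easier starting point for scraped pages. Nested tables would need a more specific selector than the one used here.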
