HTML data extract in Java

Posted by 可紊 on 2019-12-02 15:13:47

Question


I have HTML code similar to this:

<tr><td >1    </td>
<td class="tab-links">Value 1</td>
</tr>
<tr><td >2    </td>
<td class="tab-links">Value 2</td>
</tr>
<tr><td >3    </td>
<td class="tab-links">Value 3</td>
</tr>
<tr><td >4    </td>
<td class="tab-links">Value 4</td>
</tr>

Now I want to extract the data as follows:

1 : Value 1
2 : Value 2
3 : Value 3
4 : Value 4

Any ideas?


Answer 1:


As described in this post, you should not be using regex to parse HTML.

Use an XML/HTML parser instead.
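
To illustrate the parser approach, here is a minimal sketch that uses only the XML parser bundled with the JDK. It assumes the markup is well-formed XML, as the sample rows in the question happen to be; real-world HTML often is not, and a dedicated HTML parser would be the safer choice there. The class name `RowExtractor` and the method `extractRows` are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
final class RowExtractor {

    // Returns one "number : value" string per <tr> that has at least two <td> cells.
    static List<String> extractRows(String rowsMarkup) {
        // The fragment has several top-level <tr> elements, so wrap it in a
        // single root element to turn it into a well-formed XML document.
        String xml = "<table>" + rowsMarkup + "</table>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> result = new ArrayList<>();
            NodeList rows = doc.getElementsByTagName("tr");
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = ((Element) rows.item(i)).getElementsByTagName("td");
                if (cells.getLength() < 2) {
                    continue; // skip rows that do not match the expected shape
                }
                // trim() drops the padding inside cells such as "<td >1    </td>"
                String number = cells.item(0).getTextContent().trim();
                String value = cells.item(1).getTextContent().trim();
                result.add(number + " : " + value);
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption: markup that is not well-formed is treated as bad input.
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html =
                "<tr><td >1    </td>\n<td class=\"tab-links\">Value 1</td>\n</tr>\n"
              + "<tr><td >2    </td>\n<td class=\"tab-links\">Value 2</td>\n</tr>";
        for (String line : extractRows(html)) {
            System.out.println(line);
        }
    }
}
```

Running `main` on the two sample rows prints `1 : Value 1` and `2 : Value 2`, each on its own line.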




Answer 2:


Assuming the HTML is well formed, you can parse it using HtmlUnit.

You could also write your own regular expression to process the page if there is just a single table, but I would strongly recommend against it: regular expressions might give strange results if additional tables were added to the page. With HtmlUnit you could validate that the page has only a single table before you start to parse, or just target the table you want.
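
The "validate that there is only one table" idea can be sketched without any extra dependency. Note that this does not use HtmlUnit: as a stand-in, it uses the JDK's own DOM parser and XPath, so it only works on well-formed (XHTML-like) pages. The class `SingleTableReader` and the method `readOnlyTable` are hypothetical names for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; a stand-in for the HtmlUnit approach, not HtmlUnit itself.
final class SingleTableReader {

    // Refuses pages that do not contain exactly one <table>, then returns
    // one "number : value" string per row of that table.
    static List<String> readOnlyTable(String page) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(page)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Validate the page layout before extracting anything.
            int tables = ((Double) xpath.evaluate(
                    "count(//table)", doc, XPathConstants.NUMBER)).intValue();
            if (tables != 1) {
                throw new IllegalArgumentException(
                        "expected exactly one table, found " + tables);
            }

            List<String> result = new ArrayList<>();
            NodeList rows = (NodeList) xpath.evaluate(
                    "//table//tr", doc, XPathConstants.NODESET);
            for (int i = 0; i < rows.getLength(); i++) {
                // normalize-space() trims the cell text and collapses inner whitespace
                String number = xpath.evaluate("normalize-space(td[1])", rows.item(i));
                String value = xpath.evaluate("normalize-space(td[2])", rows.item(i));
                result.add(number + " : " + value);
            }
            return result;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            // Assumption: parser failures are reported as an unchecked exception.
            throw new IllegalStateException("could not parse the page", e);
        }
    }
}
```

To target one specific table instead of rejecting the page, the row expression could be narrowed, for example to `//table[1]//tr`; which table to pick depends on the actual page.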




Answer 3:


http://htmlcleaner.sourceforge.net/

http://jsoup.org/

http://jericho.htmlparser.net/docs/index.html

These are well-known HTML parsers for Java. You can use any of them.



Source: https://stackoverflow.com/questions/6496134/html-data-extract-in-java
