HTML data extract in Java

Question

I have HTML similar to this:

<tr><td >1    </td>
<td class="tab-links">Value 1</td>
</tr>
<tr><td >2    </td>
<td class="tab-links">Value 2</td>
</tr>
<tr><td >3    </td>
<td class="tab-links">Value 3</td>
</tr>
<tr><td >4    </td>
<td class="tab-links">Value 4</td>
</tr>

I want to extract the data in the following form:

1 : Value 1
2 : Value 2
3 : Value 3
4 : Value 4

Any ideas?

Answer 1:

As described in this post, you should not be using regex to parse HTML.

Use an XML/HTML parser instead.
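
As a rough illustration of the parser approach, here is a minimal sketch using the XML DOM parser that ships with the JDK. It assumes the rows are well-formed XML once wrapped in a single root element, which holds for the snippet in the question but not for arbitrary real-world HTML. The class name `RowExtractor` and the method `extractRows` are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class RowExtractor {

    // Returns one "number : value" string per <tr>, in document order.
    static List<String> extractRows(String rowsHtml) {
        // Assumption: the fragment is well-formed XML once it has a single root element.
        String xml = "<table>" + rowsHtml + "</table>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> result = new ArrayList<>();
            NodeList rows = doc.getElementsByTagName("tr");
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = ((Element) rows.item(i)).getElementsByTagName("td");
                if (cells.getLength() < 2) {
                    continue; // skip rows that do not have both cells
                }
                // trim() removes the padding after the number, as in "1    "
                String number = cells.item(0).getTextContent().trim();
                String value = cells.item(1).getTextContent().trim();
                result.add(number + " : " + value);
            }
            return result;
        } catch (Exception e) {
            // Malformed input ends up here; a lenient HTML parser would be needed for that.
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }

    public static void main(String[] args) {
        String html =
              "<tr><td >1    </td>\n<td class=\"tab-links\">Value 1</td>\n</tr>\n"
            + "<tr><td >2    </td>\n<td class=\"tab-links\">Value 2</td>\n</tr>\n"
            + "<tr><td >3    </td>\n<td class=\"tab-links\">Value 3</td>\n</tr>\n"
            + "<tr><td >4    </td>\n<td class=\"tab-links\">Value 4</td>\n</tr>\n";
        for (String line : extractRows(html)) {
            System.out.println(line);
        }
    }
}
```

Run against the rows from the question, this prints the four lines `1 : Value 1` through `4 : Value 4`. For pages that are not well-formed, a lenient HTML parser library is the safer choice.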

Answer 2:

Assuming the HTML is well formed, you can parse it using HtmlUnit.

You could also write your own regular expression to process the page if there is just a single table, but I would strongly recommend against it: a regular expression might give strange results if additional tables were added to the page. With HtmlUnit you could verify that the page has only a single table before you start parsing, or just target the table you want.
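
The "verify there is a single table, then read its rows" idea can be sketched as follows. This is not HtmlUnit code: to stay dependency-free it uses the XPath API bundled with the JDK as a stand-in, and it assumes the whole page is well-formed XML. The class name `TableTargeting` and the method `extractFromOnlyTable` are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of HtmlUnit or any other library.
class TableTargeting {

    // Fails fast unless the page contains exactly one <table>, then reads its rows.
    static List<String> extractFromOnlyTable(String pageXml) {
        try {
            // Assumption: the page is well-formed XML (every tag closed, one root element).
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pageXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            double tableCount = (Double) xpath.evaluate("count(//table)", doc, XPathConstants.NUMBER);
            if (tableCount != 1) {
                throw new IllegalStateException("Expected exactly one table, found " + (int) tableCount);
            }

            List<String> result = new ArrayList<>();
            NodeList rows = (NodeList) xpath.evaluate("//table/tr", doc, XPathConstants.NODESET);
            for (int i = 0; i < rows.getLength(); i++) {
                // Both expressions are evaluated relative to the current <tr>.
                String number = xpath.evaluate("td[1]", rows.item(i)).trim();
                String value = xpath.evaluate("td[@class='tab-links']", rows.item(i)).trim();
                result.add(number + " : " + value);
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse or query the page", e);
        }
    }
}
```

If the page later gains a second table, the method throws instead of silently mixing rows from both, which is the failure mode a regular expression would be prone to. To pick one table among several, the `//table/tr` expression could be narrowed with a predicate on an attribute the target table actually carries.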

Answer 3:

http://htmlcleaner.sourceforge.net/

http://jsoup.org/

http://jericho.htmlparser.net/docs/index.html

These are well-known HTML parsers for Java. You can use any of them.

Source: https://stackoverflow.com/questions/6496134/html-data-extract-in-java

Tags

java

html-helper
