Question
I have HTML similar to the following:
<tr><td >1 </td>
<td class="tab-links">Value 1</td>
</tr>
<tr><td >2 </td>
<td class="tab-links">Value 2</td>
</tr>
<tr><td >3 </td>
<td class="tab-links">Value 3</td>
</tr>
<tr><td >4 </td>
<td class="tab-links">Value 4</td>
</tr>
I want to extract the data in the following form:
1 : Value 1
2 : Value 2
3 : Value 3
4 : Value 4
Any ideas?
Answer 1:
As described in this post, you should not be using regex to parse HTML.
Use an XML/HTML parser instead.
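As a minimal sketch of the parser approach, the rows from the question can be read with the XML parser that ships with the JDK. This assumes the fragment is well-formed XML once wrapped in a single root element, which holds for the sample rows but not for arbitrary real-world HTML; for that, a dedicated HTML parser is the safer choice. The class name `RowExtractor` and the method `extractRows` are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class RowExtractor {

    // Returns one "number : value" string per <tr> that has at least two <td> cells.
    static List<String> extractRows(String rowsHtml) {
        // Assumption: the rows are well-formed XML. Wrap them in one root
        // element so the parser sees a single document.
        String xml = "<table>" + rowsHtml + "</table>";
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        List<String> result = new ArrayList<>();
        NodeList rows = doc.getElementsByTagName("tr");
        for (int i = 0; i < rows.getLength(); i++) {
            NodeList cells = ((Element) rows.item(i)).getElementsByTagName("td");
            if (cells.getLength() < 2) {
                continue; // skip rows without both a number cell and a value cell
            }
            String number = cells.item(0).getTextContent().trim();
            String value = cells.item(1).getTextContent().trim();
            result.add(number + " : " + value);
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
              "<tr><td >1 </td>\n<td class=\"tab-links\">Value 1</td>\n</tr>\n"
            + "<tr><td >2 </td>\n<td class=\"tab-links\">Value 2</td>\n</tr>\n";
        for (String line : extractRows(html)) {
            System.out.println(line);
        }
    }
}
```

The cell text is trimmed because the first cell in the sample contains a trailing space ("1 ").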
Answer 2:
Assuming the HTML is well formed, you can parse it using HtmlUnit.
You could also write your own regular expression to process the page if it contains just a single table, but I would strongly recommend against this. A regular expression might give strange results if additional tables were later added to the page, whereas with HtmlUnit you can either verify that the page has only a single table before you start parsing, or target just the table you want.
Answer 3:
http://htmlcleaner.sourceforge.net/
http://jsoup.org/
http://jericho.htmlparser.net/docs/index.html
are well-known HTML parsers for Java. You can use any of them.
Source: https://stackoverflow.com/questions/6496134/html-data-extract-in-java