Parse Web Site HTML with Java [duplicate]

十年热恋 submitted on 2019-11-26 11:17:59

There is a much easier way to do this: I suggest using JSoup. With JSoup you can fetch a page and query it with CSS selectors, like this:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

Elements body = doc.select("body");

Or if you want all links:

Elements links = doc.select("body a");

You no longer need to open connections or handle streams yourself. If you have ever used jQuery, the selector syntax will feel very familiar.
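For anyone who wants to try the selector approach without hitting the network, here is a minimal sketch that parses an HTML string and collects the link targets. The class name LinkExtractor, the helper extractLinks, and the sample markup are made up for illustration; it assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup itself.
public class LinkExtractor {

    // Returns the absolute URL of every <a> element that has an href attribute.
    // baseUri is used to resolve relative links such as "/wiki/Java".
    static List<String> extractLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Elements links = doc.select("a[href]");
        List<String> urls = new ArrayList<>();
        for (Element link : links) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption for this sketch); note the unclosed <p>.
        String html = "<html><body><p>Unclosed paragraph"
                + "<a href=\"/wiki/Java\">Java</a>"
                + "<a href=\"https://jsoup.org/\">jsoup</a></body></html>";
        for (String url : extractLinks(html, "https://en.wikipedia.org/")) {
            System.out.println(url);
        }
    }
}
```

To work on a live page instead, swap Jsoup.parse(html, baseUri) for Jsoup.connect(url).get() as in the snippets above; the select and absUrl calls stay the same.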

Definitely JSoup is the answer. ;-)

HTML is not always valid, well-formed XML, so an XML parser will often choke on real-world pages. Try a dedicated HTML parser instead. There are a number of them available:

http://java-source.net/open-source/html-parsers
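To see why an XML parser is a poor fit, here is a small sketch using only the JDK's built-in DOM parser. The class name XmlVsHtml, the helper isWellFormedXml, and the two sample strings are invented for this illustration; the point is just that an unclosed tag, which browsers accept, makes a strict XML parser fail.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for this sketch.
public class XmlVsHtml {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of logging them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input as not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello<br/></p></body></html>";
        String html = "<html><body><p>Hello<br></body></html>"; // <br> and <p> never closed
        System.out.println(isWellFormedXml(xhtml)); // prints "true"
        System.out.println(isWellFormedXml(html));  // prints "false"
    }
}
```

An HTML parser such as JSoup accepts the second string and repairs the tree, much as a browser would.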
